I deployed a BCDR solution for one of our customers based on vSphere Replication and Site Recovery Manager (SRM) 8.x, so that they can protect the workloads running in site A and recover them to site B should site A go down.
Like most customers, and to ensure a secure deployment, they have a policy of opening only the required ports between the appliances/servers in the deployment and blocking all others.
To meet that requirement, we opened every port in the officially published network ports list for a functional SRM deployment: https://docs.vmware.com/en/Site-Recovery-Manager/8.2/com.vmware.srm.install_config.doc/GUID-499D3C83-B8FD-4D4C-AE3D-19F518A13C98.html
When you log in to the vCenter Server at either site and navigate to Home > Site Recovery, all seems to be fine (GREEN is Good).
However, if you click OPEN Site Recovery, for the HO site for example, to start the configuration and attempt to pair both sites, you will be greeted with a nice error :).
Failed to retrieve pairs from extension server at https://SRM-DR-FQDN:9086/vcdr/vmomi/sdk. Failed to connect to Site Recovery Manager Server at https://SRM-DR-FQDN:9086/vcdr/vmomi/sdk. Reason: https://SRM-DR-FQDN:9086/vcdr/vmomi/sdk invocation failed with “org.apache.http.conn.ConnectTimeoutException: Connect to SRM-DR-FQDN:9086 [SRM-DR-FQDN/x.y.z.w] failed: connect timed out”
So it looks like a blocked port is causing the connection to time out. The interesting thing is that, according to the network ports list above, port 9086 only needs to be open between the vCenter Server and the target SRM server, and between the SRM servers at both sites. We checked this, and it was open.
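A connect timeout usually points to a firewall silently dropping packets, whereas a refused connection means the host was reached but nothing is listening on the port. The following is a minimal sketch, not part of the original troubleshooting, of a TCP probe that tells the two apart. It assumes a Python 3 interpreter is available on the machine that originates the traffic; the function name probe_tcp and the 3-second timeout are my own choices.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: typically a firewall dropping the packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar problems.
        return f"error: {exc}"


# Example call (SRM-DR-FQDN is the placeholder used in the error message):
# print(probe_tcp("SRM-DR-FQDN", 9086))
```

The probe only says something about the path from the machine it runs on, so it has to be run from each source in turn rather than from an admin workstation.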
After lengthy troubleshooting with the security team, we discovered that traffic from the HO site’s vSphere Replication appliance to the DR site’s SRM server must be allowed on port 9086 and, in the reverse direction, traffic from the DR site’s vSphere Replication appliance to the HO site’s SRM server must be allowed on port 9086 as well.
Once we opened port 9086 in these two directions, the error disappeared and we were able to continue with the site pairing configuration.
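To make the request to the security team unambiguous, the two extra flows can be written down explicitly. This is only an illustrative sketch: the host names are placeholders, the helper names extra_srm_flows and format_rule are hypothetical, and the rule text is a generic human-readable format rather than the syntax of any particular firewall.

```python
from typing import List, Tuple

SRM_PORT = 9086  # port from the error message in this post

Flow = Tuple[str, str, int]  # (source, destination, TCP port)


def extra_srm_flows(ho_vr: str, ho_srm: str, dr_vr: str, dr_srm: str) -> List[Flow]:
    """Return the two flows that were missing in our deployment."""
    return [
        (ho_vr, dr_srm, SRM_PORT),  # HO vSphere Replication -> DR SRM
        (dr_vr, ho_srm, SRM_PORT),  # DR vSphere Replication -> HO SRM
    ]


def format_rule(flow: Flow) -> str:
    """Render a flow as a readable line (not real firewall syntax)."""
    source, destination, port = flow
    return f"allow tcp {source} -> {destination}:{port}"


# Example with placeholder names:
# for flow in extra_srm_flows("VR-HO", "SRM-HO", "VR-DR", "SRM-DR"):
#     print(format_rule(flow))
```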
What prompted me to write this article is that this communication on port 9086 between the vSphere Replication appliance and the target SRM server is not clearly documented in the official network ports list article: https://docs.vmware.com/en/Site-Recovery-Manager/8.2/com.vmware.srm.install_config.doc/GUID-499D3C83-B8FD-4D4C-AE3D-19F518A13C98.html
I strongly recommend that the “Network Ports for Site Recovery Manager” list be updated to state these two network flows on port 9086 clearly, to avoid confusion in future SRM deployments.
I hope this post helps others save troubleshooting time and effort when resolving this issue.
Thanks for reading,