Throttling VMware Site Recovery Manager 5

In my day job as a VMware Architect I often come across strange problems which require some internet-based research to solve. I generally assume that if I have a problem there’s a good chance that somewhere, someone else will have experienced the same or a similar problem and documented the solution. In this case I found almost nothing online, so here’s a writeup!

I’ve been building a small SRM 5 implementation (licensed for 75 VMs) for a client who uses NetApp FAS at both the primary and recovery sites, with SnapMirror replicating the data. The client has Fibre Channel disk at the primary site replicating to a single filer head with SATA disk attached at the DR site (with some additional Fibre Channel disk for VMs that are always powered on). When we first ran a simulated test we found that after the first 5 or 6 VMs had powered on, most of the remainder timed out waiting for VMware Tools to start. Looking at the VM consoles, it was obvious that the VMs were hitting a disk bottleneck. The cause of the bottleneck wasn’t quite as simple as the fact that we were recovering from SATA disk (the joys of deswizzling could form a completely separate writeup), but that was a big part of the problem.

The client has an RTO (Recovery Time Objective) of 24 hours and SRM was completing the recovery plan within 1 hour. Even allowing for the additional work that has to fit into that 24-hour period, it would be perfectly acceptable for SRM to take up to 10 hours to recover the VMs. Finishing in 1 hour came at a cost: we were seeing failed services in the Windows OS on the VMs and, because the task that waits for VMware Tools was timing out after 300 seconds, VMs in subsequent priority groups were being powered on before their prerequisite VMs had fully booted.
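To put rough numbers on that headroom, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that all 75 licensed VMs are in the recovery plan and that power-ons are spread evenly across the recovery window; neither is stated above, and the function name is hypothetical. It says nothing about how SRM actually schedules power-ons.

```python
def per_vm_budget_seconds(window_hours: float, vm_count: int) -> float:
    """Average time available per VM if power-ons were spread evenly over the window."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return window_hours * 3600 / vm_count


TOOLS_TIMEOUT_SECONDS = 300  # the VMware Tools wait timeout mentioned above
VM_COUNT = 75                # assumption: the licence ceiling, not a measured VM count

if __name__ == "__main__":
    # Compare the observed 1-hour run with the acceptable 10-hour window.
    for hours in (1, 10):
        budget = per_vm_budget_seconds(hours, VM_COUNT)
        verdict = "above" if budget > TOOLS_TIMEOUT_SECONDS else "below"
        print(f"{hours:>2} h window: {budget:.0f} s per VM ({verdict} the {TOOLS_TIMEOUT_SECONDS} s timeout)")
```

Under those assumptions a 1-hour recovery averages well under the 300-second timeout per VM, while a 10-hour window averages comfortably more, which is why slowing the plan down is a reasonable trade.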
