In my day job as a VMware Architect I often come across strange problems which require some internet-based research to solve. I generally assume that if I have a problem there’s a good chance that someone, somewhere, will have experienced the same or a similar problem and documented the solution. In this case I found almost nothing online, so here’s a writeup!
I’ve been building a small SRM 5 implementation (licensed for 75 VMs) for a client who uses NetApp FAS at both the primary and recovery sites, with SnapMirror replicating the data. The client has Fibre Channel disk at the primary site replicating to a single filer head with SATA disk attached at the DR site (with some additional Fibre Channel disk for VMs that are always powered on). When we first ran a simulated test we found that after the first 5 or 6 VMs had powered on, the majority of the remainder experienced timeouts waiting for VMware Tools to start. Looking at the VM consoles, it was obvious that the VMs were hitting a disk bottleneck. The actual cause of the bottleneck wasn’t quite as simple as the fact that we were recovering from SATA disk (this could form a completely separate writeup on the joys of deswizzling), but that was a big part of the problem.
The client has an RTO (Recovery Time Objective) of 24 hours and SRM was completing the recovery plan within 1 hour. Even considering that there is additional work to fit into that 24 hour period, it would be perfectly acceptable for SRM to take up to 10 hours to recover the VMs. Although SRM was only taking 1 hour to recover the VMs, we were seeing failed services in the Windows OS on the VMs. In addition, because the task that waits for VMware Tools was timing out after 300 seconds, VMs in subsequent priority groups were being powered on before their prerequisite VMs had fully booted.
It was obvious that we needed a method to throttle the power on of the VMs. I started by adding a timing script between the priority groups that effectively paused the recovery for a preset amount of time. However, the number of VMs in each priority group was still enough to overwhelm the storage, even with a lengthy wait between priority groups.
It was while searching for a method to more effectively throttle the power on that I came across the following articles from VMware:
Both of these posts mention the advanced option defaultMaxBootAndShutdownOpsPerCluster which can be configured in
Specifically the KB article states:
“This option specifies the maximum number or concurrent power-on operations performed by SRM for any given cluster. Guest shutdowns (but not forced power-offs) are throttled according to this value. VMware uses guest shutdowns during a primary site shutdowns (planned failover) and IP customization workflows. You can also set this option per cluster in vCenter -> DRS options: srmMaxBootShutdownOps. By default, throttling is turned off.”
One of the links suggests setting this option to 32, but without much detail on the actual effect of doing so. In my case we are not working at the limits of SRM; rather, we have very capable ESXi hosts booting VMs from I/O-limited storage.
Neither article is particularly clear on exactly what this option does so here’s an example from my experience:
For me the best way to implement this option was to add srmMaxBootShutdownOps as an advanced DRS option for the DR cluster. When srmMaxBootShutdownOps is set to 4 we get the following behaviour. When SRM starts to power on a priority group, only the number of VMs specified by srmMaxBootShutdownOps will be powered on at once. With the setting at 4, the first 4 VMs in the priority group are powered on and the remainder wait until VMware Tools on one of those powered-on VMs either responds or times out. As each VM completes its power on another will start, so that there are always 4 VMs powering on until the group is exhausted.
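To make the sliding-window behaviour described above concrete, here is a minimal Python sketch that models it. This is only an illustration of my understanding of the throttle, not SRM’s actual implementation: the function name, its parameters and the boot times in the example are all hypothetical. The 300 second cap reflects the VMware Tools timeout mentioned earlier, on the assumption that a VM frees its slot when Tools responds or the wait times out, whichever comes first.

```python
import heapq


def simulate_throttled_power_on(boot_seconds, max_concurrent, tools_timeout=300):
    """Model a throttled power on of one priority group.

    boot_seconds   -- hypothetical time each VM needs before VMware Tools responds
    max_concurrent -- plays the role of srmMaxBootShutdownOps in this model
    tools_timeout  -- a VM frees its slot after this long even if Tools never responds

    Returns a list of (start, finish) times in seconds, one pair per VM,
    in the order the VMs were submitted.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    running = []   # min-heap of finish times for VMs currently occupying a slot
    schedule = []
    for boot in boot_seconds:
        if len(running) < max_concurrent:
            start = 0  # a slot is free at the start of the group
        else:
            # All slots busy: wait for the earliest VM to finish or time out.
            start = heapq.heappop(running)
        finish = start + min(boot, tools_timeout)
        heapq.heappush(running, finish)
        schedule.append((start, finish))
    return schedule


# Six VMs with made-up boot times, throttled to 4 concurrent power-ons.
example = simulate_throttled_power_on([120, 400, 90, 200, 150, 60], max_concurrent=4)
for vm_number, (start, finish) in enumerate(example, start=1):
    print(f"VM{vm_number}: starts at {start}s, slot freed at {finish}s")
```

In this model the fifth VM does not start until the quickest of the first four has finished, and the second VM (which would need 400 seconds) gives up its slot at the 300 second timeout. Lowering max_concurrent stretches the overall run time, which is exactly the trade-off being made against the storage.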
In situations where the ESXi hosts are the bottleneck rather than the shared storage, the defaultMaxBootAndShutdownOpsPerHost option, edited in vmware-dr.xml, might be a better choice.
For this client, the srmMaxBootShutdownOps option applied to the DR cluster provides a very granular throttle that allows us to slow down the SRM power on. For reference, implementing srmMaxBootShutdownOps with a value of 4 has increased our recovery plan run time from 1 hour to 2 hours, still well within the 10 hour target.