Backup in the Cloud – Part 1 – The Problem

I live in a not especially spacious flat in London and, due to my technical nature and a strong desire not to fill my flat with any more stuff, I try to store as much digitally as possible. As soon as I started to rely on computers to store important documents, photos, music etc., I realised that I needed some sort of resilience. Mirrored disks helped to protect me from disk failure; however, in 2008 I was burgled. The kids who broke in were, luckily, only interested in portable electronics, so my ancient grey server case was of no interest to them. Once I realised that this grey box contained irreplaceable data, I knew it was time to start some kind of backup routine.

With the best will in the world, there was no chance that I would remember to perform regular backups and store them with friends or family, so I started looking for something that would be automatic, reliable, secure and offsite.


Throttling VMware Site Recovery Manager 5

In my day job as a VMware Architect I often come across strange problems which require some internet-based research to solve. I generally assume that if I have a problem, there’s a good chance that someone, somewhere, will have experienced the same or a similar problem and documented the solution. In this case I found almost nothing online, so here’s a write-up!

I’ve been building a small SRM 5 implementation (licensed for 75 VMs) for a client who use NetApp FAS at both the primary and recovery sites, with SnapMirror to replicate the data. The client have Fibre Channel disk at the primary site replicating to a single filer head with SATA disk attached at the DR site (with some additional Fibre Channel disk for VMs that are always powered on). When we first ran a simulated test, we found that after the first 5 or 6 VMs had powered on, the majority of the remainder experienced timeouts waiting for VMware Tools to start. Looking at the VM consoles, it was obvious that the VMs were hitting a disk bottleneck. The actual cause of the bottleneck wasn’t quite as simple as the fact that we were recovering from SATA disk (that could form a completely separate write-up on the joys of deswizzling), but it was a big part of the problem.

The client has an RTO (Recovery Time Objective) of 24 hours and SRM was completing the recovery plan within 1 hour. Even considering that there is additional work to fit into that 24-hour period, it would be perfectly acceptable for SRM to take up to 10 hours to recover the VMs. Although SRM was only taking 1 hour to recover the VMs, we were seeing failed services in the Windows OS on the VMs and, since the task that waits for VMware Tools was timing out after 300 seconds, VMs in subsequent priority groups were being powered on before their prerequisite VMs were fully booted.
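
To put rough numbers on that headroom, here is a minimal Python sketch of the arithmetic. It assumes all 75 licensed VMs are in the recovery plan (the licence count is the only VM figure given here, so treat it as an upper bound), and the function name is purely illustrative; it is not part of SRM.

```python
def per_vm_budget_seconds(window_hours: float, vm_count: int) -> float:
    """Average wall-clock time available per VM if the recovery
    window were shared out evenly across all VMs."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return window_hours * 3600 / vm_count


# Assumption: all 75 licensed VMs are protected by the recovery plan.
VM_COUNT = 75
TOOLS_TIMEOUT_SECONDS = 300  # the VMware Tools wait timeout mentioned above

observed = per_vm_budget_seconds(1, VM_COUNT)      # plan finishing in 1 hour
acceptable = per_vm_budget_seconds(10, VM_COUNT)   # plan allowed 10 hours

print(f"Observed average per VM:   {observed:.0f} s")
print(f"Acceptable average per VM: {acceptable:.0f} s")
print(f"Fits inside Tools timeout even if serialised: "
      f"{acceptable > TOOLS_TIMEOUT_SECONDS}")
```

On those assumptions, a 1-hour plan averages 48 seconds per VM, while a 10-hour window allows 480 seconds per VM, which is longer than the 300-second VMware Tools timeout. In other words, there is plenty of room to slow the plan down without threatening the RTO.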
