Plans for a hosted TFS offering are coming along on both the business and technical fronts. I had hoped to be online by now, but I’ve been told that getting a business up and running always takes twice as long and costs twice as much as you first thought – now that’s wisdom for you!
One of the things that I had on my TODO list was to look at backup and recovery of a TFS hosted server in a virtualised environment. I’ve pretty much come to the conclusion that the only way to go with TFS (hosted or otherwise) is virtualised. While I recognise the performance hit is real, especially in relation to larger changeset operations, the operational benefits are just too compelling to ignore.
As part of my experimentation I have created a virtualised TFS environment and exposed it directly to the Internet. The virtualised environment includes an AD server, a TFS server, and a Team Build server, all running on one physical machine. Once I had the environment established I then had the challenge of figuring out how I would take backups.
I decided what I would do is write a PowerShell script which would “save state” on each of the virtual machines and copy off their differencing disk and state to another drive. The whole process takes about five minutes and disaster recovery is a (practiced) ten to fifteen minute job.
When I wrote the script I decided that should the copy fail I wouldn’t restart the virtual machine. The philosophy is that if something is broken I want it to scream loudly – and nothing screams more loudly than a user (in this case me) reporting that a system is unavailable.
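A minimal sketch of that fail-stop flow, written in Python rather than the PowerShell used for the real script. The names `backup_vms`, `save_state`, and `start_vm` are hypothetical, and the two callables stand in for whatever the virtualisation host actually provides for saving and resuming a machine; none of this is taken from the original script.

```python
import shutil
from pathlib import Path


def backup_vms(vm_dirs, dest_root, save_state, start_vm):
    """Save state on every VM, copy its files, restart only on success.

    vm_dirs maps a VM name to the folder holding its differencing disk
    and saved state. save_state and start_vm are caller-supplied
    callables (assumptions, not a real virtualisation API).
    Returns the text for the e-mail report.
    """
    for name in vm_dirs:
        save_state(name)

    try:
        for name, src in vm_dirs.items():
            # copytree raises an OSError subclass on a full disk,
            # a missing source, or an already existing destination.
            shutil.copytree(src, Path(dest_root) / name)
    except OSError as exc:
        # Deliberately do NOT restart: the outage is the alarm.
        return f"Backup FAILED, virtual machines left stopped: {exc}"

    for name in vm_dirs:
        start_vm(name)
    return "Backup succeeded, virtual machines restarted"
```

The only design choice that matters here is that the restart loop sits after the `try` block, so any copy failure skips it and the machines stay down until someone looks.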
To prove the point, I didn’t write any disk space culling mechanism into the backup script, so the drive that I was copying the differencing disks to would slowly fill up. When it filled up the first time, none of the virtual machines started up and the e-mail report from the backup script let me know what the situation was.
I then decided to write a second script, which runs a little earlier in the day, to go through and trim off some of the older backups. Actually – the first version of this script had a bug where it wasn’t doing any trimming at all, and so today I got another e-mail about the backup failing (and the system was stopped).
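A trimming pass along those lines might look like the sketch below, again in Python rather than PowerShell. It assumes each backup lives in its own folder under one root and that folder names sort chronologically (a date stamp, say); both the layout and the name `trim_backups` are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def trim_backups(backup_root, keep):
    """Delete all but the newest `keep` backup folders.

    Assumes folder names sort oldest-first (hypothetical layout).
    Returns the names that were removed so the report can show
    that trimming really happened.
    """
    folders = sorted(
        (p for p in Path(backup_root).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    # folders[:-0] would be empty, so keep == 0 is handled explicitly.
    doomed = folders[:-keep] if keep > 0 else folders
    for folder in doomed:
        shutil.rmtree(folder)
    return [p.name for p in doomed]
```

Returning the list of removed folders, rather than nothing, makes a run that silently trims nothing easy to spot in the e-mail report.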
This must all sound pretty reactive, and you would probably be horrified if someone decided to let systems go down as their warning that something in the environment is wrong. You are right – but as a safety net it is quite effective, and it is better that a system goes down because it detected that maintenance wasn’t being performed correctly than that it crashes and you lose data.
My next step is to get the Readify IT manager (also a TFS expert) to consider ways that we could identify issues before they happen. I’m pretty sure I could do it with some PowerShell scripts 🙂