After the success of “Why I chose Dell EqualLogic and not HP Lefthand” I thought I’d write another quick comparison blog.
Why did I think it appropriate to spend £8,000 of someone else’s money on licensed copies of VMware when I could have built effectively the same setup with free Microsoft offerings? It’s a good question, I think you’ll agree, but the answer may not be as satisfying as you might like!
There have actually been quite a few authors before me who have tackled this subject, and I’ve linked to a few at the end of this post who cover the issue in more technical detail than I care to here. Instead the focus that I’m going to take is business-centric: What does the business want from this hugely technical solution we’re building? Which option is going to deliver what they need most effectively?
So, as I’ve mentioned before, let’s dive back down to the original requirements: what spurred this project on, and what’s the main driving force from the business? Availability. The business wants these systems to work more often than they currently do; it’s as simple as that. The current systems fail because they’re hand-built and they’re old: disks fail, servers get stuck at RAID card screens and BIOS questions when they reboot, and when people come into work on Monday morning not everything is up.
This links into the second requirement (one of the ones that we call “non-functional”): manageability. The solution must be manageable by the people who do the managing; be it an IT Service Desk or a single guy working in an office of 4 employees, it must be the right size and the right complexity for the team managing it. In this example there’s an IT Service Desk, but it has just lost a few members of staff and could do with life being a bit easier in terms of time spent managing the systems; they’ve got the technical competence to do it all themselves, but not the time.
Bearing in mind these two overriding requirements, let’s have a quick look at the differences (including high-level technical differences) between the two approaches:
Microsoft Failover Clustering
With Microsoft Failover Clustering you have two servers sharing a role (let’s say, File Server). At any given point one machine is in charge of the role, and should it fail the other one takes over. Clients don’t see a difference when this happens and they keep accessing the same access point with no disruption. This achieves everything we want in terms of redundancy: if a single host (physical machine) fails then we’re safe, and if the operating system becomes corrupted (say, a service pack install goes wrong) then we’re also safe.
From a complexity point of view, and therefore a management point of view, all is not quite as good. For each service you need to configure two separate Windows boxes (virtual or physical), which means twice the setup time and twice the management overhead when it comes to keeping them up to date. You also need a witness (or quorum) disk to arbitrate between the two machines, which uses more resources and requires more configuration, and of course you need to configure and maintain the Microsoft cluster itself and the virtual networks that keep that system in place.
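The active/passive behaviour described above can be sketched in a few lines of Python. This is only a toy model to show the idea of one role, two nodes and a stable access point; the class names, node names and the "fileserver" access point are all made up for illustration and have nothing to do with the real Microsoft clustering APIs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class ClusteredRole:
    """A role (e.g. a file server) that clients reach at one access point."""
    access_point: str   # hypothetical name the clients connect to
    nodes: list         # the two (or more) Windows boxes sharing the role
    owner_index: int = 0

    @property
    def owner(self):
        return self.nodes[self.owner_index]

    def fail_owner(self):
        """Mark the current owner as failed and hand the role to a healthy node."""
        self.owner.healthy = False
        for i, node in enumerate(self.nodes):
            if node.healthy:
                self.owner_index = i
                return node
        raise RuntimeError("no healthy node left to own the role")


role = ClusteredRole("fileserver", [Node("node-a"), Node("node-b")])
owner_before = role.owner.name
role.fail_owner()
owner_after = role.owner.name

# The owner changes, but the name the clients use does not.
print(owner_before, "->", owner_after, "via", role.access_point)
```

Notice that the model needs two fully configured nodes before it can do anything useful, which is exactly the management overhead complained about above.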
VMware Standard
Yes, specifically Standard, because it’s the first version that gives you VMware High Availability (HA). With HA you have a single server offering a role; if the (physical) host fails, the machine restarts automatically on another host, giving the users a few minutes of downtime while the server starts up again. If a service pack install goes wrong and you corrupt your OS then you’re screwed: you’ve lost it, unless you had the foresight to create a snapshot and test the new service pack before rolling it out to production.
From a complexity point of view, though, everything’s the other way round: there’s only one machine to set up and configure, only one set of disks in the file server to add (no witness disks or multiple accesses to set up) and creating the HA cluster is as easy as “right-click, add” for your (two or more) physical hosts. Additionally, VMware offers a huge advantage in terms of ongoing management: vCenter gives you a concise central administration console for your entire infrastructure and server estate.
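For comparison, here is an equally rough Python sketch of the HA behaviour described above: one virtual machine, restarted on a surviving host after a host failure. Again this is a toy model and not the VMware API; the host names are invented and the `boot_minutes` default of 3 is just a stand-in for “a few minutes”, not a measured figure.

```python
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    name: str
    host: str
    os_intact: bool = True


def restart_after_host_failure(vm, hosts, failed_host, boot_minutes=3):
    """Restart the VM on a surviving host and return the downtime in minutes.

    boot_minutes is an illustrative assumption, not a real measurement.
    """
    if vm.host != failed_host:
        return 0  # the VM was not on the failed host, so nothing happens
    survivors = [h for h in hosts if h != failed_host]
    if not survivors:
        raise RuntimeError("no surviving host to restart on")
    vm.host = survivors[0]
    # A restart boots the same disks, so it does nothing for a corrupted OS:
    # vm.os_intact is deliberately left untouched.
    return boot_minutes


vm = VirtualMachine("fileserver", host="esx-1")
downtime = restart_after_host_failure(vm, ["esx-1", "esx-2"], failed_host="esx-1")
print(vm.host, "after", downtime, "minutes of downtime")
```

There is only one machine in this model, which is the whole point: less to configure, at the price of a short outage and no protection against a broken operating system.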
Comparing the two…
On the surface of it, it would seem (and indeed my personal initial reaction was), “Gee, the Microsoft offering seems better, it’s free and it does MORE in terms of availability than VMware Standard does”. And that’s right; in fact, to get the same level of uptime you’d have to buy the next version up of VMware, which supports vMotion.
However, the ability to manage all my hosts and virtual machines from a single interface was a big appeal to me. Knowing that other systems can be added to the same administration console as they are upgraded, simplifying the management of the entire estate into one window, is what really sold VMware to the business; that mattered more to them than a couple of minutes of downtime if a host failed.
On top of those fundamental reasons there’s also a shift in the way that high availability architectures are being built (see link at the end). Although Microsoft’s clustering (failover or otherwise) has been around for ages, their current push on HA is very new, not that well documented, and few people have much experience with it, whereas VMware are established in the marketplace. No-one knows which way Microsoft will go with this (they could give up, try to buy out VMware, or try to match the functionality) but at the time of writing VMware are still a safe bet in comparison and the community support for them is considerably stronger.
It should come as no surprise (at least from the title of this post) that I ended up going with VMware in this instance. That might not be the case if I repeat the exercise a year from now, but that’s what happened this time.
Read also: