Public Cloud Anxiety – Dealing with maintenance & failures of VMs

I had an interesting call with a partner this week. The partner’s client did not want their servers (virtual machines, aka VMs) deployed to the public cloud (in this case Azure) because of potential downtime. Their fundamental concern was planned maintenance, where the host systems are patched and rebooted as needed. This is one of those good and bad things. The good: the host systems are patched and up to date, which helps security and overall system health. The bad: your VM could be down for a short period of time, but only if you run a single instance!

Back when I used to run systems locally, the uptime of a single server was a matter of pride. We had Windows & Linux boxes that were literally up for years without ever being updated or restarted. However, like many things in life, now I know better ;)

The good news is you can get the best of both worlds. It does require a little planning, a little research and a little more money, but it’s well worth it if you are looking for an SLA in the 99.95% range or more.
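To put a number like 99.95% in perspective, here is a minimal sketch that turns an SLA percentage into a downtime budget. The function name and the 30-day month are my own assumptions for illustration, not how any provider officially measures its SLA.

```python
def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period of `days` days at the given SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    minutes = downtime_budget_minutes(sla)
    print(f"{sla}% -> {minutes:.1f} minutes per 30-day month")
```

At 99.95% that works out to roughly 21.6 minutes per 30-day month, which is not much room for a single VM sitting on a host that needs a reboot.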

Multiple VMs

The first thing you need to decide is whether you can and should run multiple virtual machines (VMs). The answer should almost always be yes, at least if you care about availability, scalability and fault tolerance. For example, do you care if your web site, web app or web api goes down? If you answer “yes” then you need to run multiple instances. Why? Because all cloud providers use commodity hardware that can fail. These host servers are where your VMs run, so if the host server fails or undergoes scheduled maintenance, your VM goes down with it. Now I’m not saying this happens a lot, quite the contrary. However, even if you were hosting your servers yourself on premises or in a co-location facility, it’s a best practice to have either backup hardware or multiple active servers in case one of the systems needs to be fixed, patched or rebooted, or has a hardware failure.

Availability Sets (AS)

Presuming you are a smarty-pants (and I trust that you are) you might be thinking “Yeah ok, I have 2 VMs for my app, but how do I ensure they are on different host servers or different racks so they are fault tolerant?”, to which I reply “nice thinking!”. This is where Availability Sets (AS), Fault Domains (FD) and Update Domains (UD) come into play.

Availability Sets (AS) are logical containers for servers that fill the same role, such as your website, web api, web app, SQL Server etc. If you have 2 web servers and 4 api servers then you will want to put the 2 web servers in one availability set, and put the 4 api servers in another, separate availability set. Similarly, if you had master/slave SQL database servers, you’d put them in their own AS as well.

Why? This quite literally tells Azure “Hey, if you are going to reboot or patch host systems, only do it to one server at a time in each availability set and make sure it comes back on before moving on to the next one. Also make sure these systems are on sufficiently different hardware to survive a fault”. This way you always have at least one system up and running in each set (more if you have more than 2 systems in the set).

Fault Domains (FD) & Update Domains (UD)

Ok so you logically grouped your servers into appropriate Availability Sets, but how does that really help with hardware failures and updating? Virtual Machines in an availability set are automatically placed into Fault Domains (FD) & Update Domains (UD).

A Fault Domain (FD) ensures that the VM is on a host with separate power & network infrastructure. This means if the host machine dies, the power supply fails or the network stack fails in one FD, systems in another FD will remain operational. You can configure 2 or 3 FDs per AS, and Azure will assign your VMs to the FDs in sequential order.

An Update Domain (UD) determines which host systems can have scheduled maintenance applied to them (updated/patched/rebooted) at the same time. By default, resources in the same availability set are spread across 5 UDs so that as many systems as possible stay up during planned maintenance (and always one or more, as long as the AS has at least 2 VMs in it).
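The sequential assignment described above can be sketched as a simple round-robin. This is only a model of the behaviour as I have described it: the `Placement` type and `place_vms` function are hypothetical names of mine, not part of any Azure SDK, and the real placement is decided by Azure.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    vm_index: int        # order in which the VM was added to the availability set
    update_domain: int
    fault_domain: int


def place_vms(vm_count: int, ud_count: int = 5, fd_count: int = 3) -> list[Placement]:
    """Model sequential (round-robin) assignment of VMs to UDs and FDs."""
    return [Placement(i, i % ud_count, i % fd_count) for i in range(vm_count)]


# Four VMs in one availability set with the defaults of 5 UDs and 3 FDs.
for placement in place_vms(4):
    print(placement)
```

Notice that the fourth VM wraps around to FD 0 while still getting its own UD, because there are more UDs than FDs.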

Putting it all together

Ok so let’s look at a pretty typical real-world example where we have some web servers that are load balanced, some api servers that are also load balanced and a master/slave pair of database servers.

Infrastructure

  • 2 web servers
  • 4 api servers
  • 2 database servers (master, slave)

Setup

Given the above infrastructure we need 3 Availability Sets.

  • AS #1 called “www” that will contain all our web servers
  • AS #2 called “api” that will contain all our API servers
  • AS #3 called “db” that will contain all our database servers

Keeping in mind that each AS has 5 UDs & 3 FDs, this is what you would end up with:

VM  Role        AS   UD  FD
1   web server  www  0   0
2   web server  www  1   1
3   api server  api  0   0
4   api server  api  1   1
5   api server  api  2   2
6   api server  api  3   0
7   db server   db   0   0
8   db server   db   1   1
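As a sanity check on the layout above, here is a small sketch that takes one UD (planned maintenance) or one FD (hardware fault) offline and counts what is still running in each availability set. The data is copied from the table; the `survivors` helper is a hypothetical name of mine. It also takes the same domain number down in every set at once, which is a worst-case assumption since each AS has its own domains.

```python
# (vm, availability_set, update_domain, fault_domain), copied from the table above
VMS = [
    (1, "www", 0, 0),
    (2, "www", 1, 1),
    (3, "api", 0, 0),
    (4, "api", 1, 1),
    (5, "api", 2, 2),
    (6, "api", 3, 0),
    (7, "db", 0, 0),
    (8, "db", 1, 1),
]


def survivors(vms, down_ud=None, down_fd=None):
    """Count the VMs still running per availability set when one UD or FD is offline."""
    running = {as_name: 0 for _, as_name, _, _ in vms}
    for _, as_name, ud, fd in vms:
        if ud == down_ud or fd == down_fd:
            continue  # this VM sits in the domain that is offline
        running[as_name] += 1
    return running


for ud in range(5):
    print(f"UD {ud} in maintenance:", survivors(VMS, down_ud=ud))
for fd in range(3):
    print(f"FD {fd} failed:", survivors(VMS, down_fd=fd))
```

Every line of output should show at least one VM running in each of www, api and db, which is exactly the point of grouping the servers this way.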

For more on Availability Sets and how to configure them in Azure, I highly recommend reading “Manage the availability of virtual machines”.

@marc_gagne
