Rules of thumb for high availability systems (Infrastructure)

Never be at more than ½ capacity

If you’re planning a truly highly available system, you have to be aware that a serious percentage of your hardware can be forcefully cut from your organization’s torso at any moment. You are not exempt from this rule on holidays, weekends, or vacations. Losing power equipment, losing networking gear, the help tripping over cables, acts of God. If you aren’t prepared to have a random half of your organization’s hardware disconnected at any moment, then you aren’t HA yet.
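Reduced to arithmetic, the rule is a single comparison. Here’s a minimal sketch (the function name and the request-rate numbers are made up for illustration):

```python
def can_survive_half_loss(current_load, total_capacity):
    """Return True if losing half the fleet still leaves enough capacity.

    The rule of thumb: never run above 1/2 capacity, so that an unlucky
    failure taking out a random half of your hardware doesn't take the
    service down with it.
    """
    surviving_capacity = total_capacity / 2
    return current_load <= surviving_capacity

# With 10 servers each rated for 100 req/s, stay at or under 500 req/s total.
print(can_survive_half_loss(450, 1000))  # True: losing half still works
print(can_survive_half_loss(700, 1000))  # False: over the 1/2 line
```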

If you don’t have 2 spares, you aren’t ready yet

Murphy was an optimist. If you’ve never replaced a dying (or dead) hard drive with a new hard drive which… doesn’t work (or RAM, or a CPU), then you haven’t been in ops long enough. Sometimes your backup plan needs a backup plan, and you have to have it. There’s no excuse for being offline, so you need not one but two (or more) possible replacements for every point of failure.

Disaster Recovery is an ongoing process

The tricky thing about highly available systems is that you have to keep working… while you’re recovering. Any time you’re planning your HA setup and you work around a point of failure, stop and think a moment about what it will take to replace that failed point. If it requires bringing things down again… that’s no good.

Growth planning should always be done in exponents

Never again are you to talk (or think) of doubling growth. You shall from this point forward think in squares, and cubes, and the like. In the age of information you tend to gather data at an alarming rate; don’t let it overtake you!
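A toy projection makes the point (the starting size and growth rate here are invented): growth that compounds gets away from you far faster than a linear plan would budget for.

```python
def project_growth(initial_tb, rate, periods):
    """Project data volume when growth compounds each period.

    initial_tb: starting data volume in TB
    rate:       multiplier per period (3 means the data triples)
    periods:    how many periods out to project
    """
    return initial_tb * rate ** periods

# 1 TB today, tripling each year: four years out you're planning for
# 81 TB, not the handful of TB that straight-line thinking suggests.
print(project_growth(1.0, 3, 4))  # 81.0
```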

If you depend on a backup, it’s not HA

“What’s that? The primary server is offline? Do we have a spare? No, but we have a backup. How long? Oh… 36 hours… What? No, I can’t speed it up.” Let’s face it: if you’re restoring your live system from backup, you’ve screwed the pooch. Backup is NOT high availability, but it is good practice, and when it comes down to it, 36 hours is marginally better than never.

Self healing requires more thought than you’ve given it

The simple fact of life in the data center is that all services are an interlocking tapestry, and if the threads break down, the tassels fall off. Self healing is not only about detection and removal; it’s also about rerouting data. If the database server that you normally write to has gone down, you can detect it, but can you instantly rewire the 8 different internal services which feed into that database to write to a different server? And then back again?
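One way to make that rewiring tractable is a single point of truth for “which server do I write to?”. This is a hypothetical sketch (class and host names are made up): if every internal service asks the registry instead of hard-coding a hostname, failing over means one update here rather than touching eight services by hand.

```python
class MasterRegistry:
    """Toy registry for the current write master.

    Services call active() before each write instead of baking a
    hostname into their own configs, so failover and fail-back become
    a single state change.
    """

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.active = primary

    def fail_over(self):
        # Point all writers at the standby.
        self.active = self.standby

    def fail_back(self):
        # And then back again, once the primary is healthy.
        self.active = self.primary

registry = MasterRegistry("db-write-1", "db-write-2")
print(registry.active)   # db-write-1
registry.fail_over()
print(registry.active)   # db-write-2
```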

DNS is harder than you think, and it matters more than ever

The one piece of infrastructure that people rely on most, and know the least about, is DNS. DNS might as well be a piece of hardware, because if your users can’t type in your domain name to get to you, there’s absolutely zero chance they’ll have your IP address handy. Worse yet, DNS is the number one thing that I see administrators screw up, all the time. Talking zone files with (sometimes veteran) administrators is like talking Klingon to a 2-year-old. It usually doesn’t work too well.
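For one concrete zone-file gotcha: secondaries only transfer a zone when the SOA serial increases, so an edit that forgets to bump the serial silently never propagates. A minimal sanity check (function name and serial values are illustrative):

```python
def serial_needs_bump(old_serial, new_serial):
    """Return True when a zone edit will NOT propagate to secondaries.

    Secondary name servers compare SOA serials and only transfer the
    zone when the serial has increased; an unchanged (or decreased)
    serial means your "updated" zone goes nowhere.
    """
    return new_serial <= old_serial

# Classic mistake: edit the records, forget the serial.
print(serial_needs_bump(2024010101, 2024010101))  # True: won't propagate
print(serial_needs_bump(2024010101, 2024010102))  # False: bumped correctly
```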

Rules of thumb for high availability systems (Databases)

Replicating data takes longer than you think

In this brave new world of terabytes per week there’s a nasty truth: replicating that much data across a large number of nodes is a headache, and it’s usually not as fast as you want it to be. Instantaneous replication is nice, but generally speaking you’re writing to one server and reading from X number of others. Your read servers, therefore, not only bear the same load as the write server (having to replicate everything that goes into it) but also have to bear the additional load of supporting the read requests. A frequent mistake admins make is putting the best hardware into the write server and using lesser machines for the read servers. But if you’re truly processing large amounts of data, this creates a dangerous situation: if a read server stops for a while, it might take days or weeks to catch up. Bad juju.
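The catch-up arithmetic is worth internalizing. A lagging replica only gains ground at the difference between its replay rate and the master’s write rate; if it can’t replay faster than the master writes, it never catches up at all. A sketch with invented numbers:

```python
def catch_up_seconds(lag_events, write_rate, replay_rate):
    """Seconds for a lagging replica to catch up to the master.

    While the replica replays at replay_rate events/s, fresh writes
    keep arriving at write_rate events/s, so the backlog only shrinks
    at the difference between the two.
    """
    if replay_rate <= write_rate:
        return float("inf")  # falls further behind forever
    return lag_events / (replay_rate - write_rate)

# A replica 1,000,000 events behind, the master writing 500 events/s,
# the replica managing 600 events/s while also serving reads:
print(catch_up_seconds(1_000_000, 500, 600))  # 10000.0 seconds (~2.8 hours)
```

This is exactly why putting the weakest hardware on the read side is dangerous: it squeezes the replay-rate margin toward zero.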

Less is more, and then more is more, and then less is more again

In the beginning you had data optimization. Everything pointed to something, and your masterfully crafted database schema duplicated absolutely no piece of information. Then you increased your size and volume to the point that this approach became too cumbersome to sustain your access times, so you moved to a new schema where you could select all the data you need in one statement, but data is duplicated everywhere. And finally this monolithic approach has locked you into multi-million-dollar pieces of hardware, so you need to re-normalize your data so that it can be partitioned onto multiple clusters. Expect this, plan for it, and be prepared for the hard truth: this is a truly painful process!
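The three stages can be sketched with one toy record (all names and values here are made up; dicts stand in for tables):

```python
# Stage 1 -- normalized: nothing duplicated, reads need a "join".
users = {1: {"name": "alice"}}
orders = [{"user_id": 1, "total": 30}]
stage1 = [{"name": users[o["user_id"]]["name"], "total": o["total"]}
          for o in orders]

# Stage 2 -- denormalized: one read, but the name is copied into every row.
stage2 = [{"name": "alice", "total": 30}]

# Stage 3 -- re-normalized and partitioned (here: sharded by user id),
# so each cluster holds its own slice of the normalized data.
partitions = {
    0: {"users": {}, "orders": []},
    1: {"users": {1: {"name": "alice"}},
        "orders": [{"user_id": 1, "total": 30}]},
}

print(stage1 == stage2)  # True: same answer, very different cost profiles
```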

Spend the money here, if nowhere else

If you deal in information, you absolutely have to spend real money here. This is not the place to skimp. If you do… you’ll be sorry.

Rules of thumb for high availability systems (Employees and Departments)

False positives breed contempt

If you routinely get SMS alerts for no reason at 3:00am when you’re sound asleep, and it always ends up being a false alarm, there will come a time when you just opt to ignore the pager. And that time, not only will wolf have been cried, but the flock will truly be under attack. Always, always work to reduce false positives, and set reasonable alerting thresholds. Is something an emergency worth getting up for at 3:00am, or isn’t it? Sure, a web server went down and was removed, but there are 13 others, all functioning. You can sleep. But if you lost half of them… something’s probably up!
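The threshold logic is one line; the judgment call is where to set it. A minimal sketch (function name and the one-half threshold are illustrative, not prescriptive):

```python
def should_page(down, pool_size, threshold=0.5):
    """Wake a human only when the loss is big enough to matter.

    One web server out of 14 failing is routine and self-handled;
    losing more than half the pool is an emergency. The threshold is
    a tunable judgment call per service.
    """
    return down / pool_size > threshold

print(should_page(1, 14))  # False: let them sleep
print(should_page(8, 14))  # True: something's probably up
```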

No department is an island

Contrary to popular belief, it takes more than the ops department to design a truly HA system. For example, your admins aren’t allowed to just start monkeying with the database schema whenever they feel like it. Sure, it’s more highly available now, but the application can’t use it any more. Just as no man is an island, neither is the ops department. You can work with them (good) or you can work against them (bad), but choose wisely.

If operations warns that the sky is going to fall, take them seriously

Let’s face it: if your auto mechanic says your alternator will die very soon, you replace it. If your inspector says you’ve got the beginnings of a termite problem, you address it. If your weatherman tells you it might rain today, you grab your umbrella on your way out the door. So when your ops team comes into your office telling you that you have exactly 90 days until your database server becomes a very heavy, very hot, very expensive paperweight, why would you ignore that? Usually when ops says the sky is about to fall, it’s because they were up in the clouds fixing the slightly off-color shade of silver you were complaining about, and they saw the cracks forming. Ignore them at your own risk, but don’t say they didn’t warn you.

If you don’t spend the money on ops, nothing runs.

Without your engine, your car doesn’t run. Without your heart, you die. And without giving the ops department the necessary resources, the application that you’ve invested so heavily in will not run, because there will be nothing to run it on. Or worse yet: it’ll run, but something will break every 3rd day. You cannot skimp here. Well, you can, but you don’t get high availability as well as a low price tag. It’s a pain in the ass… but when you bought the Saturn, you had no right to expect NASCAR results.

6 thoughts on “Rules of thumb for high availability systems (Infrastructure)”

  1. Anonymous says:

    You obviously never actually ran a high availability datacenter – or at least a cost-effective one. Preparing for every potential problem is prohibitively expensive, if not impossible (2 spares for everything? Get a grip!)

    You need to learn about risk management and not always rely on risk avoidance or complete mitigation, which is impossible.

    I am not sure where you work (or if you even do, since bloggers usually have way too much time on their hands), but you must be either a very bright 12-year-old or a really stupid 20-something who has a lot to learn.

  2. "You obviously never actually ran a high availability datacenter"

    I think you're being a little too judgemental – I'm not at all sure the author is saying preparing for every potential problem is appropriate. Firstly, the author is talking about rules of thumb; secondly, you could interpret the post as "in a perfect world we'd do this" and treat it as a thought exercise about what is and is not possible.

    Certainly risk management is important, but how about a disaster recovery plan, or consideration of what is or is not important?

    Lastly, if you are going to hurl insults, posting under the banner of anonymous can only lead to you being labelled a coward who isn't prepared to stand up and argue face to face – shame on you.

  3. Thank you, Dan. I appreciate your comment. It's true, this is a mixture of perfect-world thinking and thought exercise. I wish it were possible to follow all of these rules to a T, but alas the real world calls for compromises and conflicting goals.

  4. Sotweed Schnible says:

    As a senior datacenter architect for a very large Fortune 500 (sorry, can't say who, but we have 5 nines uptime), this post is a little naive and is also stating some pretty basic things, which I guess was the point.

    As far as the content, I couldn't care less, but the COMMENTS ARE HILARIOUS – Dan Cresswel, the POSTER is anonymous, also using some sort of hacker moniker, which means it is probably a pretty smart teenager with a lot of time on his hands. Anonymous – I guess you should reveal yourself; you are right, but a little harsh – however, harshness is real world.