Anatomy of a mostly-dead network catastrophe

One of the things that you hear over and over again is that “the network is not reliable.” You hear people say it, blog it, write it down in books, podcast it (I’m sure). You hear it, you think to yourself “oh… that makes sense…” and you go on your merry way. You’re developing your web app, and all is well. You never think about that old saw again… YOUR network is reliable.

Of course it is… it’s all sitting in one cage. You have your dedicated high-availability pair of managed gigabit switches. And if the internet connection fails, nothing bad happens to your application; it just doesn’t see requests for a while, right? Uh-oh! You’ve blindly wandered into this particularly insidious trap without even knowing it!

Later on, your web site is flourishing, traffic is huge, investors are happy. Memcaching objects took you to the next level (oh no! the trap has teeth!). The stage is set! You’ve purchased a second data center. You have your memcached objects invalidating across the internet; you tested, you deployed, and you’ve just adjusted your DNS. Traffic starts flowing into TWO places at once. All is well, and you pat yourself on the back.
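
For the sake of argument, the write path in this kind of setup tends to look something like the sketch below. I’m using pymemcache purely for illustration, and the hostnames, schema, and key scheme are all made up: on every write you hit the database, then delete the key from the local pool *and* from the pool on the other side of the WAN.

```python
# A minimal sketch of the cross-datacenter invalidation described above.
# pymemcache is used just for illustration; hosts, schema, and keys are hypothetical.
from pymemcache.client.base import Client

local_cache = Client(("cache.dc1.example.com", 11211))
remote_cache = Client(("cache.dc2.example.com", 11211))  # the other datacenter, over the WAN

def update_user_name(db, user_id, new_name):
    """Write to the database, then invalidate the cached object in BOTH datacenters."""
    # db is assumed to be a DB-API style cursor; the table is made up.
    db.execute("UPDATE users SET name = %s WHERE id = %s", (new_name, user_id))
    key = "user:%d" % user_id
    local_cache.delete(key)   # cheap: same cage, same switches
    remote_cache.delete(key)  # the trap: this call crosses the internet
```

Notice that the remote delete sits squarely in the request path. That’s exactly the line that comes back to haunt you below.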

Three months later… you’ve been up late… drinking… you’re exhausted and buzzed… it’s 4:00am… you just got to sleep… and your cell phone goes absolutely haywire. Your baby is dying.

Your httpd connections are all maxed out. Your caches are out of sync. Your load average just hit about 50. In short, the sky is falling. After poking around you realize that you’re seeing 90% packet loss between your two sites. The httpd connections are piling up because of the latency involved in the remote memcached invalidations. Load goes up because the httpd servers are working their butts off and getting nowhere.
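
Part of what bites here is that most cache clients will happily block until the far end answers (or TCP finally gives up). A hedged sketch of one defensive tweak, again with pymemcache standing in and made-up hostnames: give the remote pool an aggressive timeout and treat the remote delete as best-effort, so a sick WAN link can’t hold an httpd worker hostage.

```python
# Sketch: don't let a degraded WAN link stall your request path.
# pymemcache shown for illustration; the hostname is hypothetical.
import socket

from pymemcache.client.base import Client
from pymemcache.exceptions import MemcacheError

remote_cache = Client(
    ("cache.dc2.example.com", 11211),
    connect_timeout=0.25,  # fail fast if the link is sick
    timeout=0.25,          # per-operation socket timeout
)

def best_effort_remote_delete(key):
    """Try to invalidate the remote copy, but never block the request on it."""
    try:
        remote_cache.delete(key)
        return True
    except (MemcacheError, OSError):
        # Log it and move on; the request stays fast even at 90% packet loss,
        # at the cost of possibly leaving a stale entry behind (see below).
        return False
```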

Finally it clears up… Go back to sleep, right? WRONG. Now your data centers are showing different data for the same requests!!! Replication seems to be going fine… AHH, memcached. Those failed sets and deletes… Restart the cache. OH NO! Load alerts on the database servers… OH RIGHT… we implemented memcached because it helped out with the db load… makes sense… guess remote updates/deletes are good but not perfect… what now?
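
In hindsight, one cheap safety net (not a fix, just a bound on the damage) is to never set anything without an expiry. Then a delete that gets lost during a packet-loss window means a few minutes of stale data instead of stale-until-someone-restarts-the-cache. A rough sketch under the same hypothetical setup; the TTL value is an assumption, not a recommendation:

```python
# Sketch: cap how long a lost invalidation can hurt by always setting a TTL.
# Key names and the 300-second bound are made up for illustration.
import json

STALE_BOUND_SECONDS = 300  # worst-case staleness if a remote delete never arrives

def cache_user(cache, user_id, user_row):
    key = "user:%d" % user_id
    # expire=... means a missed delete self-heals within STALE_BOUND_SECONDS
    cache.set(key, json.dumps(user_row), expire=STALE_BOUND_SECONDS)
```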

What do you mean, what now? You sit and wait for your caches to repopulate from the db and for the httpd connections to stop piling up. You count your losses, clear everything up, and think long and hard about how to avoid this in the future.
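
The “wait for the caches to repopulate” part hurts precisely because every miss goes straight to the database at once. For completeness, here’s a hedged sketch of the plain cache-aside read path that implies, with a little TTL jitter so a freshly warmed cache doesn’t expire everything in lockstep later. The table, columns, and helper name are all hypothetical:

```python
# Sketch: cache-aside read that repopulates from the database on a miss.
# Table, columns, and key scheme are made up for illustration.
import json
import random

def get_user(cache, db, user_id, base_ttl=300):
    key = "user:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Miss: fall back to the database (this is what hammers the db after a flush).
    db.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
    row = db.fetchone()
    if row is None:
        return None

    user = {"id": row[0], "name": row[1]}
    # Jitter the TTL so the warmed entries don't all expire at the same moment.
    cache.set(key, json.dumps(user), expire=base_ttl + random.randint(0, 60))
    return user
```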

Later on… whose fault was it? It ends up not mattering. It’s always an “upstream provider”, or a “peering partner”, or a “DoS attack”, or some farmer and his backhoe. The point is that it’s not preventable. It will happen again. Them’s the breaks.

So what do you do? Well, that’s the question, isn’t it… I guess it depends on how much cash you have to throw at the problem, the facilities you use, and your application. But believe me when I give this warning: “It’s a hell of a lot harder to think about failure early on, but failure is a hell of a lot easier to deal with when you have.”

Between replication, data conflicts, message delivery, message ordering, playing, replaying, and all the other ideas behind the various kinds of fault tolerance, there is only one immutable truth: nothing is ever foolproof. There is always a single point of failure somewhere if you just look broadly or narrowly enough. Plan your catastrophes and choose your battles. Be ready to pick up the pieces.

All that being said… how do *YOU* handle multiple datacenters, disparate networks, writes, synchronization, and caching? I’d love to hear people’s takes on the issue, as it’s an endlessly fascinating subject.
