Anatomy of a mostly-dead network catastrophe

One of the things that you hear over and over again is that “the network is not reliable.”  You hear people say it, blog it, write it down in books, podcast it (I’m sure.) You hear it, you think to yourself “oh… that makes sense…” and you go on your merry way.  You’re developing your web app, and all is well.  You never think about that old saw again… YOUR network is reliable.

Of course it is… it’s all sitting in one cage.  You have your dedicated high availability pair of managed gigabit switches.  And if the internet connection fails nothing bad happens to your application, it just doesn’t see requests for a while, right? Uh-oh! You’ve blindly wandered into this particularly insidious trap without even knowing it!

Later on your web site is flourishing, traffic is huge, investors are happy.  Memcaching objects took you to the next level (oh no! the trap has teeth!).  The stage is set!  You’ve purchased a second data center.  You have your memcached objects invalidating across the internet, you tested, you deployed, and you’ve just adjusted your DNS. Traffic starts flowing into TWO places at once. All is well, you pat yourself on the back.

Three months later… you’ve been up late… drinking… you’re exhausted and buzzed… it’s 4:00am… you just got to sleep… and your cell phone goes absolutely haywire. Your baby is dying.

Your httpd connections are all maxed out.  Your caches are out of sync.  Your load average just hit about 50.  In short the sky is falling.  After poking around you realize that you’re seeing 90% packet loss between your two sites.  The http connections are piling up because of the latency involved in the remote memcached invalidations.  Load goes up because the httpd servers are working their butts off and getting nowhere.

Finally it clears up… Go back to sleep, right? WRONG.  Now your data centers are showing different data for the same requests!!!  Replication seems to be going fine… AHH, memcached.  Those failed sets and deletes… Restart the cache. OH NO! Load alerts on the database servers… OH RIGHT… we implemented memcached because it helped out with the db load… makes sense… guess remote updates/deletes are good but not perfect… what now?

What do you mean what now? You sit and wait for your caches to repopulate from the db, and the httpd connections to stop piling up.  You count your losses, clear everything up, and think long and hard on how to avoid this in the future.

Later on… whose fault was it? It ends up not mattering. It’s always an “upstream provider”, or a “peering partner” or a “DoS attack” or some farmer and his back-hoe.  The point is that it’s not preventable. It will happen again. Thems the breaks.

So what do you do?  Well, that’s the question, isn’t it… I guess it depends on how much cash you have to throw at the problem, the facilities you use, and your application.  But believe me when I give this warning: “It’s a hell of a lot harder to think about failure early on, but a hell of a lot easier to deal with.”

Between replication, data conflicts, message delivery, message ordering, playing, replaying, and all the other ideas behind the various kinds of fault tolerance there is only one immutable truth:  nothing is ever foolproof.  There is always a single point of failure somewhere if you just look broadly or narrowly enough.  Plan your catastrophes, and choose your battles. Be ready to pick up the pieces.
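For what it’s worth, here is one small piece of “planning your catastrophe”, sketched under assumptions rather than offered as the fix: put a hard deadline on every cross-site call, and write down whatever failed so it can be replayed once the link recovers. The helper name `try_or_spool` and the spool file are made up for illustration, and the command you pass in stands for whatever client actually performs your remote invalidation. It relies on `timeout` from GNU coreutils.

```shell
#!/bin/bash
# try_or_spool SPOOL_FILE SECONDS KEY COMMAND [ARGS...]
# Hypothetical helper: run COMMAND with a hard deadline. If it fails or
# times out, append KEY to SPOOL_FILE so it can be replayed later.
try_or_spool() {
    local spool=$1 secs=$2 key=$3
    shift 3
    # timeout(1) kills the command after $secs seconds and exits non-zero.
    if ! timeout "$secs" "$@" >/dev/null 2>&1; then
        printf '%s\n' "$key" >> "$spool"
        return 1
    fi
}
```

The point of the deadline is that a slow remote site costs each request a bounded wait instead of a stuck httpd slot. The spool is only half an answer, though: something still has to replay it, in order, after the network comes back.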

All that being said… how do *YOU* handle multiple datacenters, disparate networks, writes, synchronization, and caching? I’d love to hear people’s takes on the issue as it’s an endlessly fascinating subject.

Bash: “I Can’t Eat Another Byte”

root@server:/dir/ # ls | wc -l

1060731

root@server:/dir/ # for i in *; do rm -v $i; done

me@home:~/ #

HUH?

Turns out that bash just couldn’t eat another byte, and next time I logged in I saw this: “bash[5469]: segfault at 0000007fbf7ffff8 rip 00000000004749bf rsp 0000007fbf7fffe0 error 6”… Impressive 🙂
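A hedged aside: one way to avoid asking the shell to expand a million names at once is to let find walk the directory and do the deleting itself. This is a sketch, not a diagnosis of the segfault. The function name `purge_files` is made up for illustration, and it assumes a find that supports `-delete` (GNU find does) and that only regular files directly inside the directory should go.

```shell
#!/bin/bash
# purge_files DIR: remove regular files directly inside DIR without shell globbing.
purge_files() {
    local dir=$1
    # -mindepth 1 / -maxdepth 1 limit the walk to DIR's immediate entries.
    # -print lists each file as it is removed (much like rm -v); -delete removes it.
    find "$dir" -mindepth 1 -maxdepth 1 -type f -print -delete
}
```

Usage would be `purge_files /dir`. Since find handles each name itself, filenames with spaces or leading dashes are not a problem either.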

LOLSPAM?

support to spammer: “are you a spammer?”

spammer to support: “oh hai. oh noes!1!!!”

support to spammer: “then what about all this?”

spammer to support: “oh noes! they must have didz  used mah open proxee i forgotted about which sawmhow got installed on all 200 of my vertual hostez!!1! teh icanhasspam script”

support to spammer: “oh, sounds reasonable, don’t let it happen again okay?”

spammer to support: “four surez!!1! lolz!”

Time is as time was as time will be

Every time I start dealing with time, programmatically, I’m reminded of this quote:

Our units of temporal measurement, from seconds on up to months,
are so complicated, asymmetrical and disjunctive so as to make
coherent mental reckoning in time all but impossible. Indeed, had
some tyrannical god contrived to enslave our minds to time, to
make it all but impossible for us to escape subjection to sodden
routines and unpleasant surprises, he could hardly have done
better than handing down our present system. It is like a set of
trapezoidal building blocks, with no vertical or horizontal
surfaces, like a language in which the simplest thought demands
ornate constructions, useless particles and lengthy
circumlocutions. Unlike the more successful patterns of language
and science, which enable us to face experience boldly or at least
level-headedly, our system of temporal calculation silently and
persistently encourages our terror of time.

… It is as though architects had to measure length in feet,
width in meters and height in ells; as though basic instruction
manuals demanded a knowledge of five different languages. It is
no wonder then that we often look into our own immediate past or
future, last Tuesday or a week from Sunday, with feelings of
helpless confusion. …

— Robert Grudin, `Time and the Art of Living’.

# info coreutils date
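A tiny illustration of the asymmetry Grudin is complaining about, as a sketch only: the helper name `days_between` is made up, and it assumes GNU date (for `-d`). It converts both dates to epoch seconds in UTC and divides by the length of a day, which shows that “one month” is not a fixed number of days.

```shell
#!/bin/bash
# days_between START END: whole days from START to END (both YYYY-MM-DD).
# Dates are interpreted as UTC so daylight-saving shifts cannot skew the count.
days_between() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( (end - start) / 86400 ))
}

days_between 2007-02-01 2007-03-01   # prints 28
days_between 2008-02-01 2008-03-01   # prints 29 (leap year)
```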

Scripting without killing system load

Let us pretend for a moment that you have a critical system which can *just* handle the strain that it’s under (I’m sure all of you have workloads well under your system capabilities, or capabilities well over your workload requirements, of course; still for the sake of argument…) And you have a job to do which will induce more load. The job has to be done. The system has to remain responsive. Your classic response to this problem is adding a delay, for example:

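     #!/bin/bash
     # Delete day-old directories under /foo, 500 at a time, pausing between batches.
     cd /foo || exit 1
     # -mindepth 1 keeps ./ itself out of the list; depth options go before the tests.
     find ./ -mindepth 1 -maxdepth 1 -type d -daystart -ctime +1 | head -n 500 | xargs -- rm -rv
     # With GNU xargs, an empty list still runs rm once; rm fails and the loop ends.
     while [ $? -eq 0 ]; do
          sleep 60
          find ./ -mindepth 1 -maxdepth 1 -type d -daystart -ctime +1 | head -n 500 | xargs -- rm -rv
     done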

Of course this is a fairly simplistic example. Still it illustrates my point. The problem with this solution is that the machine you’re working on is likely to have a variable workload where its main use comes in surges. By defining a sleep time you have to either sleep so long that the job takes forever to finish, or flirt with high loads and slow response times. Ideally you would be able to let her rip while the load is low and throttle her back while the load is high, right? Well we can! Like so:

     #!/bin/bash
     # Block while the integer part of the 1-minute load average is greater than $1.
     function waitonload() {
          local loadAvg
          loadAvg=$(cut -f1 -d'.' /proc/loadavg)
          while [ "$loadAvg" -gt "$1" ]; do
               sleep 1
               echo -n .   # one dot per second spent waiting
               loadAvg=$(cut -f1 -d'.' /proc/loadavg)
               if [ "$loadAvg" -le "$1" ]; then echo; fi
          done
     }

     # Same working directory as the first script.
     cd /foo || exit 1
     waitonload 1
     find ./ -mindepth 1 -maxdepth 1 -type d -daystart -ctime +1 | head -n 500 | xargs -- rm -rv
     while [ $? -eq 0 ]; do
          waitonload 1
          find ./ -mindepth 1 -maxdepth 1 -type d -daystart -ctime +1 | head -n 500 | xargs -- rm -rv
     done

This modification will only run the desired commands when the one-minute load average is less than 2; it waits for that condition before continuing the loop. This can be very handy for very large jobs that need to run on loaded systems, especially jobs which can be subdivided into small tasks!
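One limitation of comparing only the integer part of the load average is that thresholds like 1.5 are out of reach. A possible variant, sketched under assumptions: the names `load_ok`, `wait_on_load`, and the `LOADAVG_FILE` override are all made up for illustration (the override exists only so the function can be exercised against a fake file), and awk does the floating-point comparison because bash arithmetic is integer-only.

```shell
#!/bin/bash
# load_ok LOAD MAX: succeed when LOAD is at or below MAX (both may be fractional).
load_ok() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 <= max + 0) }'
}

# wait_on_load MAX: block until the 1-minute load average is at or below MAX.
# Reads /proc/loadavg unless LOADAVG_FILE points somewhere else.
wait_on_load() {
    local max=$1 file=${LOADAVG_FILE:-/proc/loadavg} load
    # The first field of the file is the 1-minute load average.
    # If the file cannot be read, the loop ends and the caller proceeds.
    while read -r load _ < "$file" && ! load_ok "$load" "$max"; do
        sleep 1
    done
}
```

Dropping `wait_on_load 1.5` in place of `waitonload 1` in the loop above would throttle at a load of 1.5 instead of 2.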

And we’re off

So now back off to the vet to pick up the dog who’s gonna spend most of the next few weeks being MISERABLE because he’s gonna have to be crated a LOT, won’t be able to play with the other dogs at all for a while, and will have to wear a satellite-dish-collar. On the plus side he’ll have a drain installed in his ear and he’ll be leaking gross fluids from his head for a while. I guess there’s ALWAYS a silver lining, isn’t there?!

quick update

Well… everything that could go wrong this weekend (outside work) has gone wrong.

My replay tv is dying

My wife’s laptop is dying

I have one dog getting out of surgery right now because of a hematoma

I have one cat with ringworm

One cat got the crap beaten out of her in a fight

I had my tire punctured so badly it needs replacing (this was on the way to the vet last night, after we noticed the swelling in Buddy’s ear, which turned out to be the hematoma)

The dogs chewed a hose

One dog ate my wife’s flip-flops while we were away and puked pieces of them up all over the house

We have to go back to Fremont sometime this week to return the old cable receiver and modem or get sent to collections

Oh, and the IRS sent us a letter about a form missing from my tax return (I used TaxCut to avoid exactly this kind of crap!)

On the plus side… We’re alive 🙂

The iPhone… It’s not even out yet and everyone is drooling over it

And if they aren’t, they should be!  Ajax has long been the missing link between phones as a mobile computing platform and phones as a simple messaging device.  The fact is that there is a vastly larger pool of people willing to write useful web apps than useful Java apps.  I would also argue that it’s easier to write good web apps than Java apps of the same magnitude.  So with Apple’s announcement that the iPhone will support web 2.0 standards (read: AJAX), what was once a tasty looking new toy has become something more. It’s become a tasty toy with a good enough reason for the cost.   I’d have to pay to break my contract with Sprint, start a contract with Cingular, buy the new iPhone, buy the wife a new phone (shared Sprint plan)…. I’m probably looking at $700-$1000 to make the switch.  And I’m already thinking that it’s worth it.  I’m going to hold off though… as long as I can stand it.  I want someone to review it, I want to see how the web explosion hits Cingular’s network… I want to see how hard they are to find at first…  Mostly I just want the damn phone really bad… But I’m gonna try to be a good boy and hold off… Maybe