Anatomy of a mostly-dead network catastrophe

One of the things that you hear over and over again is that “the network is not reliable.” You hear people say it, blog it, write it down in books, podcast it (I’m sure). You hear it, you think to yourself “oh… that makes sense…”, and you go on your merry way. You’re developing your web app, and all is well. You never think about that old saw again… YOUR network is reliable.

Of course it is… it’s all sitting in one cage. You have your dedicated high-availability pair of managed gigabit switches. And if the internet connection fails nothing bad happens to your application, it just doesn’t see requests for a while, right? Uh-oh! You’ve blindly wandered into this particularly insidious trap without even knowing it!

Later on your web site is flourishing, traffic is huge, investors are happy.  Memcaching objects took you to the next level (oh no! the trap has teeth!).  The stage is set!  You’ve purchased a second data center.  You have your memcached objects invalidating across the internet, you tested, you deployed, and you’ve just adjusted your DNS. Traffic starts flowing into TWO places at once. All is well, you pat yourself on the back.

Three months later… you’ve been up late… drinking… you’re exhausted and buzzed… it’s 4:00am… you just got to sleep… and your cell phone goes absolutely haywire. Your baby is dying.

Your httpd connections are all maxed out. Your caches are out of sync. Your load average just hit about 50. In short, the sky is falling. After poking around you realize that you’re seeing 90% packet loss between your two sites. The httpd connections are piling up because of the latency involved in the remote memcached invalidations. Load goes up because the httpd servers are working their butts off and getting nowhere.
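
To put the failure mode in concrete terms: each remote invalidation is just a tiny request to the far-away cache, and under 90% packet loss each one can stall while the page request that triggered it holds its httpd slot open. Here’s a minimal sketch of what one invalidation amounts to on the wire — assuming memcached’s text protocol and a made-up remote host name — where the -w timeout is the whole ballgame:

# hypothetical: delete one cached object on the remote cluster, giving up after 1 second
printf 'delete user:1234:profile\r\n' | nc -w 1 cache.remote-dc.example.com 11211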

Finally it clears up… Go back to sleep, right? WRONG. Now your data centers are showing different data for the same requests!!! Replication seems to be going fine… AHH, memcached. Those failed sets and deletes… Restart the cache. OH NO! Load alerts on the database servers… OH RIGHT… we implemented memcached because it helped out with the db load… makes sense… guess remote updates/deletes are good but not perfect… what now?

What do you mean, what now? You sit and wait for your caches to repopulate from the db, and for the httpd connections to stop piling up. You count your losses, clear everything up, and think long and hard about how to avoid this in the future.

Later on… whose fault was it? It ends up not mattering. It’s always an “upstream provider”, or a “peering partner”, or a “DOS attack”, or some farmer and his backhoe. The point is that it’s not preventable. It will happen again. Them’s the breaks.

So what do you do? Well, that’s the question, isn’t it… I guess it depends on how much cash you have to throw at the problem, the facilities you use, and your application. But believe me when I give this warning: it’s a hell of a lot harder to think about failure early on, but it makes failure a hell of a lot easier to deal with.

Between replication, data conflicts, message delivery, message ordering, playing, replaying, and all the other ideas behind the various kinds of fault tolerance, there is only one immutable truth: nothing is ever foolproof. There is always a single point of failure somewhere, if you just look broadly or narrowly enough. Plan your catastrophes, and choose your battles. Be ready to pick up the pieces.

All that being said… how do *YOU* handle multiple datacenters, disparate networks, writes, synchronization, and caching? I’d love to hear people’s takes on the issue, as it’s an endlessly fascinating subject.

nasty regex

I’m putting this here for documentation purposes, because getting it right was a very frustrating ordeal (I’d never had to match both positively and negatively in the same regex before).

/^(?(?!.+\.php)^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)|^$)/s

What this essentially says is: “true if the string does not match ^.+\.php and the string does match ^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)”. The last bit, “|^$”, never returns true in my case, because we’re matching on URIs, which are always at least one character long (“/”).
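
And a quick smoke test for the curious, assuming a GNU grep built with PCRE support (conditional groups aren’t in POSIX regexes; the original /s modifier is dropped because a URI can’t contain a newline here):

for uri in /index.php /images/logo.png /photo.php.jpg /favicon.ico; do
  if printf '%s\n' "$uri" | grep -qP '^(?(?!.+\.php)^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)|^$)'; then
    echo "match:    $uri"
  else
    echo "no match: $uri"
  fi
done

/photo.php.jpg is the fun case: it ends in an image extension, but the negative half still rejects it.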

All things being equal, the simplest solution tends to be the best one.

Occam’s razor strikes again!

Tonight we ran into an interesting problem. A web service – with a very simple time-elapsed check – started reporting negative elapsed times… Racking our brains, poring over the code, produced nothing. It was as if the clock were jumping around randomly! No! On a whim Barry checked, and the clock was, indeed, jumping around…

# while [ 1 ]; do date; sleep 1; done
Wed May 30 04:37:52 UTC 2007
Wed May 30 04:37:53 UTC 2007
Wed May 30 04:37:54 UTC 2007
Wed May 30 04:37:55 UTC 2007
Wed May 30 04:37:56 UTC 2007
Wed May 30 04:37:57 UTC 2007
Wed May 30 04:37:58 UTC 2007
Wed May 30 04:37:59 UTC 2007
Wed May 30 04:38:00 UTC 2007
Wed May 30 04:38:01 UTC 2007
Wed May 30 04:38:02 UTC 2007
Wed May 30 04:38:19 UTC 2007
Wed May 30 04:38:21 UTC 2007
Wed May 30 04:38:22 UTC 2007
Wed May 30 04:38:23 UTC 2007
Wed May 30 04:38:24 UTC 2007
Wed May 30 04:38:08 UTC 2007
Wed May 30 04:38:09 UTC 2007
Wed May 30 04:38:10 UTC 2007
Wed May 30 04:38:28 UTC 2007
Wed May 30 04:38:12 UTC 2007
Wed May 30 04:38:30 UTC 2007
Wed May 30 04:38:31 UTC 2007
Wed May 30 04:38:32 UTC 2007
Wed May 30 04:38:33 UTC 2007
Wed May 30 04:38:34 UTC 2007
Wed May 30 04:38:35 UTC 2007
Wed May 30 04:38:19 UTC 2007
Wed May 30 04:38:20 UTC 2007
Wed May 30 04:38:21 UTC 2007
Wed May 30 04:38:22 UTC 2007
Wed May 30 04:38:40 UTC 2007
Wed May 30 04:38:41 UTC 2007
Wed May 30 04:38:42 UTC 2007
Wed May 30 04:38:43 UTC 2007
Wed May 30 04:38:44 UTC 2007
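
Until the clock itself gets fixed, the service-side lesson is that wall-clock deltas can legitimately come out negative, so a time-elapsed check has to treat them as suspect. A minimal guard, just a sketch:

start=$(date +%s)
# ... the work being timed ...
end=$(date +%s)
elapsed=$(( end - start ))
if [ $elapsed -lt 0 ]; then
  elapsed=0  # the clock stepped backwards mid-measurement
fi
echo "elapsed: ${elapsed}s"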

Backgrounding Chained Commands in Bash

Sometimes it’s desirable to background a chain of commands so that a multi-step process can run in parallel. And oftentimes it’s not desirable to create yet another script to do a simple task that doesn’t warrant the added complexity. An example of this is running backups in parallel. The script snippet below allows up to 4 tar backups to run at once — recording the start and stop times of each individually — and then waits for all the tar processes to finish before exiting.

max_tar_count=4
for i in 1 3 5 7 2 4 6 8
 do
 # throttle: if the maximum number of tars are already running, wait
 cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
 while [ $cur_tar_count -ge $max_tar_count ]
  do
  sleep 60
  cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
 done
 # background the whole chain: the three steps run one after another,
 # but each chain runs in parallel with the others
 ( date > /backups/$i.start &&
     tar cf /backups/$i.tar /data/$i &&
     date > /backups/$i.stop )&
done
# don't exit until every tar is really done
cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
while [ $cur_tar_count -gt 0 ]
 do
 sleep 60
 cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
done

The real magick above is that final while loop. You DO want it in there, to make the script wait until all the backups are really done before exiting.
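
If you’d rather not grep through ps at all, the shell can count its own children; here’s a sketch of the same idea using the jobs and wait builtins, which also sidesteps accidentally counting somebody else’s tar:

max_tar_count=4
for i in 1 3 5 7 2 4 6 8
 do
 while [ $(jobs -r | wc -l) -ge $max_tar_count ]
  do
  sleep 60
 done
 ( date > /backups/$i.start &&
     tar cf /backups/$i.tar /data/$i &&
     date > /backups/$i.stop )&
done
wait  # returns only after every backgrounded chain has exited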

You’re only ever done debugging for now.

I’m the kinda guy who owns up to my mistakes. I also strive to be the kinda guy who learns from them. So I figured I would pass this on as some good advice from a guy who’s “screwed that pooch”.

There was a project on which I was working, and that project sent me e-mail messages with possible problem alerts. All was going well, and at some point I turned off those alerts. I don’t remember when. And I don’t remember why. Which means I was probably “cleaning up” the code. It was, after all, running well (I guess). But along comes a bug introduced with new functionality (ironically, from somewhere WAAAAAAY up the process chain from my project). And WHAM, errors up the wazoo. But no e-mails. Oops. Needless to say, the cleanup process was long and tedious… especially for something that was avoidable.

I’ve since put the alerting code back into the application, and have my happy little helpers in place fixing the last of the resulting issues.

The lesson to be taken from this is that you’re only ever done debugging for now. Because tomorrow that code, that’s working perfectly now, won’t be working perfectly anymore. And the sources of entropy are, indeed, endless.

DivShare, Day 1 (raw commentary)

I began looking at DivShare a few days ago as a way to store, save, and share my personal photo collection. The idea of auto-galleries, unlimited space, flash video, and possible FTP access was… enticing. But it’s tough to tell how something like this is going to work on a large scale…

So… after messing around with a free DivShare account for a while, I decided it was more worth my while to pay 10 bucks for a pro account and get FTP access than to try to use mechanize (or something similar) to hack out my own makeshift API. Now I have about… oh… 8,000 files I want to upload… so doing that 10 at a time was just _NOT_ going to happen…

After paying for a pro account I was *immediately* granted FTP access, no waiting. And for that I was grateful. Since I take photos at 6MP, and that’s WAY too large for most online uses, I have a shell script which automagically creates 5%, 10%, and 25%-of-original-size thumbnails. This meant that I had an expansive set of files I could upload in only a couple of hours (the 5% thumbs end up totaling less than 200MB). This, I thought, would be an excellent test of their interfaces.
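
That thumbnailing script is nothing exotic, by the way. A rough sketch of the idea, assuming ImageMagick’s convert is installed and the originals are JPEGs:

for pct in 5 10 25
 do
 mkdir -p thumbs_$pct                          # one directory per size
 for f in *.jpg
  do
  convert "$f" -resize ${pct}% thumbs_$pct/"$f"
 done
done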

So, a-FTPing I go. I upload all my files into a subdirectory (005). Visit the dash. Nothing. Visit the FTP-upload page to recheck… maybe I did something wrong. AND WHAM! An 8,000-checkbox form to accept the FTP-uploaded files… ugh. Thankfully they’re all checked by default. I let it load (for a long while) and hit submit… and wait… and wait… and wait. Then the server-side connection times out a while later. Fair enough. Check my dash… about 1,500 of the 8,000 photos were imported… I’m going to have to do this 6 times. Annoying, but doable. Hit the second submit, and pop open another browser to look at my dash. And DivShare did *nothing* with my folder name… that wasn’t translated to a “virtual” folder at all. Tsk tsk.

So I need to put about 1,500 photos, manually, into an 005 folder… and then I realize… I have to do this 20 files at a time… with no way to just show the files that are not currently in a folder.

… uh no …

Ok, so I open up one of the photos that I DID put into the 005 folder, and it did, in fact, make them into a “gallery” of sorts. It made a thumbnail, and displayed all 3,000 photos side by side in something similar to an iframe… no rows. Just one row… 3,000 columns… and waiting as my browser requests each… and every… thumb… from DivShare. Wonderful. The gallery controls are simple enough: an iframe with a scrollbar at the bottom, a next-photo link, and a previous-photo link. And all 3 controls make you lose your place in the iframe when you use them…

Now don’t get me wrong. You get what you pay for. But hey… I did pay this time ;) The service is excellent for what it does. And my use case was a bit extreme. Still, I hope that they address the issues that I’ve pointed out. I’d really like to continue using them, and if they can make my photo process easier I’ll gladly keep paying them $10/mo.

That’s:

  1. Don’t ignore what pro users are telling you when they upload
  2. Process large accepts in the background; let me know I need to come back later
  3. Negative searching (folder == nil)
  4. Mass file controls (either items/page, or all-items-in-view (folder == nil))
  5. Give me a gallery a non-broadband user can use (1,500 thumbs in one sitting tastes bad, more filling)
  6. Don’t undo what I’ve done in the gallery with every click. Finding your place among 8,000 photos is tedious to do even once

And I know I sound like I’m just complaining. And I am. But this is Web 2.0 feedback, baby. Ignore my grouchiness, and (if I’m lucky) take my suggestions and run with them ASAP. The photo/files market is very, very far from cornered!

tags, items, users – lose the joins – gain the freedom.

Long-time readers of this blog will assert that I have no problem presenting an unpopular opinion, and/or sticking my foot in my mouth. Sometimes both at once! (“But wait… there’s more!”) So when N. Shah asks me how he should split his database (a tags table, items table, and users table), I say: the answer is in the question.

You have only one database

Let’s drop the pretense, folks. Let’s come back to the real world. This is the Web 2.0 world. Data is growing at a seriously exponential rate. And desperate times call for desperate measures.

Joins are nice. They’re pretty. They’re convenient. They keep us from having to think very much. But they do NOT promote using commodity hardware for your databases. They just don’t. No, really: an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster RAID… and then you move to buying SANs, and you’re talking about some serious cash for that cache. Or… you think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.

So, N, here’s my answer: split your database once by table, and then split your databases once more by row.

DB:/users

DB:/items

DB:/tags

becomes

DBTags:/tags

DBUsers:/users

DBItems:/items

And then

DBUsers:/users

Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER, depending on your architecture and usage, you could easily split the users as we do the tags (not the items) below.

DBItems:/

  • items_id_ending_in_0
  • items_id_ending_in_1
  • items_id_ending_in_2
  • items_id_ending_in_3
  • items_id_ending_in_4
  • items_id_ending_in_5
  • items_id_ending_in_6
  • items_id_ending_in_7
  • items_id_ending_in_8
  • items_id_ending_in_9

Again, pretty simple: you have your run-of-the-mill integer item IDs; split them by the last digit of the item ID, and you reduce the footprint of any one table to 1/10th of the whole dataset size.
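
The routing code ends up being a one-liner; a sketch with a made-up item id:

item_id=1234567
table=items_id_ending_in_$(( item_id % 10 ))
echo "item $item_id lives in $table"   # items_id_ending_in_7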

DBTags:/

  • tags_crc_ending_in_0
  • tags_crc_ending_in_1
  • tags_crc_ending_in_2
  • tags_crc_ending_in_3
  • tags_crc_ending_in_4
  • tags_crc_ending_in_5
  • tags_crc_ending_in_6
  • tags_crc_ending_in_7
  • tags_crc_ending_in_8
  • tags_crc_ending_in_9

Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word, and storing it with the tag {id|tag|crc32}, you can quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.
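
In shell-sketch form, with cksum standing in for CRC32 (its POSIX CRC isn’t bit-identical to zlib’s crc32, but any stable string-to-integer hash routes just as well):

tag=sunsets
crc=$(printf '%s' "$tag" | cksum | cut -d' ' -f1)   # cksum prints "CRC bytecount"
table=tags_crc_ending_in_$(( crc % 10 ))
echo "tag '$tag' (crc $crc) lives in $table"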

You can still use your join tables, items_to_users and tags_to_items; these tables, consisting of ints, take up almost _NO_ space whatsoever, and so can go wherever convenient (if you query items for users more than users for items, then put the join table in the users db). But you can’t actually perform in-server full joins any longer. Heck, you can even keep two copies of the join data: items_to_tags in the items dbs, and tags_to_items in the tags dbs.
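
The join itself just moves up into the application: resolve the tag to an id on the tags shard, pull the matching item ids from the join table, then fan out to the item shards. A hypothetical sketch with the mysql command-line client (the host and column names here are invented):

crc=$(printf '%s' 'sunsets' | cksum | cut -d' ' -f1)
tag_id=$(mysql -h dbtags -N -e \
  "SELECT id FROM tags_crc_ending_in_$(( crc % 10 )) WHERE crc = $crc AND tag = 'sunsets'")
item_ids=$(mysql -h dbitems -N -e \
  "SELECT item_id FROM tags_to_items WHERE tag_id = $tag_id")
# ...then fetch each item from items_id_ending_in_{item_id % 10}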

So, like many things in life, going cheaper means going a bit harder. But what did we gain? Well, let’s assume 10 EC2 instances…

Ec2a

  • users (w)
  • items 0-1 (w)
  • tags 0-1 (w)

Ec2b

  • items 2-3 (w)
  • tags 2-3 (w)

Ec2c

  • items 4-5 (w)
  • tags 4-5 (w)

Ec2d

  • items 6-7 (w)
  • tags 6-7 (w)

Ec2e

  • items 8-9 (w)
  • tags 8-9 (w)

Ec2f

  • items 0-1 (r)
  • tags 0-1 (r)

Ec2g

  • users (r)
  • items 2-3 (r)
  • tags 2-3 (r)

Ec2h

  • items 4-5 (r)
  • tags 4-5 (r)

Ec2i

  • items 6-7 (r)
  • tags 6-7 (r)

Ec2j

  • items 8-9 (r)
  • tags 8-9 (r)

So that’s a total of about… oh… 1.6 terabytes of space… 18GB of RAM, 17GHz of processor speed, and an inherently load-balanced set of database instances. And when you need to grow? Split by the last 2 digits (16TB), the last 3 digits (160TB), the last 4 digits (1,600TB)…

So, now that you’ve read to the bottom: it’s 1:00am, way past my bed time. Remember that when designing a database you — above all — need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (that’s why it’s called “your problem”), but techniques can be applied, and outlooks can be leveraged.

Disclaimer: some or all of this might be wrong, there may be better ways, don’t blame me. I’m sleep-typing 😉

Whoa, talk about neglecting your weblog! Bad Form!

I know, I know, I’ve been silent for quite some time. Well, let me assure you that I’m quite all right! Are you less worried about me now? Oh, good. (Yes, I’m a cynical bastage sometimes.)

So life has, as it tends to do, come at me pretty fast. I’ve left my previous employer, Ookles, and I wish them all the best in accomplishing everything that they’ve been working towards. And I’ve joined up with the very smart, very cool guys at Automattic. I have to tell you, I’m excited to be working with these guys; they’re truly a great group.

I guess that means I’m… kind of… like… obligated to keep up on my blog now, eh? I’m also kind of, like, exhausted. Jumping feet-first into large projects has a tendency to do that to a guy, though. And truth be told, I wouldn’t have it any other way…

😀

Cheers

DK