Anatomy of a mostly-dead network catastrophe

One of the things that you hear over and over again is that “the network is not reliable.”  You hear people say it, blog it, write it down in books, podcast it (I’m sure.) You hear it, you think to yourself “oh… that makes sense…” and you go on your merry way.  You’re developing your web app, and all is well.  You never think about that old saw again… YOUR network is reliable.

Of course it is… it’s all sitting in one cage.  You have your dedicated high-availability pair of managed gigabit switches.  And if the internet connection fails, nothing bad happens to your application; it just doesn’t see requests for a while, right? Uh-oh! You’ve blindly wandered into this particularly insidious trap without even knowing it!

Later on your web site is flourishing, traffic is huge, investors are happy.  Memcaching objects took you to the next level (oh no! the trap has teeth!).  The stage is set!  You’ve purchased a second data center.  You have your memcached objects invalidating across the internet, you tested, you deployed, and you’ve just adjusted your DNS. Traffic starts flowing into TWO places at once. All is well, you pat yourself on the back.

Three months later… you’ve been up late… drinking… you’re exhausted and buzzed… it’s 4:00am… you just got to sleep… And your cell phone goes absolutely haywire. Your baby is dying.

Your httpd connections are all maxed out.  Your caches are out of sync.  Your load average just hit about 50.  In short, the sky is falling.  After poking around you realize that you’re seeing 90% packet loss between your two sites.  The HTTP connections are piling up because of the latency involved in the remote memcached invalidations.  Load goes up because the httpd servers are working their butts off and getting nowhere.
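
The painful part of this shape is that the remote invalidations ride the request path, so every write sits waiting on a lossy WAN link. One way to keep that link from blocking requests is to make the remote delete asynchronous and best-effort. This is only a rough sketch in Python, not what actually ran here: local_cache and remote_cache are hypothetical memcached client objects assumed to expose a delete(key) method.

import queue
import threading

# Rough sketch: remote invalidations go onto a queue and a background worker
# drains it, so a slow or lossy WAN link can never hold up the request path.
# local_cache and remote_cache are hypothetical memcached client objects.
invalidations = queue.Queue()

def invalidate(key, local_cache, remote_cache):
    local_cache.delete(key)                  # local delete stays synchronous
    invalidations.put((key, remote_cache))   # remote delete becomes best-effort

def invalidation_worker():
    while True:
        key, remote_cache = invalidations.get()
        try:
            remote_cache.delete(key)   # client should enforce a short socket timeout
        except Exception:
            pass   # log it: the remote cache is now suspect and may need a flush
        finally:
            invalidations.task_done()

threading.Thread(target=invalidation_worker, daemon=True).start()

The trade-off is exactly the one in this story: a best-effort delete that quietly fails leaves the remote cache serving stale data until you notice.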

Finally it clears up… Go back to sleep, right? WRONG.  Now your data centers are showing different data for the same requests!!!  Replication seems to be going fine… AHH, memcached.  Those failed sets and deletes… Restart the cache. OH NO! Load alerts on the database servers… OH RIGHT… we implemented memcached because it helped out with the db load… makes sense… guess remote updates/deletes are good but not perfect… what now?

What do you mean what now? You sit and wait for your caches to repopulate from the db, and the httpd connections to stop piling up.  You count your losses, clear everything up, and think long and hard on how to avoid this in the future.

Later on… whose fault was it? It ends up not mattering. It’s always an “upstream provider”, or a “peering partner”, or a “DoS attack”, or some farmer and his backhoe.  The point is that it’s not preventable. It will happen again. Them’s the breaks.

So what do you do?  Well, that’s the question, isn’t it… I guess it depends on how much cash you have to throw at the problem, the facilities you use, and your application.  But believe me when I give this warning: “It’s a hell of a lot harder to think about failure early on, but it makes failure a hell of a lot easier to deal with.”

Between replication, data conflicts, message delivery, message ordering, playing, replaying, and all the other ideas behind the various kinds of fault tolerance there is only one immutable truth:  nothing is ever foolproof.  There is always a single point of failure somewhere if you just look broadly or narrowly enough.  Plan your catastrophes, and choose your battles. Be ready to pick up the pieces.

All that being said… how do *YOU* handle multiple data centers, disparate networks, writes, synchronization, and caching? I’d love to hear people’s takes on the issue, as it’s an endlessly fascinating subject.

The iPhone… It’s not even out yet and everyone is drooling over it

And if they aren’t, they should be!  Ajax has long been the missing link between phones as a mobile computing platform and phones as a simple messaging device.  The fact is that there is a vastly larger pool of people willing to write useful web apps than useful Java apps.  I would also argue that it’s easier to write good web apps than Java apps of the same magnitude.  So with Apple’s announcement that the iPhone will support Web 2.0 standards (read: AJAX), what was once a tasty-looking new toy has become something more. It’s become a tasty toy with a good enough reason for the cost.   I’d have to pay to break my contract with Sprint, start a contract with Cingular, buy the new iPhone, buy the wife a new phone (shared Sprint plan)… I’m probably looking at $700-$1000 to make the switch.  And I’m already thinking that it’s worth it.  I’m going to hold off though… as long as I can stand it.  I want someone to review it, I want to see how the web explosion hits Cingular’s network… I want to see how hard they are to find at first…  Mostly I just want the damn phone really bad… But I’m gonna try to be a good boy and hold off… Maybe.

Nasty regex

I’m putting this here for documentation purposes… Because getting it right was a very frustrating ordeal (I’d never had to match both positively and negatively in the same regex before)

/^(?(?!.+\.php)^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)|^$)/s

What this is, essentially, saying is “true if the string doesn’t match ^.+\.php and the string matches ^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)”. The last bit, “|^$”, never matches in my case, because we’re matching on URIs, which are always at least one character long (“/”).
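
If you want to sanity-check that logic somewhere that doesn’t support PCRE’s assertion-style conditional (Python’s re module, for one, only takes a group reference as the condition), the same rule can be written as two separate matches. A quick sketch, where matches() is just my own hypothetical helper:

import re

# Same rule as the conditional above, split in two: reject anything that looks
# like a .php script, accept anything carrying one of the image extensions.
IS_PHP = re.compile(r'^.+\.php', re.S)
IS_IMAGE = re.compile(r'^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)', re.S)

def matches(uri):
    return not IS_PHP.search(uri) and bool(IS_IMAGE.search(uri))

assert matches('/images/logo.png')
assert not matches('/index.php')
assert not matches('/gallery.php?img=/images/logo.png')   # mentions .php, so rejected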

All things being equal, the simplest solution tends to be the best one.

Occam’s razor strikes again!

Tonight we ran into an interesting problem. A web service – with a very simple time-elapsed check – started reporting negative elapsed times… Racking our brains and poring over the code produced nothing. It was as if the clock were jumping around randomly! No! On a whim Barry checked it and the clock was, indeed, jumping around…

# while [ 1 ]; do date; sleep 1; done
Wed May 30 04:37:52 UTC 2007
Wed May 30 04:37:53 UTC 2007
Wed May 30 04:37:54 UTC 2007
Wed May 30 04:37:55 UTC 2007
Wed May 30 04:37:56 UTC 2007
Wed May 30 04:37:57 UTC 2007
Wed May 30 04:37:58 UTC 2007
Wed May 30 04:37:59 UTC 2007
Wed May 30 04:38:00 UTC 2007
Wed May 30 04:38:01 UTC 2007
Wed May 30 04:38:02 UTC 2007
Wed May 30 04:38:19 UTC 2007
Wed May 30 04:38:21 UTC 2007
Wed May 30 04:38:22 UTC 2007
Wed May 30 04:38:23 UTC 2007
Wed May 30 04:38:24 UTC 2007
Wed May 30 04:38:08 UTC 2007
Wed May 30 04:38:09 UTC 2007
Wed May 30 04:38:10 UTC 2007
Wed May 30 04:38:28 UTC 2007
Wed May 30 04:38:12 UTC 2007
Wed May 30 04:38:30 UTC 2007
Wed May 30 04:38:31 UTC 2007
Wed May 30 04:38:32 UTC 2007
Wed May 30 04:38:33 UTC 2007
Wed May 30 04:38:34 UTC 2007
Wed May 30 04:38:35 UTC 2007
Wed May 30 04:38:19 UTC 2007
Wed May 30 04:38:20 UTC 2007
Wed May 30 04:38:21 UTC 2007
Wed May 30 04:38:22 UTC 2007
Wed May 30 04:38:40 UTC 2007
Wed May 30 04:38:41 UTC 2007
Wed May 30 04:38:42 UTC 2007
Wed May 30 04:38:43 UTC 2007
Wed May 30 04:38:44 UTC 2007
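
The service code itself isn’t the interesting part; the gotcha is that any elapsed-time check built on the wall clock inherits the wall clock’s jumps. A minimal illustration (do_work() is just a stand-in, and time.monotonic() assumes Python 3.3 or later):

import time

def do_work():
    time.sleep(0.1)   # stand-in for the real request handling

# Wall-clock version: if ntpd (or anything else) steps the clock backwards
# between the two samples, elapsed comes out negative.
start = time.time()
do_work()
elapsed = time.time() - start        # can go negative when the clock jumps back

# Monotonic version: time.monotonic() never goes backwards, so it is the safer
# choice for measuring durations.
start = time.monotonic()
do_work()
elapsed = time.monotonic() - start   # always >= 0
print(elapsed)

Of course the real fix is to sort out whatever is stepping the clock; the monotonic timer just keeps the service from reporting nonsense while you do.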

You’re only ever done debugging for now.

I’m the kinda guy who owns up to my mistakes. I also strive to be the kinda guy who learns from them.  So I figured I would pass this on as some good advice from a guy who’s “screwed that pooch”

There was a project on which I was working, and that project sent me e-mail messages with possible problem alerts.  All was going well, and at some point I turned off those alerts.  I don’t remember when.  And I don’t remember why.  Which means I was probably “cleaning up” the code.  It was, after all, running well (I guess.)  But along comes a bug introduced with new functionality (ironically, from somewhere WAAAAAAY up the process chain from my project).  And WHAM, errors up the wazoo.  But no e-mails. Oops. Needless to say, the cleanup process was long and tedious… especially for something that was avoidable.

I’ve since put the alerting code back into the application, and have my happy little helpers in place fixing the last of the resulting issues.

The lesson to be taken from this is that you’re only ever done debugging for now. Because tomorrow the code that’s working perfectly now won’t be working perfectly anymore.  And the sources of entropy are, indeed, endless.

DivShare, Day 1 (raw commentary)

I began looking at DivShare a few days ago as a way to store, save, and share my personal photo collection.  The idea of auto-galleries, unlimited space, Flash video, and possible FTP access was… enticing.  But it’s tough to tell how something like this is going to work on a large scale…

So… after messing around with a free DivShare account for a while, I decided it was more worth my while to pay 10 bucks for a pro account and get FTP access than to try and use mechanize (or something similar) to hack out my own makeshift API.  Now I have about… oh… 8,000 files I want to upload… so… doing that 10 at a time was just _NOT_ going to happen…

After paying for a pro account I was *immediately* granted FTP access, no waiting. And for that I was grateful.  Since I take photos at 6MP, and that’s WAY too large for most online uses, I have a shell script which automagically creates 5%, 10%, and 25% of original-size thumbnails.  This meant that I had an expansive set of files I could upload and only take a couple of hours doing it (5% thumbs end up being less than 200MB.)  This, I thought, would be an excellent test of their interfaces.
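
The shell script itself isn’t shown here, but the idea is simple enough that a rough Python/Pillow equivalent is easy to sketch (the directory names and the 85% JPEG quality are my own assumptions, not what the script actually does):

from pathlib import Path
from PIL import Image   # Pillow

# Hypothetical stand-in for the thumbnailing shell script: write 5%, 10%, and
# 25% scaled copies of every JPEG found in an originals/ directory.
SRC = Path('originals')
for pct in (5, 10, 25):
    out = Path('thumbs_%03d' % pct)     # e.g. thumbs_005 for the 5% copies
    out.mkdir(exist_ok=True)
    for photo in SRC.glob('*.jpg'):
        with Image.open(photo) as im:
            w, h = im.size
            small = im.resize((max(1, w * pct // 100), max(1, h * pct // 100)))
            small.save(out / photo.name, quality=85)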

So, a-FTPing I go.  Upload all my files into a subdirectory (005). Visit the dash. Nothing. Visit the FTP upload page to recheck… maybe I did something wrong. AND WHAM! An 8,000-checkbox form to accept FTP-uploaded files… ugh.  Thankfully they’re all checked by default.  I let it load (for a long while) and hit submit… and wait… and wait… and wait.  Then the server-side connection times out a while later.  Fair enough. Check my dash… about 1,500 of the 8,000 photos were imported… I’m going to have to do this 6 times. Annoying, but doable.  Hit the second submit, and pop open another browser to look at my dash.  And DivShare did *nothing* with my folder name… that wasn’t translated to a “virtual” folder at all. Tsk tsk.

So I need to put about 1,500 photos, manually, into the 005 folder… and then I realize… I have to do this 20 files at a time… with no way to just show files that are not currently in a folder.

… uh no …

OK, so I open up one of the photos that I DID put into the 005 folder, and it did, in fact, make them into a “gallery” of sorts. It made a thumbnail, and displayed all 3,000 photos side by side in something similar to an iframe… no rows, just one row… 3,000 columns… and waiting as my browser requests each… and every… thumb… from DivShare. Wonderful.  The gallery controls are simple enough: an iframe with a scrollbar at the bottom, a next-photo link, and a previous-photo link.  And all 3 controls make you lose your place in the iframe when you use them…

Now don’t get me wrong. You get what you pay for. But hey… I did pay this time ;)  The service is excellent for what it does. And my use case was a bit extreme. Still, I hope that they address these issues that I’ve pointed out.  I’d really like to continue using them, and if they can make my photo process easier I’ll gladly keep paying them $10/mo.

That’s:

  1. Don’t ignore what pro users are telling you when they upload
  2. Process large accepts in the background, and let me know when I need to come back
  3. Negative searching (folder == nil)
  4. Mass file controls (either items/page, or all-items-in-view (folder == nil))
  5. Give me a gallery a non-broadband user can use (1,500 thumbs in one sitting tastes bad, more filling)
  6. Don’t undo what I’ve done in the gallery with every click.  Finding your place among 8,000 photos is tedious enough to do once

And I know I sound like I’m just complaining. And I am. But this is Web 2.0 feedback, baby. Ignore my grouchiness, and (if I’m lucky) take my suggestions and run with them ASAP.  The photo/file market is very, very far from cornered!