We’ve been asked how we manage serving files from Amazons very cool S3 service at… This is how. (covering a requested image already stored on S3, not the upload -> s3 process)

A request comes into pound for a file. Pound hashes the hostname (via a custom patch which we have not, but may, release) , to determine which of several backend servers the request should hit. Pound forwards the request to that server. This, of course, means that a given blog always serves from the same backend server. The only exception to the afore-mentioned rule is if that server is, for some reason, unavailable in which case it picks another server to serve that hostname from temporarily.

The request then comes into varnishd on the backend servers. The varnishd daemon checks its 300Gb worth of files cache and (for the sake of this example) finds nothing (hey, new images are uploaded all the time!) Varnishd then checks with the web server (running on the same machine, just bound to a different IP/Port#) and that request is handled by a custom script.

So, a http daemon on the same backend server runs the file request. The custom script checks the DB to gather information on the file (specifically which DC’s it is in, size, mod time, and whether its deleted or not) all this info is saved in memcached for 5 minutes. The script increments and checks the “hawtness” (term courtesy of Barry) of the file in memcached (if the file has been accessed over a certain # of times it is then deemed “hawt”, and a special header is sent with the response telling varnishd to put the file into its cache. When that happens the request would be served directly by varnishd in the previous paragraph and never hit the httpd or this script again (or at least not until the cache entry expires.)) At this point, assuming the file should exist (deleted = 0 in the files db) we fetch the file from a backend source.

Which backend source depends on where it is available. The order of preference is as follows: Always fetch from Amazon S3 if the file lives there (no matter what, the following preferences only ever occur if, for some reason, s3 = 0 in the files db), and if that fails fetch from the one files server we still have (which has larger slower disks, and is used for archiving purposes and fault tolerance only)

After fetching the file from the back end… the custom script hands the data and programatically generated headers to the http daemon, which hands the data to varnishd, varnishd hands the data to pound, pound hands the data to the requesting client, and the image appears in the web browser.

And there was much rejoicing (yay.)

For the visual people among us who like visuals and stuff… (I like visuals…) here goes…

tags, items, users – loose the joins – gain the freedom.

Long time readers of this blog will assert that I have no problem presenting an unpopular opinion, and/or sticking my foot in my mouth. Some times both at once! (“But wait… there’s more!”) So when N. Shah asks me how he should split his database (a tags table, items table, and users table) I say: The answer is in the question.

You have only one database

Lets drop the pretense folks. Lets come back to the real world. This is the web 2.0 world. Data is growing at a seriously exponential. And desperate times call for desperate measures.

Joins are nice. They’re pretty. They’re convenient. They keep us from having to think very much.  But they do NOT promote using commodity hardware for your databases. They just don’t. No, really, an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster raid… And then you move to buying SAN’s and you’re talking about some serious cash for that cache. Or… You think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.

So, N, here’s my answer: Split your database once, and then your databases once.








And then


Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER. depending on your architecture and uses you could easily split the users as we do the tags (not items) below.


  • items_id_ending_in_0
  • items_id_ending_in_1
  • items_id_ending_in_2
  • items_id_ending_in_3
  • items_id_ending_in_4
  • items_id_ending_in_5
  • items_id_ending_in_6
  • items_id_ending_in_7
  • items_id_ending_in_8
  • items_id_ending_in_9

again, pretty simple. you have your run of the mill integer item id’s split them by the last digit of your item id, and you can reduce the footprint of any one table to 1/10th of the whole dataset size


  • tags_crc_ending_in_0
  • tags_crc_ending_in_1
  • tags_crc_ending_in_2
  • tags_crc_ending_in_3
  • tags_crc_ending_in_4
  • tags_crc_ending_in_5
  • tags_crc_ending_in_6
  • tags_crc_ending_in_7
  • tags_crc_ending_in_8
  • tags_crc_ending_in_9

Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word, and storing it with the tag {id|tag|crc332} you can quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.

You can still use your join tables items_to_users, and tags_t_items, these tables consisting of ints take up almost _NO_ space whatsoever, and so can go where convenient (if you query items for users more than users for items, then put the join table in the users db) but you cant actually preform in-server full joins any longer. Heck you can even keep two copies of the join data, items_to_tags in the items dbs, and tags_to_items in the items dbs.

So, like many things in life, going cheaper meant going a bit harder. But what did we gain? Well lets assume 10 ec2 instances…


  • users (w)
  • items 0-1 (w)
  • tags 0-1 (w)


  • items 2-3 (w)
  • tags 2-3 (w)


  • items 4-5 (w)
  • tags 4-5 (w)


  • items 6-7 (w)
  • tags 6-7 (w)


  • items 8-9 (w)
  • tags 8-9 (w)


  • items 0-1 (r)
  • tags 0-1 (r)


  • users (r)
  • items 2-3 (r)
  • tags 2-3 (r)


  • items 4-5 (r)
  • tags 4-5 (r)


  • items 6-7 (r)
  • tags 6-7 (r)


  • items 8-9 (r)
  • tags 8-9 (r)

So thats a total of about… oh… 1.6 terrabytes of space… 18gb of RAM, 17Ghz of processor speed, and an inherently load balanced set of database instances. And when you need to grow? split by the last 2 (16TB) digits, 3(160Tb) digits, 4(1,600TB) digits…

So, now that you’ve read to the bottom. It’s 1:00am, way past my bed time. Remember that when designing a database you — above all — need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (thats why its called “your problem”) but techniques can be applied, and outlooks can be leveraged.

Disclaimer: some or all of this might be wrong, there may be better ways, dont blame me. I’m sleep-typing 😉

Openfound (cont)

If there’s one thing that the OpenFount guys have shown me is that they’re serious about the Infinidisk product.  Mr. Donahue gave me a quick call this evening (seems my e-mail server and his e-mail server aren’t talking properly, so while I get his communications) he has not received mine (probably explaining the lack of response to my pre-sales inquiry)) to chat about his product. The particular bug that I noticed, he mentioned, was fixed a while ago in a later release than I’d tried.  In my defense the page looks precisely like it did when I first got the product, and the release tar file has no version numbers on it… yet… so I did check for updates.  I found out some good info, though.  They’re working on  putting up a trac page for some real bidirectional community participation soon.  They’ll also be putting version numbers in their archives soon. Both of those things will help, I think, improve their visibility to people like me (who have very very little time.)
I’ll be re-testing the infinidisk product again, later, when next I customize an AMI.

Kudos to the openfount guys

I’m really very impressed with the speed at which the Openfount guys responded to my last post. I definitely give Kudos to Bill for being on top of things! I’m running out the door so I’ll keep this short and sweet.

He’s right, I did generalize databases into InnoDB, but thats because it’s what I use. So my apologies for that.
I definitely had no intention of badmouthing the Openfount guys (if thats what it sounded like I did, I apologize) Just reporting what I saw, and my impressions.

The Bill – I would have either used

  • apokalyptik
  • at
  • apokalyptik
  • dot
  • com


  • consult
  • at
  • apokalyptik
  • dot
  • com


  • demitrious
  • dot
  • kelly
  • at
  • gmail
  • dot
  • com

Infinidisk Update

I mentioned a while back that I was going to be playing with the S3 Infinidisk Product.  What I found in my testing was that this product is not prime time ready.  There was a nasty bug which caused data to be lost if the mv command was used. The scripts themselves were unintuitive.  They required fancy-pants nohupping or screening to use long term.  Oh, and a database definitely will not work on top of this FS. It seems obvious in retrospect, but I wanted to be sure.  InnoDB wont even build its initial files much less operated on the FS.  To top it all off, My pre-sales support question was never even so much as acknowledged.

No, I think I’ll be leaving this product alone for now and sticking with clever uses of s3sync and s3cmd, thanks.

Google & Microsoft Working Towards the Perfect Datacenter

We all new that this would happen, google and microsoft going vying to build the biggest field of silicon trees.  But what does this mean, and does it tie in with amazons latest service?!  I think that undoubtedly it does.

There’s talk about a last man standing game when it comes to internet bandwidth.  And I can imagine a time when we might see the internet behaving like the freeways in L.A. at rush hour.  But this is more, I think.

I’ve mentioned before that the whole goal here is to “be the internet”.  I don’t think that goal has changed recently.  Google has sown the world two things:  First that there’s a vast amount of power to be wielded by being “the internet” to the average Tom Dick and Harry, and Second that the title is *always* up for grabs.  A while back Yahoo! was the internet, before that AOL was the internet, before that newsgroups were the internet.  Need I say more?  And each of those companies wielded an extreme sway over the comings and goings of the internet.

But now the internet means a lot more than it used to.  Now the internet is sales, it’s revenue, it’s marketing, people are watching, people are reading, people are listening, and– most importantly — people are being influenced by this “new fangled internet thing”, “oh, you mean Google?”

So there’s now a lot more riding on who gets to “be the internet” these days.  The one thing that ginormous corporate entities can’t seem to get a hold of is the fickle way in which the internet is backwards from real world businesses.  In the real world it’s all too common for a newcomer to storm into a market, take hold of it with genuinely better product, and then let all that slip away into mediocrity and poor quality.  And the kicker is that people will *still* pay for it if it’s crap… as long as its tangible. But the internet is fickle. It’s sort of tangible but more or less ethereal.

I think for the first time people outside the scientific communities are getting wind of a crazy idea: insubstantial value.  That is something that didn’t have value a minute ago, wont have value a minute from now, but at the moment is extremely valuable.  Which, inherently, means that this thing has the constant need to justify itself.  I’m no economics guy, and I’m certainly not in touch with the “average Joe” (who would almost certainly not follow me through more than two or three blog posts) but I think the difference here is that there’s no physical reality to intimidate us.

We don’t have to grow particularly attached to anything on the internet because it’s not “in our lives” we’re in its life. It doesn’t take up space in our house, we take up space in its house. For once in our lives we find that we aren’t the ones who are at the mercy of demand, but are – in fact – in demand.  It’s a feeling of empowerment that is slowly but surely changing the world. Mark my words children n classrooms 100 years from now will be studying the historical impact of all of the events which are happening before our eyes at this very moment, in this place that’s not a place.

I think I’ve become side tracked.  Oh yes, consumers being in demand, corporations unable to handle the discrepancies of the actions of the same people online and off line, and… Ahh yes… The underdog.

Why, do you think, it is that in this virtual world so often it’s the couple of guys who met in college coding outside a cafe, or this dude in his moms basement, or a couple of people who tried to do one thing but failed fantastically into doing something else completely right?  Because people of talent are, all of a sudden, relinquished of the necessity to offer anything physical… People with a talent for the ethereal, all of a sudden, have a place in which the ethereal acquires value.

And, as in any underdog story, these small (sometimes rare) meteoric rises to the top will carry others with them.  And these are the kind of people who remember the hands that helped them up.

So, sure, bandwidth and all that.  But the people who make it easiest for those suited to developing the intangible will have everything to gain in the long run. Amazon sees this, and is doing an amazing job with it.  Their recent successes with S3, SQS, and EC2, are testimony to their understanding of this new ecosystem.  But they ought not to think that Google and Microsoft haven’t noticed this and where the young blood is heading.

Make no mistake, amazon has made extremely agile, grassroots, moves to “be the internet” from the bottom up… But there will soon be a clash of services as G and M do the same from the “top down” and “sideways in” respectively.

I will say this: The first company to crack the database problem will have a distinct advantage in the struggles to come.

Disclaimer: Everything I just said is more than likely to be complete nonsense as I just kind of rambled it out “stream of consciousness” style .

The (theoretical) web services database

I’ve been kind of floating around this topic for a while… Well databases in general… And I see a lot of people who have rather high standards (which is not a bad thing.)  I imagine the complication of offering a service like this comes from the fact that database people have very stringent standards.

Things like ACID transactions, Foreign keys, Table/Row/Column/Field read/write locking, always come up in these types of conversations.  I suppose that this is so because it’s been the standard for so long… It’s just how people *think* about databases… Which means that its what databases should be, right? Right?

Well not long ago the people at Amazon rethought process communication, and rethought storage, and then rethought servers.  Perhaps its about time they rethought the database as well.  I have a hunch (as others have noted here before) that they already are!

I really think that a lot, and I mean a LOT, could be done with a very simple model.

  1. Tables are their own island (no foreign keys)
  2. simple auto Incrementing PK’s
  3. every column indexed
  4. only simple operators supported ( =, >, < , !=, is null, is not null )

Heresy! Ack! Foo! Bar! NO! THATS NOT A REAL DATABASE.  Well, no, not as you mean by “real database” but it certainly is a database.  And I expect it would be good enough for 85% of peoples wants, needs, and desires.

We’ve learned that delays in storage give us permanence.  We’ve learned that the pipeline is a good (and global) thing, and we’ve learned that impermanence gives us expandability.  Necessity being the mother of invention I expect that something like this will be out soon, and I expect that people will learn to be perfectly happy with it.  It’s all about flexibility and agility here people!
It’ll come, people will complain, it’ll work, and as time goes on, I think it’ll get better and better.

Distributed MySQL Via Web Services?

Imagine for a moment, if you will, making your MySQL queries via a REST API. Weird, huh? I’ll admit its a crazy idea, but then a lot of my ideas are crazy. Still. Work with me here.

Query –> || REST API ||

  1. The query is a select
  1. Rest API synchronously determines both which servers are up and which is the fastest to respond.
  2. The API connects to the server with your user name and password (specified in the request header)
  3. The query is run on that server, and the response is passed back through to you.
  4. Connection closed
  • The query is an Insert/Update/Delete
    1. Rest API synchronously determines both which servers are up and which is the fastest to respond.
    2. The API verifies your credentials against that server, and gives you a Query ID
    1. You can then re-query the API with the Query ID to determine if the query has been fully replicated.
  • The API writes the query into replication directories, a la slurpd
  • The query is then passed along to all of the real MySQL servers
  • Plenty of details to iron out here, but it’s certainly feasible… And definitely interesting…

    One Resource to Rule Them All!

    One resource to rule them all,

    One resource to find them,

    One resource to bring them all,

    And in the darkness bind them,

    In the land of server where the shadows lie.

    It’s been a bumpy road to peoples understanding of the EC2 service. And a large part of the problem is a point of view gap between the masses, and Amazon. It’s a lot like an American visiting India wondering why he cant order a steak (disclaimer: I don’t actually know whether you can order a steak in India, but the point essentially remains.) They have a different point of view in regards to the cow.

    So too does amazon have a different point of view on resources. Your average web guy sees a server as a single resource: “that server is very powerful it could do a LOT” or “thats an old server, not a lot can be done with it” Because for so long we was able to get X number of servers, those servers would be assigned roles, and thats what they were. A better server could crawl more pages, or store a larger database, or serve more page views. And of course this meant that the server was specific to the application. But this model gets more and more difficult to maintain as the project gets larger and larger. Anyone who’s gone from 15 to 85 servers knows this. And it boils down to one single point: Permanence does not scale.

    So the amazon guys decided to look at things differently. Your basic components of a server are Mhz, RAM, Bandwidth, and disk space. And they look at a server as a pool of those specific resources. You don’t have 15 good servers, you have 180,000 Mhz, and 120 Gb of ram, and 13,500 Gb of disk space.

    And since permanence doesn’t scale… permanence is built OUT. This is a difficult concept to grasp for most people, and building an application which doesn’t rely on permanence is difficult (myself included!) It’s a learning process, but a necessary one. Once people learn to put permanence in the right places — once we all figure out the tricks of the trade I am of the opinion that the web as a whole will become a much more stable place.

    There certainly will be some growing pains though. For example right now a huge pain the dependence on popular database products (MySQL, PostgreSQL) which are wonderful, don’t get me wrong, But they are, currently, limited to the realm of the server, instead of the realm of the cloud
    So lets all put our heads together and start thinking of ways in which we can make use of the cloud as a cloud. We can do this!

    Amazon Ec2 Cookbook: Startup Flexibility

    Disclaimer: these code segments have not been really “tested” verbatim. I assume anyone who is successfully bundling EC2 images (that run) will know enough about copying shell scripts off blogs to test for typos, etc! Oh, and, sorry for lack of indentation on these… I’m just not taking the time 🙂

    As I’ve been searching for a way of using ec2 in a production environment. To keep it as simple as possible, but also eliminate the need for unnecessary (and extremely tedious) image building both during and after the development process (development of the AMI, not the service). This is what I’ve come up with.

    Step 1: our repository
    Create a subversion repository which is web accessible and password protected (of course) like so:

    • ami/trunk/
    • ami/trunk/files/
    • ami/tags/
    • ami/tags/

    ami/tags/ would read:



    ## Prepare the bootstrap directory
    echo -en “\tPreparing… ”
    if [ -d /mnt/ami ]
    rm -rf /mnt/ami
    mkdir -p /mnt/ami/
    if [ $? -ne 0 ]; then exit $?; else echo “OK”; fi
    ## populating the bootstrap
    echo -en “\tPopulating… ”
    svn –force \
    –username $BootUser \
    –password $BootPass \
    export \
    $BootProtocol://$BootHost/$BootLocation/ \
    /mnt/ami/ 1>/dev/null 2>/dev/null
    if [ $? -ne 0 ]; then exit $?; else echo “OK”; fi
    chmod a+x /mnt/ami/
    ## hand off
    echo -e “\tHanding off to init script…”
    exit $?

    ami/trunk/ would read something like:

    ## Filesystem Additions/Changes
    echo -en “\t\tSynchronizing System Files… ”
    cd /mnt/ami/files/
    for i in $(find -type d)
    mkdir -p “/$i”
    echo -en “d”
    for i in $(find -type f)
    cp -f “$i” “/$i”
    echo -en “f”
    echo ” OK”
    ## Any Commands Go Here
    ## All Done!
    exit 0

    Step 2: configure your AMI

  • create /etc/init.d/servicename
  • chkconfig –add servicename
  • chkconfig –levels 345 servicename on
  • /etc/init.d/servicename should look something like:

    #! /bin/sh
    # chkconfig: – 85 15
    # description: Ec2 Bootstrapping Process
    case “$1” in
    /usr/bin/wget \
    -o /dev/null -O /mnt/ \
    http://user:[email protected]/ami/tags/
    /bin/bash /mnt/
    exit 0
    $0 start
    echo “Usage: $0 {start|stop|restart}”
    exit 1
    exit $RETVAL

    And now when the AMI boots itself up we hit 85 during runlevel 3 bootup (well after network initialization), servicename starts, and the bootstrapping begins. We’re then able, with our shell scripts, to make a great deal of changes to the system after the fact. These changes might be bugfixes, or they might be setup processes to reconstitute a database and download the latest code from a source control repository located elsewhere… They might be registration via a DNS API… anything at all.
    The point is that some flexibility is needed, and this is one way to build that in!