Remember when I mentioned that we’d be seeing some disillusionment?

Well, this morning I found my first piece of it. A blog post (which I’ve since closed and don’t feel like finding again) mentioned that the “fly in the ointment” was that the author wouldn’t be able to load an exceptionally large database into the EC2 grid.

The reason for this complaint is that an image is limited to 160GB (apparently), so minus about 4GB for the OS you have 156GB worth of space on which you could put a database. Now to a lot of you that seems like a lot of data… and it is… but if you’re processing large amounts of data (research, social stuff, search, etc.) 500GB, 1TB, 2TB, 4TB aren’t unheard of.
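To put those numbers side by side, here is a rough back-of-the-envelope sketch. It assumes the roughly 156GB of usable space per image mentioned above and treats 1TB as 1024GB; the function name `images_needed` is made up for illustration, and the result ignores indexes, replication, and headroom.

```python
USABLE_GB_PER_IMAGE = 156  # assumption: 160GB image minus ~4GB for the OS


def images_needed(dataset_gb, usable_gb=USABLE_GB_PER_IMAGE):
    """Return the minimum number of images needed to hold dataset_gb."""
    # Ceiling division using integers, so there is no float rounding.
    return -(-dataset_gb // usable_gb)


# The dataset sizes mentioned in the post: 500GB, 1TB, 2TB, 4TB.
for size_gb in (500, 1024, 2048, 4096):
    print(f"{size_gb}GB needs at least {images_needed(size_gb)} images")
```

Even the smallest of those datasets has to live on more than one image, which is the whole point: at that size the data is going to be spread out one way or another.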

Scientific applications aside, the most likely place you’ll see people desiring databases larger than 160GB is probably going to be in the web 2.0 startup group. And I really have to question the validity of their gripe. If you call in the heavy hitters at MySQL (and I know this, because I’ve worked in a company in which I *HAVE* called in the heavy hitters at MySQL) they’ll tell you, invariably, to spread out your data.

The reasons for spreading out your data are numerous, but the 2 big ones are physical scalability and speed. Let’s assume you’re developing your web app… and right now the data is 500MB. It’s fast, it’s responsive, life is wonderful. Even on your poorly designed monolithic schema. But you take off and start growing at 2x per week! Wonderful! Traffic is GOOD! … … … well…

  • 500MB
  • 1GB
  • 2GB
  • 4GB
  • 8GB
  • 16GB
  • 32GB
  • 64GB
  • 128GB
  • 256GB
  • 512GB
  • 1TB
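The list above is just repeated doubling, and a few lines of code show how quickly it blows past a single image. This is a minimal sketch: it rounds the 500MB starting point to 0.5GB (as the list itself does), assumes growth of exactly 2x per week, and uses the roughly 156GB of usable space from earlier as the limit. The function names are made up for illustration.

```python
def growth_table(weeks, start_gb=0.5, factor=2.0):
    """Return the data size in GB at week 0, 1, ..., weeks."""
    return [start_gb * factor ** week for week in range(weeks + 1)]


def weeks_until_over(limit_gb, start_gb=0.5, factor=2.0):
    """Return the first week at which the data no longer fits in limit_gb."""
    size = start_gb
    weeks = 0
    while size <= limit_gb:
        size *= factor
        weeks += 1
    return weeks


print(growth_table(11))       # the twelve sizes in the list, in GB
print(weeks_until_over(156))  # first week a single ~156GB image is too small
```

Under those assumptions the data outgrows one image in week 9 and hits 1TB in week 11, so there is not much time to redesign a schema once the growth has started.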

Not only does it become increasingly difficult to find a place to store (and back up) your data… it’s gotten SLOWER… the indexes aren’t enough, things take a long time… and you dread running an ALTER TABLE more than a visit from your mother-in-law (disclaimer: I like my mother-in-law just fine).

The solution is to design your schema in such a way that breaking it into pieces is a natural process! If you do this you’ll be ready to maintain your users table in one database cluster on one set of hosts… the log files in another… and various pieces of your data in various places. How to develop a schema which can do this is a whole nother blog post (one which I’ll be very happy to write up if someone requests it).
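As a taste of what that looks like from the application side, here is a minimal sketch of routing each kind of data to its own cluster. Everything in it is an assumption made for illustration: the domain names, the `example.internal` hostnames, and the choice to pick a host inside a cluster by hashing a key. It is one possible approach, not a recommendation from MySQL or a finished design.

```python
import zlib

# Hypothetical layout: each kind of data lives on its own set of hosts.
CLUSTERS = {
    "users": ["users-db-1.example.internal", "users-db-2.example.internal"],
    "logs": ["logs-db-1.example.internal"],
    "content": [
        "content-db-1.example.internal",
        "content-db-2.example.internal",
        "content-db-3.example.internal",
    ],
}


def host_for(domain, key):
    """Return the host that should hold `key` within the given data domain."""
    hosts = CLUSTERS[domain]  # raises KeyError for an unknown domain
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    index = zlib.crc32(str(key).encode("utf-8")) % len(hosts)
    return hosts[index]


print(host_for("users", 42))
print(host_for("logs", "2008-01-15"))
```

The catch is that this only works if the schema never needs to join across domains, which is exactly why the split has to be designed in from the start. Also note that a plain modulo like this reshuffles keys whenever a host is added, so a real setup would need something smarter.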

The point is this: just because someone has a gripe with the service that sounds valid doesn’t necessarily mean that it’s well founded. Try and think through what’s being said for yourself. You’ll end up much happier that way.

Oh, and if the OP is reading this… Perhaps I’d be willing to comment/help on your schema? 😉
