So we’ve touched a bit on what to look for in your database. Those comments were by no means specific, and the requirements will vary from place to place, but the underlying principles are what really matter. Now let’s move on to something a bit more specific: backup.
There is an important caveat to this information: nobody has done this enough to really have a set of scalable, one-size-fits-all tools (or a tool chain) fit for the job… You’ll have to be OK with doing some in-house experimentation, and OK with the idea of maybe making a couple of missteps along the way. As is the case with any new (OK, new to YOU) technology, there are some things you just have to learn as you go.
Setting up a system that is fault tolerant, and managing your risks within it, requires balancing acceptable trade-offs against unacceptable ones. Your main types of backups are as follows:
A) Simple local backup. Your old stand-by tar and its friends bzip2, gzip, and even compress. They’ve been doing backups since backups were backups (or almost, anyhow) and they are good friends. In this kind of situation they aren’t the whole solution, but you can bet your butt that they’re a part of it (see the sketch after this list, which covers both A and C).
B) Hard-copy backup. This isn’t what you want, but it’s worth mentioning. This kind of backup consists of hard disks, tapes, CDs, DVDs, etc., which are copied to and then physically removed from the machine. The advantage of this type of backup is that you can take it offsite in case of a local disaster, but in an EC2+S3 business there is no such thing as a local disaster. So if you, once per week/month/whatever, just copy down your latest backups from S3, that should suffice.
C) Copy-elsewhere backup. This is going to be the bread and butter of the bulk of the solution. It’s not the entire solution, but it’s a fairly big piece. In this case S3 is your “elsewhere”.
D) Streaming backups. Examples of streaming backups are MySQL replication, or pushing data into an Amazon SQS pipe for playback at a later point. Also a key player in what will surely be your eventual strategy.
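To make (A) and (C) concrete, here’s a minimal sketch of a nightly job that tars up a data directory and pushes the archive to S3. The paths, the bucket name, and the use of s3cmd are all my assumptions; substitute whatever S3 upload tool you’ve settled on.

#!/bin/bash
# Nightly backup sketch: local tar/gzip archive, then a copy to S3.
# ASSUMPTIONS: /var/data is what you're backing up, "my-backup-bucket"
# is your bucket, and s3cmd (or a similar tool) is installed and configured.

STAMP=$(date +%Y%m%d)
ARCHIVE="/backups/data-${STAMP}.tar.gz"

# (A) Simple local backup: tar and gzip
tar -czf "$ARCHIVE" /var/data

# (C) Copy-elsewhere backup: S3 is the "elsewhere"
s3cmd put "$ARCHIVE" "s3://my-backup-bucket/nightly/data-${STAMP}.tar.gz"

# Keep only the last 7 local archives so the disk doesn't fill up
ls -1t /backups/data-*.tar.gz | tail -n +8 | xargs -r rm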
Well, that was fun. Learning just enough to be dangerous but not enough to actually do anything… and certainly not enough to answer the question. So let’s get to it.
You will have two distinct areas of backup which will be important to you: the UI end and the DB end. The two should be approached with different goals in mind, because their usage patterns end up being different.
The Front End
You’ve no doubt got a development environment set up somewhere, and as you make bug fixes to it, add features, or change layouts to announce your IPO, you need to push a snapshot to your servers, *AND* any new servers you bring up need to have the new UI code, not the old UI code.
For the sake of argument, I’ll assume that you have an SVN or CVS server which holds your version-controlled code (you *ARE* using version control, right?). Your build process should, ideally, update the stable branch on your revision control server and send out a message to your UI servers that an update is available. They should then download the new code from RCS to a staging directory (public_html.new below), and once it’s there you pull the fast-move trick:
$ mv public_html public_html.$(date +%s) && mv public_html.new public_html
At this point all of your UI servers receive the message at the same time and update at the same time. Any new server should have, in its startup scripts somewhere after the network is brought up, a process which performs the above update before even bringing up your HTTP service.
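Here’s a rough sketch of what that update might look like on each UI server. The repository URL and document root are placeholders, and I’m assuming SVN; the only part that really matters is the swap at the end.

#!/bin/bash
# UI update sketch: export the stable branch into a staging directory,
# then swap it into place.
# ASSUMPTIONS: the repo URL and docroot below are made up.

DOCROOT=/var/www
REPO=http://svn.example.com/repos/myapp/branches/stable

cd "$DOCROOT" || exit 1

# Export a clean copy (no .svn metadata) into public_html.new
svn export --force "$REPO" public_html.new || exit 1

# The fast-move trick: the old tree sticks around, stamped with the
# time, in case you need to roll back
mv public_html "public_html.$(date +%s)" && mv public_html.new public_html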
And that was the easy part… Now for MySQL
As for MySQL, I’ve already outlined my thoughts here in my article: MySQL on Amazon EC2 (my thoughts). Which options you choose here depend on a couple of things: first, the skill level of the people who will be implementing the entire procedure, *AND* the skill level of the people who will be maintaining it (if those aren’t the same people). But one very serious word of caution: whatever you do, stop thinking of an EC2 instance as 160GB of space for MySQL and start thinking of it as 60GB (70GB MAX), because backing up something that you do not have the space to copy is a difficult task which normally requires bringing things offline. Trust me on this.
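A cheap guard along these lines can save you. As a minimal sketch (the datadir and the backup volume below are assumptions), check that there’s room for a full copy before you start:

#!/bin/bash
# Pre-flight space check: refuse to start a backup unless the backup
# volume can hold a full copy of the data directory.
# ASSUMPTIONS: /var/lib/mysql is the datadir, /mnt is the backup volume.

NEED=$(du -sk /var/lib/mysql | awk '{print $1}')   # KB used by MySQL
FREE=$(df -k /mnt | awk 'NR==2 {print $4}')        # KB free on /mnt

if [ "$FREE" -lt "$NEED" ]; then
    echo "Not enough free space to copy the database; aborting." >&2
    exit 1
fi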
My gut feeling is that your best bet for now would be to own or rent one physical server to be the write server for your database setup: something roughly equal to the specs of the EC2 virtual machine, except with 320GB of disk space. That way you could keep your replication logs around for the entire history of your database… for a while, anyway.
You should also keep one extra MySQL instance (on EC2 if you like) up and running for the sole purpose of being up to date. You would then periodically shut it down and copy the entire thing up to S3, so that when you have to bring up a new instance you simply copy those files down, assign the server-id, and let it suck everything new down via replication.
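A hedged sketch of that snapshot cycle on the spare replica might look like this. The bucket and paths are made up, and I’m assuming s3cmd again; STOP SLAVE and START SLAVE are the standard MySQL replication statements.

#!/bin/bash
# Snapshot sketch for the spare replica: pause replication, stop mysqld,
# tar the datadir, push the archive to S3, then bring it all back up.
# ASSUMPTIONS: credentials in ~/.my.cnf, SysV init scripts, s3cmd installed.

mysql -e "STOP SLAVE;"
/etc/init.d/mysql stop

STAMP=$(date +%Y%m%d)
tar -czf "/mnt/mysql-snap-${STAMP}.tar.gz" /var/lib/mysql
s3cmd put "/mnt/mysql-snap-${STAMP}.tar.gz" \
    "s3://my-backup-bucket/mysql/mysql-snap-${STAMP}.tar.gz"

/etc/init.d/mysql start
mysql -e "START SLAVE;"

Restoring onto a fresh instance is the reverse: pull the archive down from S3, untar it into place, give the new server a unique server-id in my.cnf, point it at the master with CHANGE MASTER TO, and let replication catch it up.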
The gotcha here is that this won’t last forever… at least not on one database. There will come a time, if you get a “lot” of usage, when the process of downing a server, copying it, bringing it up, and waiting for replication to catch up becomes infeasible. The numbers eventually just stop adding up. It’s at that point you will have to make a couple of careful choices. If you have laid out your schema properly, you can pull your single monolithic database apart, distribute it among several database clusters, and carry on as you have been. If you have laid out your schema properly in a different way, you will be able to assign certain users to certain clusters and simply write a migration tool for moving users and their data around between database clusters. If you have not laid out your data properly, you can choose to spend time and money re-working your application to make it right, or to spend time and money on big “enterprise class” hardware to buy yourself time to make things right.
Unless you can truly count on being able to bleed money later on, consider your schema VERY CAREFULLY now. It will make all the difference. And if you end up with 2+TB of data which is completely unmanageable… well, don’t say I didn’t warn you. Those kinds of optimizations may seem silly now when you’re only expecting 5-25GB of data, but they won’t be silly in 2-4 years.