Who this document is for: People looking to house large MySQL data-sets on Amazon’s EC2 service, and people looking for the best (that I’ve found) all-in-EC2 solution for fault tolerance and data retention. People looking to get maximum availability.
Who this document is not for: People who are looking for something EASY. This isn’t it. People who have a small data-set which lends itself to just being copied around. And people who don’t actually need 100% uptime. For all of you there are easier ways!
The problem (overview): The EC2 service is a wonderful thing, and it changes the nature of IT, which is no small feat! But there are a couple of problems with the service which make it less than ideal for some situations. To be more clear: the problem is with the way people are used to doing things, as compared to the way things ought to be done on a service (or platform) like EC2. So, as I’ve advocated before, we’re going to look at shifting how *YOU* think about your databases… And as with all change, I promise this will sound bizarre and *BE* painful. Hang in there. You can do it!
The problem (in specific): There are two things that an EC2 AMI (which is what the Amazon virtual machines are called) is lacking. The first and most direct of the two is that EC2 lacks persistent storage. At this point I would like to point out two things: A) EC2 is still in *BETA*, so let’s not be too critical of the product until it hits prime time, okay guys?, and B) the AWS team is working on a persistent storage system to connect to EC2 (so sayeth the forum mods.) The lack of persistent storage means this: after you turn your machine on… you download and install all available security fixes… and you turn it off to play with later. When you turn it back on, your machine needs all of those security fixes again… everything you do with your machine during runtime is LOST when the machine shuts down. You then boot, again, from a clean copy of your AMI image. The second problem is that you are not given a static IP for use with your AMI machine. Though this is the lesser of the two issues, it’s the more insidious. The two “issues” above lend themselves well to setting up a true (stateless) compute cluster… but they don’t lend themselves at all to setting up a database cluster.
While discussing solutions for these problems, let me lay the ground rules bare here. I will be discussing how to work inside the limitations of the EC2 environment. There are better solutions than those I’m going to be touching on, but I’m not a kernel hacker. I’ll be discussing things that you can do through architecture and normal system administration which will help you leverage EC2 in a consistent manner. We’ll also be assuming here that EC2 is a trustworthy service (i.e. if something breaks it’s your fault… and if it’s Amazon’s fault, no more than one of your servers will go down at a time). The method here is a lot like taking your dog to obedience class. The teacher at this class trains the dog owners… not the dog. Once the dog owners understand how to train the dog, the “problem” solves itself.
Step #1: You are to drop the term (and idea of) monolithic databases from your brain. Don’t think it. Don’t utter it. I’ve touched on this briefly before (and if I haven’t, I will in later posts.) As you design your database, make sure that it can be split into as many databases as is needed in the future. And, if at all humanly possible, split horizontally (by key modulus) instead of vertically (by contiguous key ranges)! This not only ensures that you can keep your database instances under control, it also, in the long run, carries your good performance a long long LONG way. You can control your size by splitting vertically (e.g. records 1-1,000,000 are in A, 1,000,001-2,000,000 are in B, 2,000,001-3,000,000 are in C), but that limits your speed on a given segment to the performance of the instance housing it: don’t do that to yourself (you’ll regret it in the morning!). If instead you have all records ending in 0 in A, ending in 1 in B, ending in 2 in C (and so on), you get to take advantage of the fact that not only is each database footprint just 1/10th the disk size it would have been monolithically, but you also get 10x the performance once you spread it across 10 different machines (later on.) And the beauty is that this scheme extends itself very well to even larger data-sets: use 00, 01, 02… or 001, 002, 003… to split into 1/100ths, 1/1000ths (and beyond… you get the idea). These don’t all have to be housed on different servers to start off with. It’s good enough that the databases be set up properly to support this in the future. (A quick sketch of what this routing might look like follows below.)
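To make that concrete, here’s a minimal sketch (in Python, with made-up shard and host names) of what routing by the last digit of a key might look like:
-----
# Hypothetical sketch: route a record to one of ten MySQL shards by the
# last digit of its primary key. The shard and host names are invented
# for illustration; swap in your own.
SHARDS = {digit: ("widgets_%d" % digit, "db-%d.example.internal" % digit)
          for digit in range(10)}

def shard_for(record_id):
    """Return (database_name, host) for a given numeric id."""
    return SHARDS[record_id % 10]   # "all records ending in 0 go to A," etc.

# Example: record 1,234,567 ends in 7, so it lives in widgets_7 on db-7.
print(shard_for(1234567))
-----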
The standard mechanism for fault tolerance in MySQL is replication. But there are a couple of things that people seldom realize about replication, and I’ll lay them bare here.
The first thing that people don’t understand is that you cannot keep your binary logs forever. OK, that’s not entirely true if you don’t write to the db very often. But if you have a write-intensive database you will eventually run out of storage. It’s just no fun to keep 900GB of binary logs handy! It also becomes impractical, at some point, to create a new database instance by re-reading all of the binary logs from a master. Processing all of your 75 billion inserts sequentially when you need a server up… NOW… is not fun at all! Not to mention the fact that if you, at some point, ran a query which broke replication… you’ll find that your rebuild hangs at that point and won’t progress any further without manual intervention.
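As a small aside (mine, not gospel): MySQL’s PURGE BINARY LOGS statement is how you keep that 900GB from piling up. A hedged sketch, assuming the MySQLdb driver and that every slave has already read past the logs being purged:
-----
# Hedged sketch: trim binary logs older than seven days on a master.
# Check SHOW SLAVE STATUS on every slave first -- purging logs a slave
# still needs will break replication.
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="admin", passwd="secret")
cur = conn.cursor()
cur.execute("PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY")
conn.close()
-----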
The other thing that people don’t realize is that repairing a broken (or installing a new) database instance means that you have to take an instance down. Imagine the scenario: you have two db servers, a master and a slave. The hard drives on the slave give out. You get replacements and pop them into the slave. Now it’s time to copy the data back over to the slave. Your options? A) run a mysqldump, bringing your master to a crawling halt for the eight hours it takes, or B) turn the master off and copy the data manually, taking much less time but bringing everything to a complete halt. The answer to this is, of course, to have at least one spare db instance which you can shut down safely while still remaining operational.
Step #2: I’m half the instance I used to be! With each AMI you get 160GB of (non-persistent) disk space, almost 2GB of RAM, and the equivalent of a 1.75GHz Xeon processor. Now divide that, roughly, in half. You’ve done that little math exercise because your one AMI is going to act as two AMIs. That’s right: I’m recommending running two separate instances of MySQL on the single server.
Before you start shouting at the heretic, hear me out!
+-----------+        +-----------+
| Server A  |        | Server B  |
+-----------+        +-----------+
| My  | My  |        | My  | My  |
| sQ  | sQ  |        | sQ  | sQ  |
| l   | l   |        | l   | l   |
|     |     |        |     |     |
| #2 <=== #1 <======> #1 ===> #2 |
|     |     |        |     |     |
+ - - - - - +        + - - - - - +
On each of our servers, MySQL #1 and #2 both occupy a max of 70GB of space. The MySQL #1 instances of all the servers are set up in a master-master topology. And each #2 instance is set up as a slave only of the #1 instance on the same server. So on server A, MySQL #2 is a (one-way) copy of #1 on server A.
With the above setup, *if* server B were to get restarted for some reason, you could: shut down the MySQL instance #2 on server A; copy that #2 data over to both slots on server B; bring up #1 on server B (there should be no need to reconfigure its replication relationship, because #2 was already pointed at #1 on server A); then bring up #2 on server B and reconfigure its replication to pull from #1 on server B. This whole time #1 on server A never went down. Your services were never disrupted.
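For what it’s worth, that last step (repointing the restored #2 at the local #1) boils down to a CHANGE MASTER TO statement. A rough sketch in Python; the ports, credentials, and binlog coordinates are placeholders, not something this setup dictates:
-----
# Hedged sketch: point instance #2 (here on port 3307) at instance #1
# (port 3306) on the same box. Take the real MASTER_LOG_FILE/POS from
# SHOW MASTER STATUS on #1 at the moment the copied snapshot was taken.
import MySQLdb

slave = MySQLdb.connect(host="127.0.0.1", port=3307, user="root", passwd="secret")
cur = slave.cursor()
cur.execute("STOP SLAVE")
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST = '127.0.0.1',
        MASTER_PORT = 3306,
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'repl-password',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS = 4
""")
cur.execute("START SLAVE")
slave.close()
-----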
Also, with the setup above it is possible (and advised) to regularly shut down #2 and copy it into S3. This gives you one more layer of fault tolerance (and, I might add, the ability to back up without going down.)
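To illustrate that backup step (an illustration, not a prescription), here’s a minimal sketch assuming instance #2 listens on port 3307 with its datadir at /var/lib/mysql2, is managed through mysqld_multi, and that you have an S3 bucket ready; boto3 is just one convenient client:
-----
# Hedged sketch: shut down instance #2, tar its datadir, push the
# archive to S3, then restart #2. Paths, ports, credentials, and the
# bucket name are all placeholders.
import subprocess, time
import boto3

DATADIR = "/var/lib/mysql2"                      # datadir for instance #2
ARCHIVE = "/tmp/mysql2-%d.tar.gz" % int(time.time())

subprocess.check_call(["mysqladmin", "-h", "127.0.0.1", "--port=3307",
                       "-uroot", "-psecret", "shutdown"])
subprocess.check_call(["tar", "czf", ARCHIVE, DATADIR])
subprocess.check_call(["mysqld_multi", "start", "2"])   # assumes a [mysqld2] group in my.cnf

boto3.client("s3").upload_file(ARCHIVE, "my-db-backups", ARCHIVE.rsplit("/", 1)[-1])
-----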
- Why can we do this? and why would we do this?
We CAN do this for two reasons: first, MySQL supports running multiple database servers on the same machine (thankfully); second, because we’ve set up our database schema in such a way that we can easily limit the space requirements of any given database instance, allowing us to remain, no matter what, under the 70GB mark on all of our database servers.
- Why WOULD we do this? For a couple of reasons; let me address the specific questions individually.
Why would we reduce our performance by putting two MySQL instances on one AMI? Because you’re a poor startup, and it’s the best alternative to paying for 4 or more instances to run only MySQL. You could increase performance by paying for one AMI per database instance and keep the topology the same. I expect that once you CAN do this… you WILL do this. But likely the reason you’re using EC2 is to avoid spending much capital up front until you make it big with some real money. So I’ve slanted this hypothesis with that end in mind.
- Why would we do something so complicated?
MySQL replication is complicated. It’s error prone. It’s harder (in the long run) than it looks. We use it, and this entire method of managing MySQL on AMIs, because it’s what we have available to us at our budget. Are there better overall solutions? Without the limitations that I’m constrained to here: yes! But again: we’re working solely inside the EC2 framework…
- Why would we do something so susceptible to human error?
You’ve obviously never had someone place a hard copy in the wrong file folder. Or type reboot on the wrong machine. Or delete the wrong file on your hard drive. If you think that operations (on a small scale) is any less error prone, you’re fooling yourself! If you’re looking for speed and agility from your OPS team you have to trust them to do the best they can with the resources given. If you’re stuck *having* to use EC2 it’s likely because of budget, and we’re looking at a circular set of circumstances. Make some good money and then hire a big ops team so that they can set in place a swamp of processes. The theory being: the slower they have to move, the more chance they get to notice something is wrong 🙂
- What would you recommend to make this solution more robust if you were able to spend a *little* more money?
I would put one instance of each replication cluster on a machine you actually own. Just in case we’re looking at an act-of-god style catastrophe at Amazon… you’ll still have your data. This costs you A) a server per cluster, and B) the bandwidth to support replication.
- And finally: what problems will arise that I’m not yet aware of?
A couple that I haven’t touched on, actually.
- First: MySQL replication requires that the server-id be a unique number for each instance in a cluster of computers. And each machine is running two instances of MySQL (meaning two unique server IDs per AMI.) The reason this is an issue is that every time you start your AMI instance the original my.cnf files will be there again, and without intervention all of your servers would end up having the same server ID, and replication would be so broken it would take you years to piece your data back together!
The easy way to circumvent this issue is to have a specific custom AMI build for each of your servers.
The elegant long-term solution is to devise a way, programmatically (possibly using DNS, or even the Amazon SQS service), to obtain two unique server IDs to use before starting MySQL.
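As one hedged example of that programmatic route (using the EC2 instance metadata service rather than DNS or SQS; the 169.254.169.254 endpoint is real, the id scheme is just an illustration), you could derive both server-ids from the instance’s private IP at boot and write them into the two my.cnf files:
-----
# Hedged sketch: derive two unique server-ids per AMI from the last two
# octets of the private IP, one for each local MySQL instance.
import urllib.request

def local_ip():
    url = "http://169.254.169.254/latest/meta-data/local-ipv4"
    return urllib.request.urlopen(url, timeout=2).read().decode()

def server_ids():
    a, b, c, d = (int(octet) for octet in local_ip().split("."))
    base = (c * 256 + d) * 2          # leave the low bit free: two ids per machine
    return base + 1, base + 2         # server-id for instance #1 and instance #2

print(server_ids())
-----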
- Second: without static IP addresses from the EC2 service, your AMIs will have a new IP every time the server boots.
This can be dealt with either manually or programmatically (possibly via a DNS registration, and some scripts resetting MySQL permissions.)
- Third: if, rather like the nursery rhyme which taught children to deal with death by plague in medieval Europe, “ashes, ashes, they all fall down”… what do you do?
Well, hopefully they never “all fall down,” because resetting a cluster from scratch is tedious work. But if they do, you’d better hope that you took one of my two backup options seriously:
either you have a copy of a somewhat recent data-set in S3, or you have an offsite replication slave which can be used for just this circumstance… or you’re back to square one…
Interesting idea on splitting the database vertically. Can you use it with a database that doesn't support this at the database level? Did you actually use this in a real database? I am wondering how you will handle queries like getting all child records for a parent. Say, Widgets have Sprockets, and we are splitting the data across three databases. So we have:
DB1            DB2            DB3
-----------    -----------    -----------
Widget 1       Widget 2       Widget 3
Sprocket 11    Sprocket 12    Sprocket 13
Sprocket 21    Sprocket 22    Sprocket 23
Sprocket 31    …              …
How will you write a query to get all Sprockets for Widget 2? Also, you’d need to assign the IDs for your records in your middle-tier code, and couldn’t use database features like autoincrement columns or GUID columns.
The short answer is that you have to build the data structure around your data. And the long answer is that, assuming you are housing a lot of data on each sprocket, with a lot of keys, etc., it will likely be faster (at the hundreds-of-millions-of-rows level) to have a “phone book” table dealing just with the relationships between your widgets and sprockets. It will also save you space and allow for horizontal expansion in a particularly stubborn schema. (I’ll sketch this below.)
However, I'm pretty sure that if you really sat down with the data you would see an overall workable pattern emerge, with a minimum of edge cases like this. But keep in mind that scaling is, for everyone, a big headache. You can't expect a transition like this to come without some work; I promise there will be work. But what it does give you is the ability to spread out.
As an analogy: a paper clip set onto water standing vertically sinks, but a paper clip placed carefully and horizontally will float. The horizontal paper clip floats because it is able to distribute its weight over a much larger area, and the water's infrastructure (sorry, surface tension) is able to support the same paper clip in this new configuration.
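To make the "phone book" idea from above a bit more concrete, here's a minimal sketch; the table, column, and host names are all hypothetical:
-----
# Hedged sketch: a small directory ("phone book") table maps each widget
# to the shards holding its sprockets, so "all sprockets for widget 2"
# only touches the shards that matter.
import MySQLdb

directory = MySQLdb.connect(host="directory-db", user="app", passwd="secret", db="phonebook")

def sprocket_shards(widget_id):
    cur = directory.cursor()
    cur.execute("SELECT shard FROM widget_sprocket_map WHERE widget_id = %s", (widget_id,))
    return [row[0] for row in cur.fetchall()]

def sprockets_for(widget_id):
    rows = []
    for shard in sprocket_shards(widget_id):
        conn = MySQLdb.connect(host=shard, user="app", passwd="secret", db="sprockets")
        cur = conn.cursor()
        cur.execute("SELECT * FROM sprockets WHERE widget_id = %s", (widget_id,))
        rows.extend(cur.fetchall())
        conn.close()
    return rows
-----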
As for the use of auto_increment, you have two options. You can either a) use a database table which supports transactions and use a two-column approach to managing your row IDs:
Table: auto_inc
Col: data_type    Col: last_id
-----
START TRANSACTION;
UPDATE auto_inc SET last_id = last_id + 1 WHERE data_type = 'widgets';
SELECT last_id FROM auto_inc WHERE data_type = 'widgets';
COMMIT;
(run this against a transactional engine like InnoDB, and adapt the table and column names to your own schema.)
-----
Otherwise, b) you could use these two little tricks in tandem: first, when creating your tables, make sure to set your AUTO_INCREMENT starting value (see http://dev.mysql.com/doc/refman/5.0/en/example-au… ), then make sure your server's auto_increment_increment is set to 10, for example (see http://dev.mysql.com/doc/refman/5.0/en/server-sys… ).
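Here's a rough sketch of that second trick, assuming the ten-way last-digit split from earlier in the post; host names and credentials are placeholders:
-----
# Hedged sketch: every shard gets the same auto_increment_increment and
# a distinct auto_increment_offset, so generated ids never collide and
# each id conveniently ends in its shard's digit.
import MySQLdb

SHARD_HOSTS = ["db-%d.example.internal" % n for n in range(10)]

for n, host in enumerate(SHARD_HOSTS):
    offset = n if n != 0 else 10      # offset must be >= 1; 10, 20, 30... still end in 0
    conn = MySQLdb.connect(host=host, user="admin", passwd="secret")
    cur = conn.cursor()
    cur.execute("SET GLOBAL auto_increment_increment = 10")
    cur.execute("SET GLOBAL auto_increment_offset = %d" % offset)
    conn.close()
-----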
I think any attempt to do reliable persistence on EC2 without using S3 is going to be hard. Even if you bring up multiple AMIs for 'classic' fault tolerance, there's no way in the query or hideous SOAP API (that I've seen) to specify that you want them on separate physical nodes. You get pretend fault tolerance, just like all those companies that tell their telco they want redundant T-1 lines and end up having them merged, without their knowledge, onto the same OC-3 line… something they only find out when the line gets broken.
Postgres has that log you could stream to S3 and then play back to recreate, or you could use the messaging service itself ($$$). Maybe we need a persistence engine that is S3-aware.
Splitting a monolithic table into multiple smaller tables… here, are you suggesting using MySQL Partitioning?
N: What I mean is designing your data storage mechanisms in such a way that you never have to store more than X GB on any single node
Steve: While it's true that you cannot specify which physical nodes you want your servers running on, it's also true that there are exactly 4 virtual systems to a physical system. Therefore if you have 10 instances up you're guaranteed to be on at least 3 physical nodes, and 5 instances guarantees you 2.
So in a classic tagging scenario, where we have a tags table, items table, users table, and tag-item-user table… how would I be able to design the DB such that it can be split the way you suggested: "have all records ending in 0 in A, ending in 1, in B, ending in 2 in C (and so on)"
I think very soon I'll find myself in a situation where EC2's 160GB (actually 70GB) won't be enough for our DB.
So MySQL's recent Partitioning feature (introduced in 5.1) won't be of much help here?
If I'm able to have multiple database instances the way you're suggesting.. should a federated database be used to merge all these disparate database instances? If so, would this mean huge performance impact while performing queries?