Is compute as a service for me?

Note to Nick: I haven't forgotten your request and I'll have something on that soon, but when I started in I found that I had something else to say about compute-on-demand (or compute-as-a-service, terms which I use somewhat interchangeably). So here it is. For all those people just jumping into a project, or considering restructuring a project around these new trends, I hope this helps. I would definitely not consider this (or anything else I write) a GUIDE per se, but food for thought.

We live in an interesting world now, because every new tech project has to ask itself at least one very introspective question: “is computing as a service the right thing for us?” And MAN is that ever a loaded question. At first blush the answer seems like a no-brainer: “of course it's for us! we don't want to pay for what we don't use!” Which is, at the basest level, true. But the devil is always in the details…

So which pair of glasses do you have to approach this problem with? What are the consequences of choosing wrong? How do we do it? Slow down. First you need to put some thought into these two questions: “what do we do?” and “how do we do it?” Because the answers are the foundation for deciding which road leads to success and which to failure.

Are you a media sharing service which houses a billion images and gets thousands more every second? Are you a news aggregator which houses millions of feeds and hundreds of millions of posts? Are you a stock tracking company which copes with continuous feeds of data for portions of the day? Are you a sports reporting company which has five to twenty posts per day but hundreds of thousands of reads? Are you a modest blogger? Do you just like to tinker with projects?

As you can see, all of those are very complex environments with unique needs, stresses, and usage spreads. Writing a blog entry which addresses whether each possible type of business should or shouldn't use on demand computing would be impractical, not to mention impossible. But for the web industry there are a couple of basic types of environments: “Sparse Write, Dense Read” and “Dense Write, Sparse Read”, with subtypes of “Data Dense” and “Data Sparse”.

Environment: Sparse Write, Dense Read

For a lot of web applications you're really not dealing with a lot of data. If you're running a content management system, or you're a directory, you have a finite amount of data which, compared with the number of times it's read, is written to fairly infrequently. (In this case “infrequently written” means that a database's query cache is a useful optimization for you.) It's also very likely that you will be able to take a snapshot of your data in this type of environment in a fairly convenient manner. Compute as a service is probably right up your alley, and here's why.

You are likely to have very predictable times during which your reads (or your writes) spike, meaning that you can actively plan for, set up, and use on demand resources to their fullest potential. Remember that an on demand resource is not an instant problem solver. In the case of something like Amazon EC2 it can take 5, 10, or 15 minutes for the server you've requested to even become active. After the server is up there has to be some process which gets all of the relevant data on it up to date. What this means is that you might be looking at half an hour before your 5 extra servers are ready to handle the 7:00 am to 9:00 am traffic spike that everyone getting to the office in the morning generates. With your service, that's fine though. Just plan to turn the extra power on an hour early and turn it off half an hour after you expect the spike to be over. Wash, rinse, repeat.
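Just to make that planning concrete, here's a minimal sketch (Python, with hypothetical spike times and padding values of my own choosing) of the kind of schedule check a cron job might run to decide whether the extra capacity should be online:

    from datetime import datetime, time, timedelta

    # Hypothetical schedule: the 7:00-9:00 am spike, padded with an hour of
    # warm-up beforehand and half an hour of cool-down afterwards.
    SPIKE_START = time(7, 0)
    SPIKE_END = time(9, 0)
    WARM_UP = timedelta(hours=1)
    COOL_DOWN = timedelta(minutes=30)

    def extra_capacity_needed(now: datetime) -> bool:
        """Return True if the extra on-demand servers should be running."""
        window_start = datetime.combine(now.date(), SPIKE_START) - WARM_UP
        window_end = datetime.combine(now.date(), SPIKE_END) + COOL_DOWN
        return window_start <= now <= window_end

    # A cron job would call this every few minutes and request (or release)
    # EC2 instances accordingly.
    if __name__ == "__main__":
        print(extra_capacity_needed(datetime.now()))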

Environment: Dense Write, Sparse Read

This is the more complicated of the two environments. Everyone and their mother knows how to build a database driven application which gets a few writes and a lot of reads, because that's what your common RDBMS is built for. Think of it as being idiot-proofed out of the box 🙂 But when you have a backwards (as in usage, not as in technology) environment, all of a sudden you have a lot of “conventional wisdom” which isn't so wise anymore (what do you mean a faster write server than read servers causes replication problems? what do you mean my uber-normalization is the problem?).

It's in this type of environment that we really have to look at the subsets of data, because the proof really lies in the pudding — so to speak.

Sub Environment: Data Sparse

You work with a relatively small window of data in realtime. You may or may not get a request for all of the data you're keeping continuously up to date, but you have to keep it that way or it's your butt on the line, right? Well, you're probably in luck. I think it's fairly likely that your data size is a relatively small one; for example, you're keeping a window with a period of 24 hours of data updated. Likely there is a *LOT* of history kept, but that's kept elsewhere. Once you're done with the data you shove it right out the backend into another process and it gets handled there (that backend is likely a sparse write, sparse read environment which is extremely data dense — not for on demand computing (well, maybe, but that's another blog post)).

For this environment compute as a service is probably going to be a godsend… if you can overcome one small, teensy-weensy, ever so small yet still important detail: the development team. Now, not all companies are going to have difficult development teams, but some do, and you simply cannot build an environment ripe for compute as a service without their cooperation, so be prepared whatever the case! You will likely be able to leverage mysqlhotcopy, or an LVM-style snapshot, for instant backups to your long term storage solution (or on-demand setup pool). You will likely be able to leverage the extra compute capacity for your peak load times. And everything will likely turn out OK, so long as you can get some of the crucial application details hammered out.

Sub Environment: Data Dense

I pity you. Compute as a service is probably not what you need. Cases may vary and, again, the devil is in the details. But you have a huge challenge ahead of you: building an environment where a server can be programmatically brought online and then caught up to date with the current compute pool in a time frame that makes doing it worthwhile at all. This is something I'm going to put a lot of thought into… note to self… But unless you have some bright ideas here (and if you do, please send them my way) you have basically one chance: data partitioning. Get yourself a VERY good DBA, and really REALLY plan out your data. If you put enough thought into it in the beginning you have a chance to keep the individual pieces of data small enough (and distributed enough) that they just might lend themselves to compute as a service in a very LARGE way (but we're really talking about going WAY beyond the 10 or 20 allowed Amazon EC2 server instances here).

Uh, OK, enough about these different environments… what do I need to do to USE on demand computing?

Well, that's a difficult question to answer in a generally useful way, so without getting too specific:

First and foremost, you need to have compute as a service in mind at every bit of your planning and executing stages. At every point in the set of long chains which make up your application you have to ask yourself “what happens if this goes away?” and plan for it.

A very close second: think pipes and hoses rather than chisel and stone. Each part of your environment should be as self contained as possible. When one hose springs a leak the solution is simple: replace the hose (and just bypass it in the meantime). But when you lose a section of your monolithic structure, things are a bit more complicated than that.

Finally you need to understand that you will have to work at taking full advantage of compute as a service. Remember that you are going to have to put TIME and ENERGY into using this kind of a service. Nothing comes free, and even in operations every action has an equal and opposite reaction. If you want to avoid spending the time, energy, and money maintaining a hardware infrastructure, you will have to put the same into avoiding one. But the benefits of doing so are real and tangible. When you've spent all of your time building an application which is fault tolerant, rather than an infrastructure which will fail, you invariably provide your user base with a more robust and reliable service.

MySQL on Amazon EC2 (my thoughts)

Who this document is for: People looking to house large MySQL data-sets on Amazon’s EC2 service, and people looking for the best (that I’ve found) all-in-EC2 solution for fault tolerance and data retention. People looking to get maximum availability.

Who this document is not for: People who are looking for something EASY. This isn’t it. People who have a small data-set which lends itself to just being copied around. And people to whom 100% uptime isn’t an issue. For all of you there are easier ways!

The problem (overview): The EC2 service is a wonderful thing, and it changes the nature of IT, which is no small feat! But there are a couple of problems with the service which make it less than ideal for some situations. To be more clear, the problem is with the ways that people are used to doing things as compared to the ways that things ought to be done with a service (or platform) like EC2. So, as I've advocated before, we're going to look at shifting how *YOU* are thinking about your databases… And as with all change, I promise this will sound bizarre and *BE* painful. Hang in there. You can do it!

The problem (in specific): There are two things that an EC2 AMI (which is what the Amazon virtual machines are called) is lacking. The first and most direct of the two is that EC2 lacks persistent storage. At this point I would like to point out two things: A) EC2 is still in *BETA*, let's not be too critical of the product until it hits prime time, okay guys?, and B) the AWS team is working on a persistent storage system to connect to EC2 (so sayeth the forum mods.) The lack of persistent storage means this: after you turn your machine on… you download and install all available security fixes… and you turn it off to play with later. When you turn it back on, your machine again needs all of those security fixes… everything you do with your machine during runtime is LOST when the machine shuts down. You then boot, again, from a clean copy of your AMI image. The second problem is that you are not given a static IP for use with your AMI machine. Though this is the lesser of the two issues, it's more insidious. The two above “issues” lend themselves well to setting up a true cluster… but they don't lend themselves at all to setting up a database cluster.

While discussing solutions for these problems let me lay the docroot bare here. I will be discussing how to work inside the limitations of the EC2 environment. There are better solutions than those I'm going to be touching on, but I'm not a kernel hacker. I'll be discussing things that you can do through architecting and normal system administration which will help you leverage EC2 in a consistent manner. We'll also be assuming here that EC2 is a trustworthy service (e.g. if something breaks it's your fault… and if it's the fault of Amazon, no more than one of your servers will go down). The method here is a lot like taking your dog to obedience class. The teacher at this class trains the dog owners… not the dog. Once the dog owners understand how to train the dog, the “problem” solves itself.

Step #1: You are to drop the term (and idea of) monolithic databases from your brain. Don't think it. Don't utter it. I've touched on this briefly before (and if I haven't, I will in later posts.) As you design your database make sure that it can be split into as many databases as is needed in the future. And, if at all humanly possible, split by key instead of by range! This not only ensures that you can keep your database instances under control, it also, in the long run, carries your good performance a long long LONG way. You can control your size by splitting into contiguous ranges (e.g. records 1-1,000,000 are in A, 1,000,001-2,000,000 are in B, 2,000,001-3,000,000 are in C), but this limits your speed on a given segment to the performance of the housing instance — don't do that to yourself (you'll regret it in the morning!). If, instead, you put all records ending in 0 in A, ending in 1 in B, ending in 2 in C (and so on), then not only is each database footprint only 1/10th the disk size it would have been monolithically, you also get a 10x performance increase once you have the pieces on 10 different machines (later on.) And the beauty is that this scheme extends itself very well to even larger data-sets: use 00, 01, 02… or 001, 002, 003 for 1/100th, 1/1000th (and beyond… you get the idea). These don't all have to be housed on different servers to start off with. It's good enough that the databases be set up properly to support this in the future.
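To make the “records ending in 0 go to A” idea concrete, here is a minimal sketch of a routing function. The shard names and the modulo-10 scheme are illustrations of my own, not anything prescribed above:

    # Map a record to one of ten logical databases by the last digit of its ID.
    # Starting out, all ten can live on one MySQL server; later each can be
    # moved to its own instance without changing application code.
    SHARDS = {digit: f"app_shard_{digit}" for digit in range(10)}

    def shard_for(record_id: int) -> str:
        """Return the logical database name that holds this record."""
        return SHARDS[record_id % 10]

    # Extending the scheme: switch to record_id % 100 (or % 1000) with
    # zero-padded names like app_shard_07 once the data-set outgrows ten pieces.
    assert shard_for(1_000_001) == "app_shard_1"
    assert shard_for(42) == "app_shard_2"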

The standard mechanism for fault tolerance in MySQL is replication. But there are a couple of things that people seldom realize about replication, and I’ll lay them bare here.

The first thing that people don't understand is that you cannot keep your binary logs forever. OK, that's not entirely true – if you don't write to the db very often you can. But if you have a write intensive database you will eventually run out of storage. It's just no fun to keep 900GB of binary logs handy! It also becomes impractical, at some point, to create a new database instance by re-reading all of the binary logs from a master. Processing all of your 75 billion inserts sequentially when you need a server up… NOW… is not fun at all! Not to mention the fact that if you, at some point, ran a query which broke replication, you'll find that your rebuilding hangs at that point and won't progress any further without manual intervention.

The other thing that people don't realize is that repairing a broken (or installing a new) database instance means that you have to take an instance down. Imagine the scenario: you have two db servers, a master and a slave. The hard drives on the slave give out. You get replacements and pop them in the slave. Now it's time to copy the data back over to the slave. Your options? A) run a mysqldump, bringing your master to a crawl for the 8 hours it takes, or B) turn the master off and copy the data manually, taking much less time but bringing everything to a complete halt. The answer to this is, of course, to have at least one spare db instance which you can shut down safely while still remaining operational.

Step #2: I'm half the instance I used to be! With each AMI you get 160GB of (non-persistent) disk space, almost 2GB of RAM, and the equivalent of a Xeon 1.75GHz processor. Now divide that, roughly, in half. You've done that little math exercise because your one AMI is going to act as 2 AMIs. That's right: I'm recommending running two separate instances of MySQL on a single server.

Before you start shouting at the heretic, hear me out!

+-----------+   +-----------+
| Server A  |   | Server B  |
+-----------+   +-----------+
| My  |  My |   | My  |  My |
| sQ  |  sQ |   | sQ  |  sQ |
| l   |  l  |   | l   |  l  |
|     |     |   |     |     |
| #2<=== #1 <===> #1 ===>#2 |
|     |     |   |     |     |
+ - - - - - +   + - - - - - +

On each of our servers, MySQL #1 and #2 both occupy a max of 70GB of space. The MySQL #1 instances of all the servers are set up in a master-master topology. And the #2 instance is set up as a slave only of the #1 instance on the same server. So on server A, MySQL #2 is a (one way) copy of #1 on server A.

With the above setup, *if* server B were to get restarted for some reason you could: A) shut down the MySQL instance #2 on server A, B) copy that MySQL #2 over to both slots on server B, C) bring up #1 on server B (there should be no need to reconfigure its replication relationship, because #2 pointed at #1 on server A already), then D) bring up #2 on server B and reconfigure replication to pull from #1 on server B. This whole time #1 on server A never went down. Your services were never disrupted.

Also, with the setup above it is possible (and advised) to regularly shut down #2 and copy it into S3. This gives you one more layer of fault tolerance (and, I might add, the ability to back up without going down.)
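A rough sketch of that backup step follows. The file paths, socket, bucket name, separate config file for instance #2, and the use of the boto3 library are all assumptions of mine for illustration; the point is that only the local slave (#2) goes down while #1 keeps serving:

    import subprocess
    import time

    import boto3  # assumed S3 client library; any S3 upload tool would do here

    DATA_DIR = "/var/lib/mysql2"             # hypothetical datadir of instance #2
    SOCKET = "/var/run/mysqld/mysqld2.sock"  # hypothetical socket of instance #2
    BUCKET = "my-db-backups"                 # hypothetical S3 bucket

    def backup_instance_two() -> None:
        # Shut down only the local slave (#2); #1 keeps serving traffic.
        subprocess.run(["mysqladmin", f"--socket={SOCKET}", "shutdown"], check=True)
        archive = f"/tmp/mysql2-{int(time.time())}.tar.gz"
        subprocess.run(["tar", "czf", archive, DATA_DIR], check=True)
        boto3.client("s3").upload_file(archive, BUCKET, archive.split("/")[-1])
        # Restart #2 and let it catch up with #1 through normal replication.
        subprocess.Popen(["mysqld_safe", "--defaults-file=/etc/my2.cnf"])

    if __name__ == "__main__":
        backup_instance_two()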

  • Why can we do this? And why would we do this?



    We CAN do this for two reasons: first, MySQL supports running multiple database servers on the same machine (thankfully); second, because we've set up our database schema in such a way that we can easily limit the space requirements of any given database instance, allowing us to remain, no matter what, under the 70GB mark on all of our database servers.

  • Why WOULD we do this? For a couple of reasons; let me address specific questions individually.

    Why would we reduce our performance by putting two MySQL instances on one AMI? Because you're a poor startup, and it's the best alternative to paying for 4 or more instances to run only MySQL. You could increase performance by paying for one AMI per database instance and keeping the topology the same. I expect that once you CAN do this… you WILL do this. But likely the reason you're using EC2 is to avoid spending much capital up front until you make it big with some real money. So I've slanted this hypothesis with that end in mind.

  • Why would we do something so complicated?

    MySQL replication is complicated. It's error prone. It's harder (in the long run) than it looks. We use it, and this entire method of managing MySQL on AMIs, because it's what we have available to us at our budget. Are there better overall solutions? Without placing the limitations that I'm constrained to here: yes! But again, we're working solely inside the EC2 framework…

  • Why would we do something so susceptible to human error?

    You've obviously never had someone place a hard copy in the wrong file folder. Or type reboot on the wrong machine. Or delete the wrong file on your hard drive. If you think that operations (on a small scale) is any less error prone, you're fooling yourself! If you're looking for speed and agility from your OPS team you have to trust them to do the best they can with the resources given. If you're stuck *having* to use EC2 it's likely because of budget, and we're looking at a circular set of circumstances. Make some good money and then hire a big ops team so that they can set in place a swamp of processes. The theory being: the slower they have to move, the more of a chance they get to notice something is wrong 🙂

  • What would you recommend to make this solution more robust if you were able to spend a *little* more money?



    I would put one of each replication cluster's instances on a machine you actually own. Just in case we're looking at an act-of-god style catastrophe at Amazon… you'll still have your data. This costs A) a server per cluster, and B) the bandwidth to support replication.

And finally what problems will arise that I’m not yet aware of?

A couple that I haven’t touched, actually.

  • First: MySQL replication requires that the server-id be a unique number for each instance in a cluster of computers. And each machine is running 2 instances of MySQL (meaning two unique server IDs per AMI.) The reason this is an issue is that every time you start your AMI instance the original my.cnf files will be there again, and without intervention all of your servers would end up having the same server ID, and replication would be so broken it would take you years to piece your data back together!

    The easy way to circumvent this issue is to have a specific custom AMI build for each of your servers.

    The elegant long-term solution is to devise a way, programmatically (possibly using DNS, or even the Amazon SQS service), to obtain two unique server IDs to use before running MySQL (see the sketch after this list for one possible approach).

  • Second: without static IP addresses from the EC2 service, your AMIs will have a new IP every time the server boots.

    This can be dealt with either manually or programmatically (possibly via a DNS registration, and some scripts resetting MySQL permissions and re-pointing replication); the sketch after this list touches on the replication piece.
  • Third: if, rather like the nursery rhyme which teaches children to deal with death by plague in medieval Europe, it's “ashes, ashes, they all fall down,” what do you do?

    Well, hopefully they never “all fall down,” because resetting a cluster from scratch is tedious work. But if they do, you'd better hope that you took one of my two backup options seriously.

    Either you have a copy of a somewhat recent data-set in S3, or you have an offsite replication slave which can be used for just this circumstance…

    Or you're back to square one…
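For the first two issues above, here is one possible sketch of a boot-time script. It leans on the EC2 instance metadata service to learn the instance's private IP (my own choice of mechanism; DNS or SQS, as mentioned above, would work just as well), derives two unique server-ids from it, patches hypothetical my.cnf files, and re-points replication at the peer's new address:

    import re
    import subprocess
    import urllib.request

    # EC2's instance metadata service; the private IP is unique per running
    # instance, which makes it a convenient seed for unique server-ids.
    METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"

    def unique_server_ids() -> tuple[int, int]:
        """Derive two unique MySQL server-ids from this instance's private IP."""
        ip = urllib.request.urlopen(METADATA_URL, timeout=5).read().decode().strip()
        # Assumes all instances share the same 10.x.x.x space, so the last
        # three octets are enough to stay unique (and fit in a 32-bit id).
        _, b, c, d = (int(octet) for octet in ip.split("."))
        base = (b << 16) + (c << 8) + d
        return base * 2 + 1, base * 2 + 2      # odd for #1, even for #2, never zero

    def patch_my_cnf(path: str, server_id: int) -> None:
        """Rewrite (or append) the server-id line in a my.cnf file."""
        with open(path) as handle:
            text = handle.read()
        line = f"server-id = {server_id}"
        if re.search(r"^\s*server-id\s*=", text, flags=re.M):
            text = re.sub(r"^\s*server-id\s*=.*$", line, text, flags=re.M)
        else:
            text += "\n" + line + "\n"
        with open(path, "w") as handle:
            handle.write(text)

    def repoint_replication(master_host: str, socket: str) -> None:
        """After a reboot the peer has a new IP; point this instance at it again."""
        sql = (f"STOP SLAVE; CHANGE MASTER TO MASTER_HOST='{master_host}'; "
               "START SLAVE;")
        subprocess.run(["mysql", f"--socket={socket}", "-e", sql], check=True)

    if __name__ == "__main__":
        id_one, id_two = unique_server_ids()
        patch_my_cnf("/etc/my1.cnf", id_one)    # hypothetical config paths
        patch_my_cnf("/etc/my2.cnf", id_two)

The master_host value for instance #1 would come from whatever discovery mechanism you settle on; a DNS registration done by each instance at boot is the obvious candidate.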

There is a real issue to be discussed…

Amazon's EC2 service is, by all accounts, brilliant. But one of the things that it lacks is any sort of assurance regarding data permanence. What I mean is: each machine that you turn on has 160GB of storage, but if that server instance is ever shut off the data is *lost* (not corrupted but GONE) and the next time you start that server instance it is back to the base image… you cannot save data on EC2 between reboots.

This is not an issue for everyone. But for people looking to run services on EC2 which require some sort of permanent storage solution (as in databases like MySQL or PostgreSQL) this is a big showstopper. I'll be putting some real thought in the next few days into how to sidestep this pitfall. I'm sure that it can be done, and I even have some ideas on how to do it. But I want to think them through and do a little research before I go blurting them out and making (more of) an ass out of myself 🙂

So… More on this later.

DK

Well as my first step towards actually *using* EC2

I'm configuring myself a VMware virtual machine running (well, *installing* right now) CentOS-4.4. Rather than worry about what to (and not to) install I've just opted for the 4-CD “Everything” method. Hey… it works ;). I will be using this virtual machine to work with the EC2 tools. (Yea… go figure. I work with Linux servers and BSD servers, but I don't have a Linux desktop; I have stuck, recently, with OS X and Windows.)

Yes… I know… Windows… But some things it’s just needed for. Blech

Soon I should have detailed accounts of running Amazon's preconfigured instances, followed by accounts of creating a custom image and sending that back to Amazon to run as an instance.

Traditional OPS VS Amazon AWS (part 3)

So now we know that we need to look at things in terms of pools or “stacks” of resources… But how? That's a good question! And from here on out… it gets technical (even though we're going to be talking in high level generalities).

Now let's take one step back and examine the tools that we have at our disposal:

  • Amazon EC2 – scalable processing power, temporary storage
  • Amazon S3 – Scalable permanent storage, no processing power
  • Amazon SQS – Scalable queueing of “things”

First we need a control structure. We need a way to programmatically interface with this potential pool of machines. And we need a way to, with as little hands on work as possible, bring up and down the services that we need. For our mechanism of communication we will use Amazon's Simple Queue Service. According to Amazon: “Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. By using Amazon SQS, developers can simply move data between distributed application components performing different tasks, without losing messages or requiring each component to be always available.”

We'll start large-scale and work our way down to the fine grained detail. First, our global queue structure: we'll have one global queue which will be used by our orchestrator to communicate with machines that have no assigned role yet, and then sub-queues which relate to a specific server.

   [ Global Queue --------------------------------------------------- ]
   [ Web1 Q ] [ Web2 Q ] [ Db1 Q ] [ Db2 Q ] [ Sp1 Q] [ Sp2 Q ] [ O Q ]

The sub-queues will be used for monitoring as well as for giving orders from the orchestrator, and the [O]rchestrator queue will be for subordinate servers to send messages back to the orchestrator.

Oh, yea, him! At this point we have one machine known as the orchestrator, and it will be acting as the brain of the operation. It will be the one that actually manages the servers — ideally they will require no intervention.

This orchestrator server will be configured to maintain a baseline number of each type of server at all times. It will monitor the vitals of the servers under its command, most important of which will be server load and server responsiveness. If the average load of a given set of servers goes above a pre-configured threshold, it will throw a signal down into the global queue asking for a new server of that type. If the average load drops below a given threshold, it will send a message down the specific queue asking that a server be decommissioned. Also, if a server is unresponsive it will call for a new server to be commissioned, and it will decommission the unresponsive server.

The last thing that the orchestrator will be responsible for is keeping a set number of spare servers idling on the global queue. The reason for this is responsiveness.

For example: if it takes 10 minutes for an instance to start up, and it's being started because the web servers' load is so high that you're seeing unacceptably slow response times, that's 10 extra minutes that you have to wait. But if you have a server idling in “give me some orders” mode, the switch happens in just a minute.

So your orchestrator calls for a new web server. First it creates a new local queue for Web3. It then sends a request down the global queue pipe that a machine should assume a web server role and attach to the Web3 queue. This is the command to commission a new web server.

Decommissioning the web server is basically the same, only in reverse. The orchestrator sends a message down the Web3 pipe asking for the server to halt. The server responds with an OK once it's finished its processing and is getting ready to actually shut down.
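As a sketch of what the orchestrator's commission/decommission logic might look like (using the modern boto3 SQS client purely for illustration; the queue names, message format, and load thresholds are hypothetical choices of mine):

    import boto3  # assumed SQS client library

    sqs = boto3.resource("sqs")
    GLOBAL_QUEUE = sqs.get_queue_by_name(QueueName="global")

    HIGH_LOAD, LOW_LOAD = 4.0, 0.5   # hypothetical load-average thresholds

    def commission(role: str, name: str) -> None:
        """Create the per-server queue, then ask an idle machine to take the role."""
        sqs.create_queue(QueueName=name)
        GLOBAL_QUEUE.send_message(MessageBody=f"assume-role {role} {name}")

    def decommission(name: str) -> None:
        """Ask a specific server, down its own queue, to finish up and halt."""
        sqs.get_queue_by_name(QueueName=name).send_message(MessageBody="halt")

    def balance(role: str, queues: list[str], average_load: float) -> None:
        """One pass of the orchestrator's decision loop for a single role."""
        if average_load > HIGH_LOAD:
            commission(role, f"{role}{len(queues) + 1}")   # e.g. "Web3"
        elif average_load < LOW_LOAD and len(queues) > 1:
            decommission(queues[-1])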

The rest of the magic happens inside the instance!

Since a server is kept at idle by the orchestrator, it's basically sitting there monitoring the global queue for commission commands. Once per X number of minutes (or seconds) it checks the queue for such a command, and when it receives one is when the magic actually happens. In this case it's a request for a web server on the Web3 queue. So the queue monitor runs a script designed to configure the machine to be a web server. The script grabs a copy of the proper document root, web server configuration, and any necessary packages from an external source (possibly an S3 storage container, possibly a Subversion or CVS server, possibly just rsyncing from a known source.) Once the local code and configuration have been updated, all the services required for running as a web server are started. Perhaps this is a) memcached, b) Apache. Perhaps this is a) Tomcat, b) Apache. Maybe it's just starting Apache. But that's not important. What's important is that the server just changed itself, dynamically, into whatever was needed. It will then send a message up the Web3 queue that it is ready for action.
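And a matching sketch of the idle instance's side of the conversation, using the same message format as the orchestrator sketch above; the configure-role.sh script is a hypothetical stand-in for the docroot/config/service work just described:

    import subprocess
    import time

    import boto3  # assumed SQS client library, as in the orchestrator sketch

    sqs = boto3.resource("sqs")
    GLOBAL_QUEUE = sqs.get_queue_by_name(QueueName="global")

    def wait_for_role() -> None:
        """Idle loop: poll the global queue until told to assume a role."""
        while True:
            for message in GLOBAL_QUEUE.receive_messages(WaitTimeSeconds=20):
                parts = message.body.split()
                if len(parts) == 3 and parts[0] == "assume-role":
                    message.delete()
                    become(role=parts[1], queue_name=parts[2])
                    return
            time.sleep(30)

    def become(role: str, queue_name: str) -> None:
        """Turn this blank instance into whatever was asked for."""
        # Hypothetical per-role setup script: pulls the docroot and config from
        # S3/Subversion/rsync and starts the needed services (e.g. Apache).
        subprocess.run(["/usr/local/bin/configure-role.sh", role], check=True)
        sqs.get_queue_by_name(QueueName=queue_name).send_message(
            MessageBody="ready for action")

    if __name__ == "__main__":
        wait_for_role()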

The orchestrator gets the message from Web3, perhaps registers it with a load balancer or in DNS somewhere, and then goes about its business.

On an order to decommission, Web3 waits for connections to Apache to cease (the orchestrator removed it from load balancing before the decommission request was sent), turns off Apache, turns off the supporting services, and then sends its log files out to some host somewhere (assuming you aren't doing something more intelligent with standardized logging.) Web3 puts a message in the global queue that it's halting. And it halts.

The orchestrator gets the message. Decommissions the Web3 Queue. And doesn’t think about it again until the next time that the web server load increases.

There is a *LOT* of detail that goes into setting a process like this up. And those details will change from server role to server role, and from organization to organization. But the general principle… Ought… To work 🙂

Ought? Well I haven’t tried it… … Yet

Traditional OPS VS Amazon AWS (part 2)

Our good friends at Amazon arrive on the scene. First they offered a stream of services which were interesting, but never really got your attention. But this last announcement. That one changes everything.

And it does change everything. Unfortunately people are thinking of the service in terms of their current ideas of scaling infrastructure. And I have a sneaking suspicion that it doesn’t work that way

The most obvious difference is bandwidth. You pay a definite premium on bandwidth when you use Amazon's web services, and EC2 is no exception. However, one more sinister, sneaky, and diabolical quirk lies in the way. You see, bandwidth is something that you may well be willing to pay a premium for. And it's also mostly circumventable with judicious use of static content hosting, compression, caching, and the like. But permanence is something that most people take for granted. And permanence is exactly what EC2 lacks. This is the caveat which will start to change how people think about operational resources. And probably for the better.

Before I go into my current theories regarding the overall usage of Amazon's web services to their fullest potential… which will be part 3… let's talk about how differently Amazon views all the things you've been taking for granted.

First let me state how the normal data-center works: it works via units. Most OPS departments consider the data-center operational unit to be one server. But most data-centers never reach the size and complexity of those attained by Amazon, Google, etc., and those companies have to think outside several very limiting boxes. Consider the planet as an analogy. It wasn't too long ago, in geological time frames, that the idea of running out of space (or any other resource) was seemingly absurd. However, fast forward a thousand or two years to the present day, and we're looking to tackle some very difficult problems head on: dealing with finite resources and a growing population. Just as it's a certainty that, eventually, man will outgrow Earth, so will any company outgrow its data-center, in two or three distinct phases.

Phase 1: We just need servers. At this point it's basically a “just get *something*” game. You have *very* limited resources… and you get whatever you can. Right now is NOT the time to be worrying about how to spend money, but how to make it!

Phase 2: After several iterations of Phase 1… the diversity of our data-center is just unmanageable! At this point you realize that you need uniformity in the data-center to keep scaling… and hopefully you can afford it. Start phasing out the old patchwork network of randomly configured machines in favor of shiny new identical machines. Enter a state of bliss for a good while.

Phase 3: We've just realized that we cannot depend on unified hardware for that long. Our supply of this configuration of machine just ran out! The idea of throwing out 200 servers so that you can buy 500 more (for a gain of 300 servers) for the sake of this idea of uniformity makes you sick. And purchasing a thousand so that they can sit in a warehouse for use later is ridiculous as well… you'll be running on 2 year old “new” servers in 2 years! That's daft! There has to be a better way!

And there is. At this point the big boys stop thinking of a data-center in terms of machines. Break the machine into smaller units: processor gigahertz, memory megabytes, disk gigabytes. Now lump all those together into stacks, and you have ?GHz + ?MB + ?GB distributed throughout your data-center. But you still don't have the complete solution. The last piece falls into place with virtualization. What this last piece of the puzzle allows you to do is divvy out those individual resources as you see fit, and on the fly.

And that's what Amazon has done. They've pooled their processor power into one giant stack. They've pooled their memory into another. Their hard disk space into another. And they found that in this way they can hand out slices of those stacks on demand. And that they can also reclaim those slices on demand. So if you want to take advantage of the infrastructure that Amazon is providing: stop thinking that a machine is a specific thing and only that thing.

You’re in the world of on-demand. Lets start thinking in terms of what that means and how to use it to your advantage!

And I’ll begin that thought process in part 3

Traditional OPS VS Amazon AWS (part 1)

Let's take a look at your traditional web application setup (during development, or early stages with no funding). You might find something like this: we have a web, database, and specialty server. The specialty server is probably doing whatever makes the company unique. For simplicity's sake we'll say it's just crunching data from the database.

The “Server Farm” “starting out”

+------------------------------
|-[ Web Server       ]
|-[ Database Server  ]
|-[ Specialty Server ]
+------------------------------

You bring in a little ad revenue, or a partner with a bit more money, and you expand… You double your capacity!

The “Server Farm” “expanded”

+------------------------------
|-[ Web Server       ]
|-[ Web Server       ]
|-[ Database Server  ]
|-[ Database Server  ]
|-[ Specialty Server ]
|-[ Specialty Server ]
+------------------------------

Eventually you end up with something like this:

  • 2 load balancers
  • 30 specialty servers
  • 16 web servers
  • 12 database servers

The above is a classic solution to the “we need more power” problem. And it's worked for a long time. However:

  • It’s grossly inefficient as far as computational resource usage goes

    Because you add on 20 new servers in *anticipation* of the load you will need in 4 months. So during month 1 you're using 10% of those servers, month 2 30%, month 3 60%, and you find out at month 4 that you're at about 120% utilization (hey, having the problem of being successful isn't the worst thing that could happen… but it sure is stressful!)

  • It's grossly inefficient as far as personnel resource usage goes

    Because your OPS team (or OPS guy, as the case may be) has a twofold problem: the first problem is that these machines are being purchased with months between them. And since there's very little money in a startup, they're being purchased on sale. Which means you get whatever you can afford. Which ultimately means that after a year you have 9 different machine builds with 7 different configurations, 2 processor types, 6 different network card types, many different RAM latencies, differences in CPU MHz, disk speed, chassis, power consumption, etc, etc, etc.

    Which means that "ballancing the resources properly" becomes a touch juggling act. That "diagnosing problems possibly due to hardware&quot becomes a long expensive drawn out process. Which means that installing OS’s on machines (for iether repairing dead drives, or repurposing old servers) means having to deal with an entire encyclopedia of quirks/exceptions/nuances/hinderances.

    Trust me, it's *very likely* that by the 4th batch of hardware bought you'll start having to deal with the “oh… yea… that's from the second batch… you have to do [whatever] before it'll work.” By the 8th batch it's a quagmire of re-learning all those old lessons.

  • It’s grossly inefficient in terms of resource utilization

    If you only use 20% of your web servers' abilities… except for 2 days a month when you're at 80% pushing 90%… you're losing a HUGE amount of valuable CPU cycles, RAM, and hard disk space. And what you don't want your OPS team to be doing is constantly repurposing machines. OPS people are human… eventually they'll make a mistake. And you won't like it. Plus they have important, time consuming jobs aside from having to spend 2 weeks every month re-imaging servers for a 2 day utilization spike, don't they?

Cue dramatic music

Enter the Amazon team

… to be continued