I’m always excited when we see something new for Amazon Web Services:

http://www.openfount.com/blog/s3infidisk-for-ec2

This certainly looks very interesting! I can’t help but wonder if the memory caching in the Enterprise version is enough to run small MySQL instances on. At the very least, being able to mysqldump regularly to a file directly on S3 would be useful, as opposed to the usual routine: mysqldump to a file, split it into chunks, and copy the chunks off to S3.
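For comparison, that multi-step routine looks something like the sketch below; `s3put` here is a hypothetical stand-in for whatever S3 upload tool you happen to use:

# Dump, compress, and split into chunk files, then ship each chunk to S3.
# "s3put" is a hypothetical placeholder for your S3 upload tool of choice.
mysqldump --all-databases | gzip | split -b 512M - backup.sql.gz.
for chunk in backup.sql.gz.*; do
    s3put my-backup-bucket "$chunk"
done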

Perhaps I’ll contact them next week and see if they’ll let me take it for a test drive?!

HA EC2 Part #3: What Happens Once You’re Inside the Cloud

On to what happens inside the cloud!

Since we’re looking to load balance inside the cloud, you might be tempted to ask: why not use the same sort of method we used for load balancing (well, at least fail-over) outside the cloud? And the answer is a resounding YOU CAN! But rather like a cooking show where you _could_ use water to hydrate something but could also “bring a little flavor to the party” by using chicken broth or wine, we can find ourselves with a mighty fine set of bonus features, if we’re willing to look past vanilla DNS round-robin load balancing.

Ok, I suppose before I go on I should address those of you still scratching your heads because I said you could use DNS for load balancing. I know you’re thinking that you’ll never achieve truly balanced load with this method, and you’re right! Like I said, more features await! And for everyone wondering why they should use a real load balancer if DNS is going to be “good enough” anyhow: remember that slight problem we mentioned about caching DNS servers? That problem would apply here as well, and while previously we couldn’t avoid it, here we can, so I see no reason not to. Plus it would be more of a headache here than there, because a load balancer which is *LEFT ALONE* is far less likely to fail than a web or database server which is constantly doing a great many things (and is subject to the whims of people in the process of development.) Finally, you get a single point at which to employ firewall configuration, instead of X points where X is the number of back-end web servers.

Now as far as load balancers go there seem to be two prevalent kinds: the TCP load balancer, and the proxy.

A TCP load balancer functions by IP address, protocol, and port. For example you may specify a group of servers as the back-end cluster for the IP address a.b.c.d port 80. And that’s all you can do with that IP address and that port. It avoids a lot of complication by not caring, in the slightest, why or how you got there, or whether that particular set of web servers is really what you want. For this reason it’s not possible to specify fancy things like “all requests on port 80 for example1.com go to cluster A, and example2.com to cluster B.” To do that you need another IP address, or to access the cluster on a different port (say a.b.c.d port 81). The former is OK when you have multiple IPs to work with and can share IPs between fail-over load balancers, but neither of those luxuries holds true in the EC2 environment. And the latter is fine if you are planning to do this for your development team, but if you’re trying to drive normal web traffic to port 81 it might end up being a little less than convenient for your users.
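To illustrate, here’s roughly what configuring a pure TCP balancer looks like with LVS’s ipvsadm (addresses made up); note that nothing in it can so much as see a hostname:

ipvsadm -A -t 1.2.3.4:80 -s rr                 # define the virtual service: just IP + port
ipvsadm -a -t 1.2.3.4:80 -r 10.0.0.1:80 -m     # add a real back-end server (NAT mode)
ipvsadm -a -t 1.2.3.4:80 -r 10.0.0.2:80 -m     # add another
# There is nowhere to say "Host: example2.com goes to cluster B" --
# it's purely IP, protocol, and port.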

Which is why I’m going to be focusing on the proxy.

Improperly configured, a reverse proxy is a sure ticket to trouble on an internal LAN… Fortunately we aren’t dealing with an internal LAN; we’re dealing with servers which are publicly available anyway. What you can get from a reverse proxy, though, are extra features. You get load balancing, you get fail-over, you get URL- and host-rule-based redirection, and you get Apache log file aggregation. You get it all… and, I might mention, at the very convenient low low price of FREE! WOO!

Um, er.. >clears throat< yea, never-mind that!

There are a couple of commonly known programs which can handle reverse proxying: Apache and Squid. But… let’s face it… both of those carry a more complicated setup process than necessary, and are like swatting at a fly with a cannonball. Sure they’ll work, but there are better alternatives for this. Two seemingly popular alternatives are pound and, the newcomer, perlbal. Both of these daemons offer better functionality than LVS (in the back-end server fail-over department), but which do we want to use?

The choice between the two is tough, and I can’t say that I’ve extensively used either, however pound does shine in three areas. First, pound has a notion of a SESSION and can even manage persistence/affinity (seemingly a LOT better than LVS manages it, I might add; LVS’s persistence tables *ARE* wiped for a server that goes down!) However, if you have a web application which was developed with a “shared nothing” approach (which is a GOOD thing) this benefit does not really apply, so it’ll be up to the next two to knock your socks off. Second, pound does SSL wrapping, taking that load off of your web servers (which is a good thing for responsiveness, isn’t it?). And finally, pound offers a logging mode to emulate the Apache combined log file format (both with and without virtual hosts), which puts pound in a class all by itself (and right up there with the hardware balancers, I think). If none of those features matter to you, or if (as is very possible) I’m wrong about perlbal’s feature set, then just pick one already (flip a coin, choose the one written in the language you prefer… or… hey… go read their docs and see which one you like better!)
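To make those three features concrete, here’s a minimal pound.cfg sketch, written out via a heredoc. The syntax is pound 2.x-style; the addresses, cert path, and exact LogLevel value are illustrative, so check pound(8) on your own system:

cat > /etc/pound/pound.cfg <<'EOF'
LogLevel 3                  # Apache combined-style logging (see pound(8) for the vhost variant)

ListenHTTP
    Address 0.0.0.0
    Port    80
End

ListenHTTPS
    Address 0.0.0.0
    Port    443
    Cert    "/etc/pound/server.pem"    # SSL wrapping happens here, not on the web servers
End

Service
    BackEnd
        Address 10.0.0.1    # a back-end web server
        Port    80
    End
    Session                 # persistence/affinity
        Type    COOKIE
        ID      "sessid"
        TTL     300
    End
End
EOF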

The only drawback that comes to mind right now about using this approach is that you will be making use of text configuration files, so some parsing and rebuilding (sketched below) will end up being necessary for the registering and de-registering of web servers. I’ll add more in the comments if and when I think of more…
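Here is one way that rebuild could look. The registry file and the “head” config fragment are hypothetical names of my own invention; pound only reads its config at startup, hence the restart:

#!/bin/bash
# Sketch: rebuild pound's Service section from a registry of live web
# servers, then bounce pound. /etc/pound/backends.txt (one IP per line)
# and pound-head.cfg (the static ListenHTTP/HTTPS part) are hypothetical.
{
    cat /etc/pound/pound-head.cfg
    echo "Service"
    while read -r ip; do
        printf '    BackEnd\n        Address %s\n        Port 80\n    End\n' "$ip"
    done < /etc/pound/backends.txt
    echo "End"
} > /etc/pound/pound.cfg
/etc/init.d/pound restart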

So there you have it.

Using a good DNS service (with a low TTL and a decent API) mixed with a decent reverse proxy, you have all the benefits of:

  • a load balancer
  • load balancer fail-over
  • a rules-based request redirector
  • a log consolidator
  • back-end server fail-over
  • and a single point for firewalling

While this hasn’t exactly been a HOWTO, or a TOASTER, I hope that it’s a pointer in the right direction for people who are looking to scale applications built on top of the Amazon EC2 (and SQS, and S3) services.

HA EC2 Part #2: Load Balancing the Load Balancer

Let’s first address the problem of the dynamic IP address on the load balancer, because it doesn’t matter how good your EC2-side setup is if your clients can no longer reach your load balancer after a reboot. This is further complicated because you normally want two load balancers to act as a fail-over pair in case one of them pops for some reason. Which means that we not only need to have the load balancers register with something somewhere, we also need a method of de-registering them if, for some reason, they fail. And since downed machines usually don’t do a good job of anything useful, we cannot count on them de-registering themselves unless we’re shutting them down manually. Which we don’t really plan on doing, now, do we?!

So here’s the long and short of the situation: some piece of it, some starting point, has to be outside the cloud. Now I know what you’re thinking: “but he just said we weren’t going to be talking about outside the cloud.” But no, no, I did not say that; I said that we weren’t going to be talking about running a full proxy outside the cloud. I read that the EC2 team is working on a better solution for all of this, but for right now it’s in a roll-your-own state, so let’s roll our own, shall we?

The basic building block of any web request is DNS. When you type in www.amazonaws.com your machine automagically checks with DNS servers somewhere, somehow, and eventually gets back an IP address like 72.21.206.80. Now there can be multiple steps in this process: for example, when we looked up www.amazonaws.com it *actually* pointed to rewrite.amazon.com, and finally rewrite.amazon.com pointed to 72.21.206.80. And this is a process we’re going to take advantage of. But first, some discussion on the possible ramifications of doing this:

DNS (discussed above) is a basic building block of how the internet works, and as such has had a dramatic amount of code written around it over the years. The one type of code which may cause us grief at this stage is the caching DNS server. Normally when you look up a name you’re asking your ISP’s DNS servers to look the name up for you, and since your ISP’s server doesn’t know the answer, it asks one of the root name servers which server on the internet handles naming for that domain. Once it finds that out it asks, a lot like this: “excuse me pdns1.ultradns.net, what is the address for rewrite.amazon.com?”, to which your ISP gets a reply a lot like: “The address for rewrite.amazon.com is 72.21.206.80, but that’s only valid for 5 minutes.” So for 5 minutes the DNS server is allowed to remember that information, and when you ask again after 4 minutes it doesn’t go to the source, it simply spouts off what it found out before. After 5 minutes, though, it’s supposed to check again… But some DNS servers ignore that amount of time (called a Time To Live, or TTL) and cache the reply for however long they feel like (hours, days, weeks?!) And when this happens a client might not get the right IP address if there has been a change and a naughty caching DNS server refuses to look it up again for another week.
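You can watch this TTL for yourself with dig; the second column of the answer is the remaining TTL in seconds (output trimmed and illustrative, using the 5-minute example from above):

$ dig +noall +answer rewrite.amazon.com
rewrite.amazon.com.    300    IN    A    72.21.206.80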

Alas, there is nothing we can do to fix that. I only mention it so that people don’t come knocking down my door yelling at me about a critical design flaw when it comes to edge cases. And to caution you: when your instance is a load balancer, it’s *ONLY* a load balancer. Don’t use it to run cron jobs. I don’t care if it’s got extra space and RAM, just leave it be. The fewer things happening on your load balancer, the fewer chances of something going wrong; the fewer things going wrong, the lower the chance of a new IP address; and if the IP address doesn’t change, we never hit the caching problem above, right? Right!

So when you choose a DNS service, choose one which meets the following criteria:

  • API, you need scriptable access to your DNS service
  • Low (1-2 minutes) TTL
    (so that when something changes you only have 60 or 120 seconds to wait)

Ideally you will have two load balancer images, LB1 and LB2 (for the sake of me not having to type long names every time). You can also do this dynamically (i.e. X number of load balancers off the same image), and if you’re a good enough scripter to be able to do it, then HOW to do it should be fairly obvious.

When LB1 starts up it will automatically register itself as lb1.example.com via your DNS provider’s API. It will then check for the existence of lb.example.com; if that’s not set, it will create it pointing at itself. If lb.example.com was previously set, it will perform a check (an HTTP GET, or even a ping) to make sure that LB2 (which is currently “active” at lb.example.com) is functional. If LB2 is not functional, LB1 registers itself as lb.example.com. LB2 performs the same startup sequence, but with lb1 and lb2 switched where necessary.

Now, at regular intervals (let’s say 60 seconds), LB1 checks the health of LB2 and vice versa. If something happens to one of them the other will, if necessary, register itself at lb.example.com.
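A rough sketch of that startup-plus-watchdog logic, as LB1 would run it. The dns_update function is a hypothetical stand-in for your DNS provider’s real API call, and the 169.254.169.254 URL is EC2’s instance metadata service:

#!/bin/bash
# Sketch of LB1's registration and watchdog loop. dns_update is a
# hypothetical wrapper around whatever API your DNS provider offers.
dns_update() {   # usage: dns_update <hostname> <ip>
    curl -s "https://api.your-dns-provider.example/update?host=$1&ip=$2"
}

MY_IP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
PEER=lb2.example.com

dns_update lb1.example.com "$MY_IP"    # always register our own name

# Claim the active name if it's unset, or if its current owner is dead
if ! host lb.example.com >/dev/null 2>&1 \
   || ! curl -fs --max-time 5 http://lb.example.com/ >/dev/null; then
    dns_update lb.example.com "$MY_IP"
fi

# Watchdog: take over the active name if the peer stops answering
while sleep 60; do
    if ! curl -fs --max-time 5 "http://$PEER/" >/dev/null; then
        dns_update lb.example.com "$MY_IP"
    fi
done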

Well, I think that basically covers how this would work outside the EC2 cloud; next I’ll deal with what happens inside the EC2 cloud. (That piece is not written yet… so it’ll take a bit longer than the last two.)

HA EC2 Part #1: Identifying the Challenges

I was recently asked to look into load balancing web servers on the Amazon Elastic Compute Cloud (EC2). Managing this presents some very interesting problems which need to be worked around. To look at the subject I’ll break it into 3 distinct pieces: #1: Identifying the Challenges (which you’re currently reading), #2: Load Balancing the Load Balancer, and finally #3: What Happens Once You’re Inside the Cloud. No promises as to how quickly I get these out 🙂

First let’s look at what this would normally entail:

You would have a data center, and a router which feeds into a DMZ. On the DMZ you would have a set of load balancers (either hardware or software), a set so that if one failed the other would take over its job. These load balancers have static IP addresses on the DMZ as well as on the LAN. They also have a shared IP address for which they are the balancers. When one goes down the other takes over the IP address. In a hardware solution this might be accomplished in a fairly elegant and network-invisible way. In a software solution it normally entails using IP aliases and forcibly updating the ARP cache on the router.

So the load balancers are the bridge between the DMZ and the LAN. On the LAN, with the load balancers, are a group of web servers, also with static IP addresses. There is monitoring functionality on the load balancer which detects if a web server is no longer available. When that happens the load balancer updates an internal table and no longer sends requests to that particular web server. When the web server becomes available again the load balancer detects this, updates those internal tables, and begins sending requests to the server once more. All of that happens with varying levels of complexity.

As for the web server’s reply, there are multiple possible configurations. The web server may reply to the load balancer, and the load balancer then handles getting the proper response from your data center to the client (a full reverse proxy). The web server might also reply directly to the client through a network route (in Linux Virtual Server (LVS) terms this is called “Direct Routing” (LVS-DR)):

  [ WAN ]                                      -> [ Server ]
  [ ROUTER ]                                  |-> [ Server ]
  [ DMZ ] <-> [ Load Balancer ] <-> [ LAN ] <-+-> [ Server ]
                                              |-> [ Server ]
                                               -> [ Server ]

The first thing that jumps out at me is that there is one key assumption in the above setup possibilities: everything is able to obtain a static IP address. That is, every time a given machine goes down, it comes back up at the same IP address. This is not true of the EC2 service. Your EC2 instances are dynamically allocated new IP addresses (and host names) each time they are started (and, consequently, each time they are restarted). So…

  • No static IP for the load balancer
  • No static IP for the web servers

Which means that on top of the challenges of installing and configuring a normal software load balancing solution, there are severalfold more challenges to overcome to be “successful” in your endeavor.

  • You need to notify your clients if the load balancer address has changed
  • You need to notify your web servers if the load balancer address has changed
  • You need to notify your load balancer if the address of a web server has changed

Now you could, technically, circumvent the first of these challenges by housing the load balancer outside of the EC2 cloud. However, this doesn’t make a whole lot of sense, seeing as you would end up paying twice for all the bandwidth consumed (you would pay for the incoming request at the load balancer, then for making the same request to a web server, then for the reply from the web server to the load balancer, and finally for the reply from the load balancer to the client). So for the sake of this little mental pushup we’ll not consider that a viable option, only worth mentioning (and we have, so now that that’s over…)

SVN + RoR = Passive Version Controlled Goodness!

While working with both Rails and Subversion (which I like using to keep my projects under version control) I was irritated by having to go through and add or delete a bunch of files when using the code generation tools. Especially when first putting the project together, there always seemed to be 6 new files to manually add before every commit… So I wrote a script to handle the adding of new, and removing of missing, files for a commit.

#!/bin/bash
# Unversioned directories to leave alone (svn's "?" lines for log/, logs/, tmp/)
IGNORE_DIRS='^\?[ ]+(log|logs|tmp)$'
# Split loop items on newlines only, so paths containing spaces survive
IFS="
"
# svn add every unknown ("?") file that isn't in an ignored directory
for i in $(svn st | grep -Ev "$IGNORE_DIRS" | grep -E "^\?"); do
    i=$(echo "$i" | sed -E 's/^\?[ ]+//')
    svn add "$i"
done
# svn delete every missing ("!") file
for i in $(svn st | grep -E "^!"); do
    i=$(echo "$i" | sed -E 's/^![ ]+//')
    svn del "$i"
done

Now I just ./rail_svn.sh and then svn ci, and everything is always version controlled. Very nice. The only thing you have to watch out for is leaving stray files laying around (I’ve had a couple commits which, along with code, also involved removing a vim .swp file or two).

I would be willing to bet that this script would be a decent foundation for a passively version-controlled-directory system, if anyone wanted to do something like that with svn (think mail stores, or home directories, or anything in which files or directories are added or removed often). Something like this is mainly needed because svn was designed to be an active version control system.

QMAIL-TOASTER remote redelivery loop problem

I recently switched from my old Gentoo server to a new FC5 server. I opted to go with a qmail-toaster setup because, while I’m perfectly capable of manually building my desired qmail+vpopmail setup, I just didn’t want to spend the personal time doing it. So I figured I would give the toaster project a try. And I have to say that I’m fairly impressed.

A lot of the core technological things it does are done in basically the same way that I would have done them manually (which is bidirectionally gratifying for me), and there are some bells and whistles that are *nice* but that I wouldn’t have bothered setting up on my own (e.g. qmailmrtg graphical log analysis).

I did (hopefully did and not still do) have one oddball problem with it. After switching over, there were certain servers from which I would continuously get the same message over and over. Everything in my logs showed a successful delivery, and it’s not as though the messages were stuck in my queue either; the remote servers would actually reconnect and deliver the message again.

Well, for a while I had better things to do with my scant time than deal with this one inconvenient (but not critical) issue. But today I finally cracked, probably because I’ve now gotten one particular message something on the order of 30 times. Thinking about the problem and examining my logs, it seemed that this only happened when a message was processed by simscan for viruses (clamd) and spam (spamd) at the SMTP transmission level. But that was not the complete story, because other messages from other servers did not have this problem even though they went through simscan as well.

On a hunch I figured that the sending mail server was probably only designed to wait X number of seconds after the finished transmission before expecting a status code back from my SMTP daemon. If that takes too long, the remote sending server might just assume the connection was lost and re-queue the message for redelivery. So I disabled spam and virus scanning in simscan:

#echo ":clam=no,spam=no,spam_hits=12,attach=.mp3:.src:.bat:.pif" \
  > /var/qmail/control/simcontrol
# /var/qmail/bin/simscanmk
# /var/qmail/bin/simscanmk -g
# qmailctl restart

And the problem *seems* to have gone away. I’m not worried about viruses at this point because I’m running OS X on my desktop, and Thunderbird is usually pretty good about spam… so… no loss for me there.

I’m mainly writing this down here so that if someone were to have this problem, and were floundering while searching for an answer, they might have a better chance of finding a helpful hint. Searching Google for things like redelivery and mail loops will yield nothing of any value at present.

Cheers
DK

Series: CRM on S3 & EC2, Part 1

Danny de Wit wrote in with a request for collaboration on how to best use EC2 and S3 for his new Ruby on Rails CRM application. And I’m happy to oblige.

At this point I don’t know much about what he’s doing, so I hope to start rough, open a dialogue with him, and work through the exercise over a bit of time.

The story so far

We have a Rails front end, a database backend, EC2, and S3.

Well… that was a quick rundown…

Summary of what we will need to accomplish the task on S3 and EC2

First off, we will need to be able to think outside the traditional boxes. But I think Danny is open to that. Second, we will need to deal with the database itself. Third, we have to deal with the issue of dynamic IP addresses. Fourth, we have to deal with some interesting administrative glue (monitoring, alerting, responding). Fifth, we have to deal with backups. And finally, we have to deal with code distribution.

Now, where do we start?

First we should start with the database. I won’t lie to you: most of the challenge in using these services will be centered around the database. We need to examine how it’s laid out, how it’s accessed, and what our expectations are when it comes to size. Specifically, we need to look for two main things: A) bottlenecks, and B) data partitioning strategies.

Bottlenecks. We have to examine where we may or may not have trouble as far as data replication goes. If we are making hourly backups and have to bring up another server at the half-hour mark, we’re going to need a strategy in place to bring the data store up to date. The layout of the database can make this particularly prohibitive, or it can make it very easy. And besides… having a bunch of servers doesn’t help if they can’t stay in sync.

Data partitioning. It’s easy to say “later on we’ll just distribute the data between multiple servers,” but unless you’ve planned for a layout which supports this you might have a particularly difficult time doing so without major reworking of your application. Data partitioning can be your friend in the speed department as well. If you’re thoughtful about HOW you store your data, you can use the layout itself to your advantage. For example, a good schema might actively promote load balancing, where a bad schema will cause excessive load on particular segments. A good schema will actually act as an implied index for your data, where a bad schema will require excessive sorting and indexing.

So what now?

So, Danny, the ball is in your court. You have my e-mail address. You have my blog address. Let’s get together and talk database before we move forward into the glue.

Random Musing: Blurring the Line Between Storage and Database?

As food for thought…

If you had a table `items`

  • itemId char(40),
  • itemName varchar(128),

Another table `tags`

  • tagId char(40),
  • tagName char(40),

And a third table `owners`

  • ownerId char(40),
  • ownerUsername char(40),
  • ownerPassword varchar(128),

It would theoretically be possible to have an S3 bucket ItemsToTags inside which you put empty objects named (ownerId)-(itemId)-(tagId), and a TagsToItems S3 bucket inside which you put empty objects named (ownerId)-(tagId)-(itemId). It would then be possible to use the “Listing Keys Hierarchically using Prefix and Delimiter” method of accessing your S3 buckets to quickly determine what items belong to a tag for an owner, and what tags belong to an item for an owner. You would be taking advantage of the fact that “there is no limit to the number of objects that one bucket can hold, and no impact on performance when using many buckets versus just a few buckets. You could reasonably store all of your objects in a single bucket, or organize them across several different buckets.” (Both of the above are quotes taken directly from the S3 API docs provided by Amazon themselves.)
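A sketch of what such a lookup might look like. I’m using today’s aws command line tool for concreteness; any signed GET on the bucket with prefix and delimiter query parameters does the same job, and the bucket name and IDs here are made up:

# List every tag on item 4f1c... owned by 9a2b... (IDs are placeholders).
# Keys are named (ownerId)-(itemId)-(tagId), so a prefix listing returns
# exactly the matching tagId keys; the IDs themselves contain no "-".
aws s3api list-objects --bucket ItemsToTags \
    --prefix "9a2b-4f1c-" --delimiter "-" \
    --query "Contents[].Key"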

Using this method it would be possible, I think, to use the S3 datastore in a VERY cheap manner and avoid having to deal with the massive cost of maintaining these kinds of indexes in an RDBMS or on your own filesystems… Interesting. And since the data could be *anything*, and you have a many-to-many relationship here by default, you could theoretically store *anything* and sort by tags…

Granted, to find a tag related to multiple items you would have to make multiple requests and weed out the differences. But if you’re only talking on the order of 2 or 3 tags per piece of data… it might just be feasible…

Now… Throw in an EC2 front end and an SQS interface… interesting…

Makes me wonder what the cost and speed would be (and whether it would be an acceptable tradeoff for not having to maintain a massive database cluster).

Disclaimer: this is a random musing. I’m not advising that anybody actually do this…

Where should AmazonAWS go next?

We have SQS, we have S3, and we have EC2, so what’s next from the Amazon AWS team?

There is really only one piece of the puzzle missing… And it’s a piece that has a lot of people griping. I have a strong hunch that Amazon is working on the problem, because I have a strong hunch that it is (or was) one of their major hurdles. That problem is the database service.

How do you provide an easy-to-use interface to relational, lookup-able storage? How do you make it universal? How do you make it secure? How do you make it FAST?

The first 3 questions are all answerable in roughly the same way: make it a service, and let the service handle the interface, security, and universality. They’ve successfully applied the web service to messaging, storage, and CPU power; there’s no reason this wouldn’t be the final piece of the jigsaw puzzle. The last question carries the greatest problem, though. Allowing people to store data and run queries without the inevitable tanking of the server process would be a challenge, to say the least (artificial intelligence is no match for human stupidity, after all).

But that’s beside the point. If you break data down into two components, anchors and tags (that is, something is data, or something is data about the data), and provide a schema that works without collision problems and, more importantly, works both ways (finding tags related to an anchor, AND finding anchors related to a tag), you cover probably 90% of people’s needs in one fell swoop.

I’ve been thinking a lot about how to do this, lately, as I’ve been drowning in a sea of data myself which is easy to manage in one direction but difficult in the other while keeping the size of the whole thing down.

Not only would that give Amazon the ability to have its finger in basically every new technological cookie jar, but it would provide huge massive gigantic enormous amounts of data on what people really think about things. It would be an exceptional win for Amazon, I think, and could indeed be leveraged to a huge advantage in the marketplace. Because, as Netflix has shown us recently, reliably finding things which relate to other things is *big* business.

Is compute as a service for me?

Note to Nick: I haven’t forgotten your request and I’ll have something on that soon, but when I started in I found that I had something else to say about compute-on-demand (or compute-as-a-service; terms which I use somewhat interchangeably). So here it is. For all those people just jumping into a project, or considering restructuring a project around these new trends, I hope this helps. I would definitely not consider this (or anything else I write) a GUIDE per se, but food for thought.

We live in an interesting world now, because every new tech project has to ask itself at least one very introspective question: “is computing as a service the right thing for us?” And MAN is that ever a loaded question. At first blush the answer seems like a no-brainer: “of course it’s for us! we don’t want to pay for what we don’t use!” Which is, at the basest level, true. But the devil is always in the details…

So which pair of glasses do you have to approach this problem with? What are the consequences of choosing wrong? How do we do it? Slow down. First you need to put some thought into these two questions: “what do we do?” and “how do we do it?” Because the answers to those determine which road holds the path to success and which to failure.

Are you a media sharing service which houses a billion images and gets thousands more every second? Are you a news aggregator which houses millions of feeds and hundreds of millions of posts? Are you a stock tracking company which copes with continuous feeds of data for portions of the day? Are you a sports reporting company with five to twenty posts per day but hundreds of thousands of reads? Are you a modest blogger? Do you just like to tinker with projects?

As you can see, those are all very different environments with unique needs, stresses, and usage spreads. Writing a blog entry which addresses whether each possible type of business should or shouldn’t use on demand computing would be impractical, not to mention impossible. But for the web industry there are a couple of basic types of environments, “Sparse Write, Dense Read” and “Dense Write, Sparse Read”, with subtypes of “Data Dense” and “Data Sparse”.

Environment: Sparse Write, Dense Read

For a lot of web applications you’re really not dealing with a lot of data. If you’re running a content management system, or you’re a directory, you have a finite amount of data which, in comparison with the number of times it’s read, is written to fairly infrequently. (In this case “infrequently written” means that a database’s query cache is a useful optimization for you.) It’s also very likely that you will be able to take a snapshot of your data in this type of environment in a fairly convenient manner. Compute as a service is probably right up your alley, and here’s why.

You are likely to have very predictable times during which your reads (or your writes) spike, meaning that you can actively plan for, set up, and use on demand resources to their fullest potential. Remember that an on demand resource is not an instant problem solver. In the case of something like Amazon EC2 it can take 5, 10, or 15 minutes for the server you’ve requested to even become active. After the server is up, there has to be some process which gets all of the relevant data on it up to date. What this means is that you might be looking at half an hour before your 5 extra servers are ready to handle the 7:00am to 9:00am traffic spike that everyone getting to the office in the morning generates. With your service, that’s fine though. Just plan to turn the extra power on an hour early and turn it off half an hour after you expect the spike to be over. Wash, rinse, repeat.
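As a sketch, that schedule could be as simple as a pair of cron entries. The AMI ID is a placeholder, the teardown script is hypothetical, and ec2-run-instances is from Amazon’s EC2 command line tools:

# Append two capacity-planning entries to the current crontab:
# spin up 5 extra servers at 6:00, shed them again at 9:30.
( crontab -l
  echo '0 6 * * *   ec2-run-instances ami-12345678 -n 5'
  echo '30 9 * * *  /usr/local/bin/shed-spare-instances.sh'
) | crontab -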

Environment: Dense Write, Sparse Read

See, this is the more complicated of the two environments. Everyone and their mother knows how to build a database driven application which gets a few writes and a lot of reads, because that’s what your common RDBMS is built for. Think of it as being idiot-proofed out of the box 🙂 But when you have a backwards (as in usage, not as in technology) environment, all of a sudden you have a lot of “conventional wisdom” which isn’t so wise anymore (what do you mean a faster write server than read servers causes replication problems?) (what do you mean my uber-normalization is the problem?).

It’s in this type of environment that we really have to look at the subsets of data, because the proof really lies in the pudding, so to speak.

Sub Environment: Data Sparse

You work with a relatively small window of data in realtime. You may or may not get a request for all of the data you’re keeping continuously up to date, but you have to keep it that way or it’s your butt on the line, right? Well, you’re probably in luck. I think it’s fairly likely that your data size is relatively small; for example, you’re keeping a rolling 24-hour window of data updated. Likely there is a *LOT* of history kept, but that’s kept elsewhere. Once you’re done with the data you shove it right out the back end into another process and it gets handled there (that back end is likely a sparse write, sparse read environment which is extremely data dense; not for on demand computing (well, maybe, but that’s another blog post)).

For this environment compute as a service is probably going to be a godsend… if you can overcome one small, teensy-weensy, ever so small yet still important detail: the development team. Now, not all companies are going to have difficult development teams, but some do, and you simply cannot build an environment ripe for compute as a service without their cooperation, so be prepared whatever the case! You will likely be able to leverage hotcopy, or an LVM-style live-action backup, for insta-backups to your long term storage solution (or on-demand setup pool). You will likely be able to leverage the extra compute capacity for your peak load times. And everything will likely turn out OK, so long as you can get some of the crucial application details hammered out.

Sub Environment: Data Dense

I pity you. Compute as a service is probably not what you need. Cases may vary and, again, the devil is in the details. But you have a huge challenge ahead of you: building an environment where a server can be programmatically brought online and then caught up to date with the current compute pool in a time frame which makes doing it worthwhile at all. This is something I’m going to put a lot of thought into… note to self… But unless you have some bright ideas here (and if you do, please send them my way) you have basically one chance: data partitioning. Get yourself a VERY good DBA, and really REALLY plan out your data. If you put enough thought into it in the beginning, you have a chance of keeping the individual pieces of data down to a small enough (and distributed enough) level which just might lend itself to compute as a service in a very LARGE way (but we’re really talking about going WAY beyond the 10 or 20 allowed Amazon EC2 server instances here).

Uh, Ok, Enough about these different environments… what do I need to do to USE on demand computing?

Well, that’s a difficult question to answer in a generally useful way, so without getting too specific:

You, first and foremost, need to keep compute as a service in mind through every bit of your planning and execution stages. At every point in the set of long chains which make up your application you have to ask yourself “what happens if this goes away?” and plan for it.

A very close second: think pipes and hoses rather than chisel and stone. Each part of your environment should be as self contained as possible. When one hose springs a leak the solution is simple: replace the hose (and just bypass it in the meantime). But when you lose a section of your monolithic structure, things are a bit more complicated than that.

Finally, you need to understand that you will have to work at taking full advantage of compute as a service. Remember that you are going to have to put TIME and ENERGY into using this kind of a service. Nothing comes free, and even in operations everything has an equal and opposite reaction. If you want to avoid spending the time, energy, and money of maintaining a hardware infrastructure, you will have to put the same into avoiding one. But the benefits of doing so are real and tangible. Because when you’ve spent all of your time building an application which is fault tolerant, rather than an infrastructure which will fail, you invariably provide your user base with a more robust and reliable service.