SVN + RoR = Passive Version Controlled Goodness!

While working with both Rails and Subversion (which I like using to keep my projects under version control) I was irritated by having to go through and add or delete a bunch of files by hand after using the code generation tools. Especially when first putting a project together, there always seemed to be six new files to manually add before every commit… So I wrote a script to handle adding new files and removing missing ones before a commit.

#!/bin/bash
# Unversioned paths to skip (svn st prints "?" followed by six spaces, then the path)
IGNORE_DIRS="^\?[ ]{6}log$|^\?[ ]{6}logs$|^\?[ ]{6}tmp$"
# Split command output on newlines only, so paths containing spaces survive intact
IFS="
"
# Add anything svn doesn't know about yet (except the ignored dirs)
for i in $(svn st | grep -Ev "$IGNORE_DIRS" | grep -E "^\?")
  do
    i=$(echo "$i" | cut -d' ' -f7-1000)
    svn add "$i"
done
# Remove anything svn expects but can no longer find
for i in $(svn st | grep -E "^!")
  do
    i=$(echo "$i" | cut -d' ' -f7-1000)
    svn del "$i"
done

Now I just ./rail_svn.sh and then svn ci and everything is always version controlled. Very nice. The only thing you have to watch out for is leaving files lying around (I’ve had a couple of commits which, along with code, also involved removing a vim .swp file or two).

I would be willing to bet that this script would be a decent foundation for a passively version-controlled directory system, if anyone wanted to do something like that with svn (think mail stores, or home directories, or anything in which files or directories are added or removed often). This is mainly needed because svn was designed to be an active version control system.

Google’s new zen-like philosophy: Be the Internet

Picture the internet

Picture yourself as the internet

Now be the internet

At least that’s what I figure Google is doing. From dark fiber to databases. From word processing to spreadsheets. From blogs to email. From widgets to wikis. From web ads to web stats. From videos to version control. Google is looking to BE the internet. It makes sense, too. If you have to choose between spending the time to actively acquire new information in a space growing with the fervor of the internet versus passively already having the information, which would you choose? I’d choose passive ownership!

What better way to be able to search the world’s information than to have that information already passing through the very arteries of your network of its own volition. Google turns into the world’s most advanced stateful packet sniffer. Poof. Google is now no more than one relationship away from almost every single information resource on the planet.

What would Google have to do to complete the coup? First, “finish” and “polish” its products to get a large enough userbase on all of them for the viral-style-marketing-by-necessity “advertising” (but that’s not the right word) to work. Next, buy Amazon and eBay (?) outright. Third, work on not getting “acquired” by various governments itself (otherwise we will have something like a very dangerous big brother on our hands).

Is this a far-fetched conspiracy theory? Probably. But that’s what you get with a mind that doesn’t have an off button and “downtime” in the shower. Heh. Is it likely? Maybe not, maybe. But is it an intriguing idea? Definitely!

Ruby, Rails, and brute force

I’ve started teaching myself to work with Ruby on Rails. And since I’m a hands-on guy I tend to get myself into odd situations and have to brute force myself out. For example, last night I was dealing with trying to display a number of records from a database (there were only 3 records) and all I could get it to display was

###

Well, it turns out that when using embedded Ruby (eRuby) these two blocks of code are VERY different.

## This works
<% for i in Network.find(:all) %>
  <%= h i.name %>
<% end %>

and

## This doesn't
<%=
  for i in Network.find(:all)
    h i.name
  end
%>
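
The difference is that the first version writes out each <%= %> expression as it is evaluated, while the second writes out only the return value of the whole block, and a Ruby for loop returns the collection itself, so the h i.name values are simply discarded. If you really want it as a single output expression, something along these lines should work (a minimal sketch; map and join are plain Ruby, and h is the usual Rails escaping helper):

<%= Network.find(:all).map { |i| h(i.name) }.join(", ") %>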

Ahh, it’s always the little things in life which give you the greatest joy and the greatest frustration!

Series: CRM on S3 & EC2, Part 2

So we’ve touched a bit on what to look for in your database. The comments made were by no means specific, and the requirements will vary from place to place. But the underlying principles are what really matter there. Now let’s move on to something a bit more specific: backup.

There is an important caveat to this information: nobody has done this enough to really have a set of scalable, one-size-fits-all tools (or a tool chain) fit for the job… You’ll have to be OK with doing some in-house experimentation, and OK with the idea of maybe making a couple of missteps along the way. As is the case with any new (OK, new to YOU) technologies, there are some things you just have to learn as you go.

Setting up a system that is fault tolerant, and in which you manage your risks, requires balancing acceptable versus unacceptable trade-offs. Your main types of backups are as follows:

A) Simple local backup. Your old stand-by tar and his friends bzip2, gzip, and even compress. They’ve been doing backups since backups were backups (or almost, anyhow) and they are good friends. In this kind of a situation they aren’t the whole solution, but you can bet your butt that they’re a part of it.

B) Hard-copy backup. This isn’t what you want, but it’s worth mentioning. This kind of backup consists of hard disks, tapes, CDs, DVDs, etc., which are copied to and then physically removed from the machine. The advantage to this type of backup is that you can take it offsite in case of a local disaster, but in an EC2+S3 business there is no such thing as a local disaster. So if you, once per week/month/whatever, just copy down your latest backups from S3, that should suffice.

C) Copy elsewhere backup. This is going to be the bread and butter for the bulk of the solution. It’s not the entire solution, but it’s a fairly big piece. In this case S3 is your “elsewhere” (see the sketch after this list).

D) Streaming backups. Examples of streaming backups are MySQL’s replication, or pushing data into an Amazon SQS pipe for playback at a later point. Also a key player in what will surely be your eventual strategy.
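
As a rough illustration of the “copy elsewhere” flavor, here’s a minimal sketch that tars up a directory and pushes the result to S3. It assumes something like the aws-s3 Ruby gem’s interface, and the bucket name, path, and credentials are made up, so treat it as a starting point rather than a finished tool:

#!/usr/bin/env ruby
# Sketch: local tarball + "copy elsewhere" to S3 (assumes the aws-s3 gem)
require 'rubygems'
require 'aws/s3'

backup_name = "app-backup-#{Time.now.strftime('%Y%m%d%H%M%S')}.tar.bz2"
system("tar cjf /tmp/#{backup_name} /var/www/myapp") or abort "tar failed"

AWS::S3::Base.establish_connection!(
  :access_key_id     => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)
# 'my-backup-bucket' is a made-up name -- use your own bucket here
AWS::S3::S3Object.store(backup_name, open("/tmp/#{backup_name}"), 'my-backup-bucket')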

Well, that was fun. Learning just enough to be dangerous but not enough to actually do anything… and certainly not enough to answer the question. So let’s get to it.

You will have two distinct areas of backup which will be important to you: the UI end and the DB end. Both should be approached with different goals in mind, because their usage patterns end up being different.

The Front End

You’ve no doubt got a development environment set up somewhere, and as you make bug fixes to this environment, or add features, or change layouts to announce your IPO, or whatever, you need to push a snapshot to your servers *AND* any new servers you bring up need to have the new UI code and not the old UI code.

For the sake of argument, here, I’ll assume that you have an SVN or CVS server which holds your version-controlled code (you *ARE* using version control, right?). So your build process should, ideally, update the stable branch on your revision control server, and send out a message to your UI servers that an update is available. They should then download the new code from the RCS to a temporary directory, and once it’s there you pull the fast-move trick:

$ mv public_html public_html.$(date +%s) && mv public_html.new public_html

With this in place, all of your UI servers receive the message at the same time, and update at the same time. Any new server should have, in its startup scripts sometime after the network is brought up, a process which performs the above update before even bringing up your HTTP service.
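
A minimal sketch of that update step might look something like this (the repository URL and paths are made up for illustration; the same script can run from a startup hook or on receipt of the update message):

#!/usr/bin/env ruby
# Sketch: pull the stable branch into a temp dir, then swap it into place
STABLE_URL = 'http://svn.example.com/myapp/branches/stable'  # made-up repository URL
DOCROOT    = '/var/www/public_html'

tmp = "#{DOCROOT}.new"
system("svn export #{STABLE_URL} #{tmp}") or abort "export failed"

# The fast-move trick: keep the old tree around, stamped with the time it was retired
system("mv #{DOCROOT} #{DOCROOT}.#{Time.now.to_i} && mv #{tmp} #{DOCROOT}")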

And that was the easy part… Now for MySQL

As for MySQL, I’ve outlined my thoughts on that here already in my article MySQL on Amazon EC2 (my thoughts). Which options you choose here depends on a couple of things: first, the skill level of the people who will be implementing the entire procedure *AND* the skill level of the people who will be maintaining it (if those aren’t the same people). But one very serious word of caution: whatever you do, stop thinking of an EC2 instance as 160GB of space for MySQL and start thinking of it as 60GB (70GB MAX), because backing up something that you do not have the space to copy is a difficult task which normally requires bringing things offline — trust me on this.

My gut feeling is that owning or renting one physical server to be the write server for your database setup, something roughly equal to the specs of the EC2 virtual machine except with 320GB of disk space, would be your best bet for now. You could keep your replication logs around for the entire history of your database… for a while.

You should also keep one extra MySQL instance (on EC2 if you like) up and running for the sole purpose of being up to date. You would then periodically turn it off and copy the entire thing up to S3, so that when you had to restore a new instance you would simply copy those files down, assign the server-id, and let it suck everything new down via replication.

The gotcha here is that this won’t last forever… at least not on one database. There will come a time, if you get a “lot” of usage, when the process of downing a server, copying it, bringing it back up, and waiting for replication will become infeasible. It will eventually just stop adding up. It’s at that point you will have to make a couple of careful choices. If you have properly laid out your schema you can pull your single monolithic database apart, distribute it amongst several database clusters, and carry on as you have been. If you have properly laid out your schema in a different way you will be able to assign certain users to certain clusters and simply write a migration tool for moving users and their data around between database clusters. If you have not properly laid out your data you can choose whether to spend time and money re-working your application to make it right, or to spend time and money on buying big “enterprise class hardware” and give yourself time to make things right.

Unless you can truly count on being able to bleed money later on, you’ll want to consider your schema VERY CAREFULLY now. It will make all the difference. And if you end up with 2+TB of data which is completely unmanageable… well, don’t say I didn’t warn you… Those kinds of optimizations may seem silly now when you’re only expecting 5-25GB of data, but they won’t be silly in 2-4 years.

Series: CRM on S3 & EC2, Part 1

Danny de Wit wrote in with a request for collaboration on how to best use EC2 and S3 for his new Ruby on Rails CRM application. And I’m happy to oblige.

At this point I don’t know much about what he’s doing, so I hope to start rough, open a dialogue with him, and work through the exercise over a bit of time.

The story so far

We have a Rails front end, a database backend, EC2, and S3.

Well… that was a quick rundown…

Summary of what we will need to accomplish the task on S3 and EC2

First off, we will need to be able to think outside the traditional boxes. But I think Danny is open to that. Second, we will need to deal with the database itself. Third, we have to deal with the issue of dynamic IP addresses. Fourth, we have to deal with some interesting administrative glue (monitoring, alerting, responding). Fifth, we have to deal with backups. And finally, we have to deal with code distribution.

Now, Where do we start?

First we should start with the database. I won’t lie to you, most of the challenge in regards to using these services will be centered around the database. We need to examine how it’s laid out, how it’s accessed, and what our expectations are when it comes to size. Specifically, we need to look for two main things: A) bottlenecks, and B) data partitioning strategies.

Bottlenecks. We have to examine where we may or may not have trouble as far as data replication goes, because if we are making hourly backups and we have to bring up another server at the half-hour mark, we’re going to have to have a strategy in place to bring the data store up to date. The layout of the database can make this particularly prohibitive, or it can make this very easy. And besides… having a bunch of servers doesn’t help if they can’t stay in sync.

Data partitioning. It’s easy to say “later on we’ll just distribute the data between multiple servers,” but unless you’ve planned for a layout which supports this you might have a particularly difficult time doing so without major reworking of your application. Data partitioning can be your friend in the speed department as well. If you’re thoughtful about HOW you store your data you can use the layout itself to your advantage. For example, a good schema might actively promote load balancing where a bad schema will cause excessive load on particular segments. A good schema will actually act as an implied index for your data, and a bad schema will require excessive sorting and indexing.
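
To make that concrete with a toy example (the cluster names and the idea of hashing on an owner key are my own illustration, not anything Danny has committed to): if every table carries the owner’s id, that id can double as the thing that decides which database cluster a row lives on, which is exactly the implied-index effect a good schema gives you for free.

# Sketch: owner-keyed partitioning -- the schema itself tells you where data lives
require 'digest/sha1'

DB_CLUSTERS = ['db-cluster-01', 'db-cluster-02', 'db-cluster-03']  # made-up cluster names

# Every row is keyed by owner_id, so routing a query is just a hash away
def cluster_for(owner_id)
  DB_CLUSTERS[Digest::SHA1.hexdigest(owner_id.to_s).to_i(16) % DB_CLUSTERS.size]
end

# Moving one owner to another cluster later only means copying rows with that owner_id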

So what now?

So, Danny, the ball is in your court. You have my e-mail address. You have my blog address. Let’s get together and talk database before we move forward into the glue.

Random Musing: Blurring the Line Between Storage and Database?

As food for thought…

If you had a table `items`

  • itemId char(40),
  • itemName varchar(128),

Another table `tags`

  • tagId char(40),
  • tagName char(40),

And a third table `owners`

  • ownerId char(40),
  • ownerUsername char(40),
  • ownerPassword varchar(128),

It would theoretically be possible to have an S3 bucket ItemsToTags inside which you put empty objects named (ownerId)-(itemId)-(tagId), and a TagsToItems S3 bucket inside which you put empty objects named (ownerId)-(tagId)-(itemId). It would then be possible to use the Listing Keys Hierarchically using Prefix and Delimiter method of accessing your S3 buckets to quickly determine what items belong to a tag for an owner, and what tags belong to an item for an owner. You would be taking advantage of the fact that there is no limit to the number of objects that one bucket can hold, and no impact on performance when using many buckets versus just a few buckets. You could reasonably store all of your objects in a single bucket, or organize them across several different buckets. (Both the above links are to quotes taken directly from the S3 API docs provided by Amazon themselves.)
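
As a sketch of what the lookup side might look like (the ids are made up, and I’m assuming an interface along the lines of the aws-s3 Ruby gem, so the exact listing call is an assumption on my part):

# Sketch: "which items carry this tag for this owner?" via a prefix listing on S3 keys
require 'rubygems'
require 'aws/s3'

AWS::S3::Base.establish_connection!(
  :access_key_id     => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
)

owner_id, tag_id = 'o123', 't456'  # made-up ids

# Keys in TagsToItems are named (ownerId)-(tagId)-(itemId), so a prefix listing
# returns exactly the items tagged with tag_id for this owner
keys     = AWS::S3::Bucket.objects('TagsToItems', :prefix => "#{owner_id}-#{tag_id}-")
item_ids = keys.map { |obj| obj.key.split('-').last }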

Using this method it would be possible, I think, to use the S3 datastore in a VERY cheap manner and avoid having to deal with the massive cost of maintaining these kinds of indexes in an RDBMS or on your own filesystems… Interesting. And since the data could be *anything*, and you have a many-to-many relationship here by default, you could theoretically store *anything* and sort by tags…

Granted, to find a tag related to multiple items you would have to make multiple requests and weed out the diffs. But if you’re only talking on the order of 2 or 3 tags per piece of data… it might just be feasible…

Now… throw in an EC2 front end and an SQS interface… interesting…

Makes me wonder what the cost and speed would be (if it would be an acceptable tradeoff for not having to maintain a massive database cluster)

Disclaimer: this is a random musing. I’m not advising that anybody actually do this…

How S3 Fits in Comparison to Other Storage Solutions

So, recently Nick G. asked: “Since you’ve worked with S3 a good bit, I’d like to get your take on using a service like S3 compared to using a local instance (or cluster) of MogileFS?”

I’d like to interject here and mention that in this case “quite a bit” means I’ve used it in one application for data backup, at an early stage during which there was no good example (much less released interface) for using S3 with PHP code. So I wrote and distributed my own. I’m sure that it’s fallen into disuse and more active projects are likely to be favored. So that’s my “lots of experience”. Always take what I have to say with the appropriate amount of salt.

And my answer would be that each type of storage solution listed has both strengths and weaknesses, and determining which set best complements your application’s needs will tell you where you should invest. I would also throw another option into the pot: the SAN. While a SAN might not be in the range of your average garage tinkerer, it *is* in the range of medium or large startups with proper funding. I do however believe the question was geared more towards an “each of these versus S3” analysis, so that’s how I’ll approach the question.

But first… let me get this out of the way. S3 loses, by default, if you absolutely cannot live without block device access. Do not pass go, do not collect $200. It’s a weakness and you’ll have to be willing to accept it for what it is.

S3 Vs. SAN (Storage Area Network)

Your most tried and true contender in the mass storage market is probably the SAN. Those refrigerator-sized boxes which sit in warehouse-sized data-centers thirstily consuming vast amounts of electricity, and pushing the bits through slender, delicate, orange fiber-optic cables. The basic sales pitch surrounding any SAN these days is made up of the same points in varying degrees:

Expandability: No modern SAN would be complete without the promise of expandable storage. On a quick and dirty level, a SAN has a bunch of disks which it slices and dices into pieces and then glues those pieces together into a “logical unit”. So many, many hard drives become just one hard drive. However, keep in mind that you have to use a filesystem which supports dynamic expansion, and you almost always have to unmount the volume to accomplish it, to boot.

Backup: At a small cost of “twice what you were planning to pay plus $30,000 for the software” you ought to be able to, with any modern SAN, perform realtime on-the-fly backups. I would throw in negative commentary here, but I think the sales pitch bares its negative connotations in a fairly self-evident manner.

Clustering: You can have multiple machines accessing the same hard drive! Which is great, as long as you can set up and use a clustering filesystem. What they fail to tell you is that using a *normal*, non-cluster-aware FS will get you nothing but massive data corruption. So unless you plan on using some cookie cutter type system for accessing the storage, or are planning on spending big bucks on having it built for you… the clustering is going to be less than useful. Also, you cannot run multiple MySQL database instances on the same part of a shared disk like that, so get that idea out of your head too (disclaimer: I don’t know whether giving MySQL access to the raw partition fares any better in this case, but I somehow doubt it).

High availability/integrity: So long as you buy a bunch of extra hard drives for the machine, you can expect to handle failures of individual disks gracefully. That is, if the term gracefully includes running 25% slower for a couple of hours while bits get shifted around… and then again when the broken drive is replaced… But, no, you won’t lose your data.

Speed: Yeah… SANs are fricken fast… no doubt… SANs usually function on a dedicated fiber-optic network (the aforementioned delicate orange cables) so A) they don’t saturate your IP network, and B) they aren’t limited to its speed.

So how does S3 stack up against the SAN? Well, let’s see… Expandability: S3 has a SAN beat hands down, with not only implied expandability but also implied constriction: with S3 you pay for what you use.

Backup: Amazon guarantees data retention, no need to pay extra. Clustering: again, covered; providing that you have built your application to play nice in some way, there is no problem here.

High availability and integrity: Here there is more of a tradeoff, since a SAN write is guaranteed and immediately available, while S3 is write once, eventually stored. One of the hurdles with S3 is that it may take a while (an unknown period of time) for a file stored in S3 to become available globally, making it less than ideal to, say, host HTML generated by your CMS. That’s not to say that it’s impossible, but there may be an indeterminate period when you have a page linked to and only half your viewers can access it (you would think you could get around this by storing the data first and the index last, but there is no guarantee that the order in which items are sent is the order in which they will become available).

And finally speed: here the SAN wins out. You pay for bandwidth to connect to Amazon’s S3 service, and you can’t, and wouldn’t want to, pay the bills for a sustained multi-gigabit-per-second connection to S3 (ouch).

Therefore: if you can handle A) a small time-to-availability, B) non-block access, and C) a speed limited by the public internet connection, then S3 is probably a better choice. But for the total package… if you have the resources… the SAN is irreplaceable.

S3 Vs. NAS (Network Attached Storage)

The NAS is like the SAN’s little brother. It tends to function much the same as a SAN, but is usually A) put on the IP network, which can cause saturation and limits speed, B) not as robust in the H.A. and data integrity department, C) capped lower in its ultimate expandability, and D) a whole hell of a lot cheaper than a SAN.

So the NAS has carved out a well deserved niche in small businesses and some home offices because it provides a bunch of local storage at a much more reasonable price. We, therefore, cannot evaluate its pros and cons on the same points as we did the SAN. NAS boxes are often used to store large files locally in a shared manner: many clients mount the shared volume and are able to work collaboratively on the files stored there. And for this reason S3 is not even thinking about encroaching on the NAS space. First off, working on a 100MB CAD file over home DSL is not feasible the way it is on a NAS; it would be an awful thing to wait for 100MB to save at 12KB/sec. Period. Also, the idea of using multi-user accounting software to have two accountants in the records at the same time is basically impossible…

If you’re thinking about the NAS in a data-center type environment, I’m going to consider it lumped in with either the homegrown cluster solution (small NAS) or the SAN (large NAS)

So if you need a NAS… stick with a NAS. HOWEVER consider S3 as a very convenient, effective, and affordable alternative to something like tape based backup solutions for this data.

S3 Vs. HomeGrown Cluster Storage

The home grown clustering solution is an interesting one to tackle. NFS servers, or distributed filesystems (with or without local caching), or Samba servers, or NetWare servers, all with or without some sort of redundancy built in, and all with varying levels of support attached. And that’s your biggest challenge in this space: finding support.

You will have to build your application to take into account the eccentricities of the particular storage medium (varying levels of POSIX support, for example), but knowing what those quirks *are* will save you time, frustration, and gobs of money later on. Because if you’re using some random duct-taped solution that’s been all Mac’d out, it will probably do the trick — but what happens if the guy who designed it (and thus knows how all the pieces fit together) leaves the company or gets hit by a bus? Well… you’re probably out of luck. But with S3 you have a very large pool of people all rallying around one solution with one (OK, or two) access methods, and it simply is what it is.

There are really no surprises with S3, which is the first reason that it beats out the custom, tricked-out storage solution. The second reason is that there is no assembly required — except maybe figuring out which access library to use. No assembly means no administration. No administration means better code. Better code means getting sold like a hot video sharing company. Well… one can dream.

S3 Vs. Local Storage

Aside from the obvious block access and the up-to-SCSI speeds that local storage provides, it loses to S3 in almost every way.

It’s not expandable very far. It’s not very fail-safe. It’s not distributed. It requires some form of backup. It requires power, cooling, room, and physical administration. My advice: if you can skip the hard drive, you SHOULD.

S3 Vs. MogileFS

MogileFS is an interesting comer in this particular exercise-of-thought. It’s a kind of hybrid between the grow-your-own cluster and the local storage bit. It offers an intriguing combination of pros and cons, and is probably the most apples-to-apples comparison that can be made with S3 at all. Which makes me wish I’d had more of a chance to use it.

But the basic premise is that you have a distributed system which is easily expandable and handles data redundancy. My understanding is that you classify certain portions of the storage with a certain number of redundant copies required to be considered safe, and the data is stored on that many different nodes. When you make a request for a file you are returned a link in such a way as to distribute the read load among the various servers housing the data. You also have a built-in fail-safe for a node going down, and shouldn’t be handed a link to a file on a downed node.

So what does all that mean? Well, if you went about trying to build yourself a non-authenticated, in-house version of Amazon’s S3 service you would probably end up with something remarkably similar to MogileFS. I wouldn’t even be surprised to find out that S3 is modeled after Mogile. What’s more, Mogile has a proven track record when it comes to serving files for web-based applications.

So how do they actually compare? I would say that for a company deciding whether or not to use Mogile versus S3 it comes down to a couple of key factors: A) source and destination of traffic, B) type of files being distributed, and C) up-front investment.

As far as your traffic goes: if you’re planning on using Mogile primarily internally and data will rarely leave the LAN, then you will not be paying for the bandwidth costs associated with S3. That makes for a pretty simple solution. If you are distributing the files to a global audience, however, you might find that using S3 to pay for bandwidth costs along with handling local availability, delivery speed, and high availability is a win. However, I’d be fairly inclined to guarantee (as I’ve covered before) that raw bandwidth purchased from your ISP is a lot cheaper than from Amazon AWS; so long as you already have all the necessary equipment in place for redundancy, delivery, etc., Mogile’s advantages bring it within striking distance of S3.

If you are distributing primarily small files (images, etc.) then Mogile is not going to present any challenges. If, however, you are serving 100MB video files or 650MB CD images, Mogile might actually work against you. When I tried to use Mog for this kind of an application there was a limit on the size of an individual file that it was willing to transfer between hosts; in this respect Mog broke its own replication. DISCLAIMER: I only spent a week or so total with Mog (broken up into an hour here and an hour there), so this might have been a) known, or b) easily worked around, but my quick googling at the time yielded little help. The idea of having to split large files was a deal breaker at the time, and other things were pressing for my attention.

And the real thing that Mog does require which S3 does not is a hardware and manpower investment. Since you’re going to have to adapt your application in a similar manner to house data in either S3 or MogileFS, S3 wins out on sheer ease of setup… All you have to do is sign up for an AWS account, pop in a credit card number, and you’re on your way. That same hour. You also don’t run out of space with S3 like you can with Mog; granted, Mog can be easily expanded — but you have to put more hardware into it. S3 is simply already as large as you need it to be.

Summary

In the end, what these choices always come down to is some combination of the classic triangle: time vs. money vs. manpower. And what storage is right for you depends on how much of each you are willing to commit. Something always has to give. The main advantage of S3 is that you’re borrowing on the fact that Amazon has already committed a lot of time and hardware resources which you can leverage if the shoe fits.

More than likely what you’ll find is that the “fit” of something like S3 will be a seasonal thing. When you start out developing your application and you don’t have resources to throw at it, using S3 for your storage will make a lot of sense because you can avoid the whole issue of capacity planning and purchasing hardware with storage in mind. Then you will probably move into a quasi-funded mode where it is starting to get, or outright gets, too expensive to use S3 versus hiring an admin and throwing a couple of servers in a data-center. And then you might just come back full circle to a point when you’re drowning in physical administration, and spending a little extra for ease of use and peace of mind comes back into style.

So which is right for you? Probably all of the above, just at different times, for different uses, and for different reasons. The key to your success will likely lie in your ability to plan for when and where each type of storage is right. And to already have a path in mind for when it’s time to migrate.

Where should Amazon AWS go next?

We have SQS, we have S3, and we have EC2, so what’s next from the Amazon AWS team?

There is really only one piece of the puzzle missing… and it’s a piece that has a lot of people griping. I have a strong hunch that Amazon is working on the problem, because I have a strong hunch that it is (or was) one of their major hurdles. And that problem is the database service.

How do you provide an easy to use interface to relational lookup-able storage? How do you make it universal? How do you make it secure? How do you make it FAST?

The first three questions are all answerable in roughly the same way: make it a service, and let the service handle the interface, security, and universality. They’ve successfully applied the web service to messaging, storage, and CPU power; there’s no reason this wouldn’t be the final piece of the jigsaw puzzle. The last question carries the greatest problem, though. Allowing people to store data and run queries without the inevitable tanking of the server process would be a challenge, to say the least (artificial intelligence is no match for human stupidity, after all).

But that’s beside the point. If you break data down into two components, anchors and tags (that is, something is either data or data about the data), and provide a schema that works without collision problems and, more importantly, works both ways (finding tags related to an anchor, AND finding anchors related to a tag), you cover probably 90% of people’s needs in one fell swoop.

I’ve been thinking a lot about how to do this, lately, as I’ve been drowning in a sea of data myself which is easy to manage in one direction but difficult in the other while keeping the size of the whole thing down.

Not only would that provide Amazon with the ability to have its finger in basically every new technological cookie jar, BUT it would also provide huge, massive, gigantic, enormous amounts of data on what people really think about things. It would be an exceptional win for Amazon, I think, and could indeed be leveraged to a huge advantage in the marketplace. Because, as Netflix has shown us recently, reliably finding things which relate to other things is *big* business.

Is compute as a service for me?

Note to Nick: I haven’t forgotten your request and I’ll have something on that soon, but when I started in I found that I had something else to say about compute-on-demand (or compute-as-a-service, terms which I use somewhat interchangeably). So here it is. For all those people just jumping into a project or considering restructuring a project around these new trends, I hope this helps. I would definitely not consider this (or anything else I write) a GUIDE per se, but food for thought.

We live in an interesting world now, because every new tech project has to ask itself at least one very introspective question: “is computing as a service the right thing for us?” And MAN is that ever a loaded question. At first blush the answer seems like a no-brainer: “of course it’s for us! we don’t want to pay for what we don’t use!” Which is, at the basest level, true. But the devil is always in the details…

So which pair of glasses do you have to approach this problem with? What are the consequences of choosing wrong? How do we do it? Slow down. First you need to put some thought into these two questions: “what do we do?” and “how do we do it?” Because the answers are what determine which road holds the path to success and which to failure.

Are you a media sharing service which houses a billion images and gets thousands more every second? Are you a news aggregator which houses millions of feeds and hundreds of millions of posts? Are you a stock tracking company which copes with continuous feeds of data for portions of the day? Are you a sports reporting company which has five to twenty posts per day but hundreds of thousands of reads? Are you a modest blogger? Do you just like to tinker with projects?

As you can see, all of those are very complex environments with unique needs, stresses, and usage spreads. And writing a blog entry which addresses whether each possible type of business should or shouldn’t use on demand computing would be impractical, not to mention impossible. But for the web industry there are a couple of basic types of environments: “Sparse Write, Dense Read” and “Dense Write, Sparse Read”, with subtypes of “Data Dense” and “Data Sparse”.

Environment: Sparse Write, Dense Read

For a lot of web applications you’re really not dealing with a lot of data. If you’re running a content management system, or you’re a directory, you have a finite amount of data which, in comparison with the number of times it’s read, is written to fairly infrequently. (In this case “infrequently written” means that a database’s query cache is a useful optimization for you.) It’s also very likely that you will be able to take a snapshot of your data in this type of environment in a fairly convenient manner. Compute as a service is probably right up your alley, and here’s why.

You are likely to have very predictable times during which your reads (or your writes) spike, meaning that you can actively plan for, set up, and use on demand resources to their fullest potential. Remember that an on demand resource is not an instant problem solver. In the case of something like Amazon EC2 it can take 5, 10, or 15 minutes for the server you’ve requested to even become active. After the server is up there has to be some process which gets all of the relevant data on it up to date. What this means is that you might be looking at half an hour before your 5 extra servers are ready to handle the 7:00am to 9:00am traffic spike that everyone getting to the office in the morning generates. With your service, that’s fine though. Just plan to turn the extra power on an hour early and turn it off half an hour after you expect the spike to be over. Wash, rinse, repeat.
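
As a rough sketch of what “turn the extra power on an hour early” might look like in practice (the AMI id, instance count, and state file are made up, and I’m shelling out to Amazon’s EC2 command-line tools rather than any particular library):

#!/usr/bin/env ruby
# Sketch: cron this at ~6:00am ('up') and ~9:30am ('down') to bracket the morning spike
AMI_ID        = 'ami-12345678'   # made-up AMI id for a pre-built app-server image
EXTRA_SERVERS = 5
STATE_FILE    = '/var/run/spike_instances'

case ARGV.first
when 'up'
  # Ask EC2 for the extra instances well before the spike; they take a while to boot
  output = `ec2-run-instances #{AMI_ID} -n #{EXTRA_SERVERS}`
  File.open(STATE_FILE, 'w') { |f| f.puts output.scan(/\bi-\w+\b/).uniq }
when 'down'
  # Terminate only the instances we started for the spike
  ids = File.read(STATE_FILE).split
  system("ec2-terminate-instances #{ids.join(' ')}") unless ids.empty?
end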

Environment: Dense Write, Sparse Read

See, this is the more complicated of the two environments. Everyone and their mother knows how to build a database-driven application which gets a few writes and a lot of reads, because that’s what your common RDBMS is built for. Think of it as being idiot-proofed out of the box 🙂 But when you have a backwards (as in usage, not as in technology) environment, all of a sudden you have a lot of “conventional wisdom” which isn’t so wise anymore (what do you mean a faster write server than read servers causes replication problems? what do you mean my uber-normalization is the problem?).

It’s in this type of environment that we really have to look at the subsets of data, because the proof really lies in the pudding — so to speak.

Sub Environment: Data Sparse

You work with a relatively small window of data in realtime. You may or may not get a request for all of the data you’re keeping continuously up to date, but you have to keep it that way or it’s your butt on the line, right? Well, you’re probably in luck. I think it’s fairly likely that your data size is relatively small; for example, you’re keeping a window with a period of 24 hours of data updated. Likely there is a *LOT* of history kept, but that’s kept elsewhere. Once you’re done with the data you shove it right out the backend into another process and it gets handled there (that backend is likely a sparse write, sparse read environment which is extremely data dense — not for on demand computing (well, maybe, but that’s another blog post)).

For this environment compute as a service is probably going to be a godsend… if you can overcome one small, teensy-weensy, ever so small yet still important detail: the development team. Now, not all companies are going to have difficult development teams, but some do, and you simply cannot build an environment ripe for compute as a service without their cooperation, so be prepared whatever the case! You will likely be able to leverage hotcopy, or an LVM-style live-action backup, for insta-backups to your long term storage solution (or on-demand setup pool). You will likely be able to leverage the extra compute capacity for your peak load times. And everything will likely turn out OK. So long as you can get some of the crucial application details hammered out.

Sub Environment: Data Dense

I pity you. Compute as a service is probably not what you need. Cases may vary and, again, the devil is in the details. But you have a huge challenge ahead of you: building an environment where a server can be programmatically brought online and then caught up to date with the current compute pool in a time frame which makes even doing it a winning situation. This is something I’m going to put a lot of thought into… note to self… But unless you have some bright ideas here (and if you do, please send them my way) you have basically one chance: data partitioning. Get yourself a VERY good DBA, and really REALLY plan out your data. If you put enough thought into it in the beginning you have a chance to keep the individual pieces of data down to a small enough (and distributed enough) level which just might lend itself to compute as a service in a very LARGE way (but we’re really talking about going WAY beyond the 10 or 20 allowed Amazon EC2 server instances here).

Uh, OK, enough about these different environments… what do I need to do to USE on demand computing?

Well, that’s a difficult question to answer in a generally useful way, so without getting too specific:

You, first and foremost, need to think about compute as a service in every bit of your planning and execution stages. At every point in the long chains which make up your application you have to ask yourself “what happens if this goes away?” and plan for it.

A very close second is: think pipes and hoses rather than chisel and stone. Each part of your environment should be as self contained as possible. When one hose springs a leak the solution is simple: replace the hose (and just bypass it in the meantime). But when you lose a section of your monolithic structure, things are a bit more complicated than that.

Finally, you need to understand that you will have to work at taking full advantage of compute as a service. Remember that you are going to have to put TIME and ENERGY into using this kind of a service. Nothing comes free, and even in operations everything has an equal and opposite reaction. If you want to avoid spending the time, energy, and money maintaining a hardware infrastructure, you will have to put the same into avoiding one. But the benefits of doing so are real and tangible. Because when you’ve spent all of your time building an application which is fault tolerant, rather than building an infrastructure which will fail, you invariably provide your user base a more robust and reliable service.