Remember when I mentioned that we’d be seeing some disillusionment?

Well, this morning I found my first piece of it. A blog post (which I’ve since closed and don’t feel like finding again) mentioned that the “fly in the ointment” was that the author wouldn’t be able to load an exceptionally large database into the EC2 grid.

The reason for this complaint is the (apparent) 160GB limit on an image. Subtract about 4GB for the OS and you have roughly 156GB of space on which you could put a database. Now, to a lot of you that seems like a lot of data… and it is… but if you’re processing large amounts of data (research, social stuff, search, etc., etc.) 500GB, 1TB, 2TB, 4TB aren’t unheard of.

Scientific applications aside, the most likely place you’ll see people wanting databases larger than 160GB is probably going to be the web 2.0 startup group. And I really have to question the validity of their gripe. If you call in the heavy hitters at MySQL (and I know this because I’ve worked at a company where I *HAVE* called in the heavy hitters at MySQL) they’ll tell you, invariably, to spread out your data.

The reasons for spreading out your data are numerous, but the two big ones are physical scalability and speed. Let’s assume you’re developing your web app… and right now the data is 500MB. It’s fast, it’s responsive, life is wonderful. Even on your poorly designed monolithic schema. But you take off and start growing at 2x per week! Wonderful! Traffic is GOOD! … … … well…

  • 500MB
  • 1GB
  • 2GB
  • 4GB
  • 8GB
  • 16GB
  • 32GB
  • 64GB
  • 128GB
  • 256GB
  • 512GB
  • 1TB

Not only does it become increasingly difficult to find a place to store (and back up) your data… it’s gotten SLOWER… the indexes aren’t enough, things take a long time… and you dread running an ALTER TABLE more than a visit from your mother-in-law (disclaimer: I like my mother-in-law just fine).
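That doubling above is easy to sketch in a few lines of shell. The starting size and the “roughly 1TB” cutoff (rounded down to 1,000,000 MB) are just the illustrative numbers from this post:

```shell
# Starting at 500MB and doubling every week, how many weeks until the
# database crosses roughly 1TB (~1,000,000 MB)?
size_mb=500
weeks=0
while [ "$size_mb" -lt 1000000 ]; do
  size_mb=$((size_mb * 2))
  weeks=$((weeks + 1))
done
echo "$weeks weeks to ~1TB"   # prints: 11 weeks to ~1TB
```

Eleven weeks from “fits on a laptop” to “fills a rack” is the whole problem in a nutshell.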

The solution is to design your schema in such a way that breaking it into pieces is a natural process! If you do this you’ll be ready to maintain your users table in one database cluster on one set of hosts… the log files in another… and various pieces of your data in various places. How to develop a schema which can do this is a whole other blog post (one which I’ll be very happy to write up if someone requests it).
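To give a taste of the idea without writing that whole post: one common trick is to route each user to a database host by user id, so the users table can be split across machines without the application logic changing. A toy sketch (the host names and shard count here are entirely hypothetical):

```shell
# Pick a database host for a user by taking user_id modulo the number
# of shards. Adding capacity later means changing num_shards and
# migrating rows, not rewriting every query.
user_id=12345
num_shards=4
shard=$((user_id % num_shards))
db_host="db-users-${shard}.example.com"
echo "$db_host"   # prints: db-users-1.example.com
```

The point isn’t this particular scheme; it’s that the schema was designed so that “which host?” is a question with a cheap answer.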

The point is this: just because someone has a gripe with the service doesn’t necessarily mean that it’s well founded. Try to think through what’s being said for yourself — you’ll end up much happier that way.

Oh, and if the OP is reading this… Perhaps I’d be willing to comment/help on your schema? 😉

Condensing the steam (smoke) from this morning’s EC2 posts

Generally speaking there’s a bit of comparing EC2 to Sun’s Grid system… a couple of mentions of Oracle. A link to a screencast about EC2. A few people griping about the idea that (for now) Microsoft Windows might not be an option on EC2 systems…

The most interesting offerings I found this morning were: a look at the EC2 TOS and how unrestrictive it seems to be, a walkthrough of how setting up a new instance actually happens, and that screencast (not because it really shows anything… but because it shows something).

And a whole lot of people (like myself last night) regurgitating the specs and prices.

When you should (and should not) think about using Amazon EC2

The Amazon AWS team has done it again, and EC2 is generating quite the talk. Perhaps I’ve not been watching the blogosphere closely enough until now (very likely), but I’ve not really seen this much general excitement. The fervor I see going around is a lot like a kid at Christmas. You unwrap your present. IT’S A REMOTE CONTROL CAR. WOW! How cool! All of a sudden you have visions of chasing the neighborhood cats and drag racing your friends on the neighborhood sidewalks. After you open it (and the general euphoria of the ideas starts to fade) you realize: this is one of those cars that only turns one direction… And you just *know* that the next time you meet your best friend Bobby he’ll have a car that turns left *and* right.

I expect we’ll see some of this… A lot of the talk around the good old sphere is that AWS will be putting smaller hosting companies out of business. But that’s not going to happen unless they change their pricing model, which I doubt they will.

But before you all go getting your panties in a bunch when EC2 only turns left… remember that EC2 is a tool. Just like you wouldn’t use a hammer to cut cleanly through a board, EC2 is not meant for all purposes… The trick to making genuinely good use of EC2 will be playing to its strengths… and avoiding its weaknesses.

Let’s face it… the Achilles’ heel of all the rampant early-bird speculation is that the price of bandwidth on EC2 is rather high. Most hosting companies get you (with a low-end plan) 1000GB of transfer per month. Amazon charges $200 per month for that much transfer, whereas you can find low-end hosting for $60 and mid-range hosting for $150. Clearly this is not where EC2 excels, and I don’t think the AWS team intended for it to excel here. How big of a headache would it be to run the servers which host every web site on the planet? Bigger than Amazon wants, I’d wager.

What you *do* get at a *great* price is horsepower. For a mere $74.40/month (assuming 31 days) you get the equivalent of a 1.7GHz Xeon with 1.75GB RAM. That’s not bad!

But the real thrill comes with the understanding that additional servers can talk to each other over the network… for free. There is a private network which you can make use of. This turns into a utility computing atom bomb. If you can minimize the amount of bandwidth used getting data to and from the machine, while maximizing its CPU and RAM utilization, then you have a winning combination which can take full advantage of the EC2 architecture. And if your setup is already using Amazon’s S3 storage service… well… gravy.

Imagine running a site like, say, YouTube on EC2. The bandwidth bill would kill you. The simple matter of the situation is that YouTube uses too much bandwidth receiving and serving its users’ files. I’d have to imagine that the numbers for its bandwidth usage per month are staggering! But let’s break out the things YouTube has to manage, and where it might best utilize EC2 in its infrastructure.

YouTube gets files from its users, converts those files into FLVs, and then makes those FLVs available via the internet. You therefore have three main actions being performed: A) HTTP PUT, B) video conversion, and C) HTTP GET. If I were there, and in a position to evaluate where EC2 might prove useful, I would probably recommend the following changes to how things work:

First, all incoming files are uploaded directly to web servers running on EC2 AMIs. There’s no reason a file should be uploaded to a datacenter, re-uploaded to EC2, and then sent back down to the datacenter — that makes no sense. So users upload to EC2 servers.

Second, the EC2 compute farm is in charge of all video conversion. Video conversion is, typically, a high-memory and high-CPU process (as any video editor will tell you), and when they built their datacenter I can assure you that this weighed heavily on their minds. You don’t want to buy too many servers. You pay for them up front, and you pay for them on the back end as well. Not only do you purchase X number of servers for your compute farm, but you have to be able to run them, and that means rack space and power. Let me tell you, those two commodities are not cheap in datacenters. You do not want to have servers sitting around doing nothing unless you have to! So how many servers they purchase and provision every quarter has a lot to do with their expected usage. If they don’t purchase enough, users wait a long time for their requests to complete. Too many, and you’re throwing away your investors’ money (which they don’t particularly like). So the ability to turn compute-farm servers on and off only when they’re needed (and better yet: to pay for them only when they’re on) is a godsend. This will save oodles of cash in the long run.
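To put rough numbers on the turn-them-off argument, here’s a hypothetical back-of-the-envelope in shell. Only the $0.10/hr rate comes from EC2’s published pricing; the fleet size and busy hours are made up for illustration:

```shell
# 10 conversion boxes, 31-day month, $0.10/hr (10 cents) per box.
# Compare running them 24/7 against paying only for ~6 busy hours a day.
rate_cents=10
always_on=$((10 * 24 * 31 * rate_cents))   # cents: 24/7 fleet
on_demand=$((10 * 6 * 31 * rate_cents))    # cents: only while converting
echo "always-on: \$$((always_on / 100)), on-demand: \$$((on_demand / 100))"
# prints: always-on: $744, on-demand: $186
```

Even with made-up utilization numbers, the shape of the result is the point: pay-per-hour compute rewards bursty workloads like a conversion queue.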

At this point, as a side note, I would also advise keeping long-term backups of content in the S3 service, as well as removing rarely viewed content and storing it in S3 only. This would reduce the amount of space needed at any one time in the real, physical datacenter. Disks take up lots of power and lots of space, and you don’t want to pay for storage you don’t actually need. The tradeoff here is that transferring content from S3 back to the DC will cost some money, so it comes down to the cost of that versus the cost of running the storage hardware (or servers) yourselves. I would caution that you can move from S3 to a SAN, but moving from a SAN to S3 leaves you with a piece of junk which costs more than your house did ;D.
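To make that tradeoff concrete, a rough sketch using this post’s prices ($0.15/GB-month for disk, $0.20/GB for transfer) and an assumed 100GB of cold content — all figures illustrative, not quotes from Amazon:

```shell
# Keeping cold content on local disk costs you every single month;
# pulling it back out of S3 costs you once per restore.
gb=100
disk_cents_per_gb=15               # $0.15/GB-month on disk
xfer_cents_per_gb=20               # $0.20/GB to transfer back from S3
monthly_disk=$((gb * disk_cents_per_gb))   # recurring cost, in cents
one_restore=$((gb * xfer_cents_per_gb))    # one-time cost, in cents
echo "disk: \$$((monthly_disk / 100))/mo, restore: \$$((one_restore / 100)) each"
# prints: disk: $15/mo, restore: $20 each
```

In other words: if the archive gets restored less than about once a month, parking it in S3 wins.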

Third, the EC2 servers upload the converted video file and thumbnails to the primary (and real) datacenter, and it’s from there that YouTube viewers would download the actual content.

That setup is when you *DO* use Amazon’s new EC2 service. You’ve used the strengths of EC2 (unlimited horsepower at a very acceptable price) while avoiding its weaknesses (expensive bandwidth, and paying for long-term storage — unless S3 ends up being economical for what you do).

That said… there are plenty of places where you wouldn’t want to use EC2 in a project. Any time you’ll be generating excessive amounts of traffic, you’re losing money compared to a physical hosting solution.

In the end there’s a lot of hype, and there’s a lot of room for FUD and uninformed opinions (this blog post, for example, is an uninformed opinion — I’ve never used the service personally), and what people need to keep in mind is that not every problem needs this solution. I would argue that it’s very likely any organization could find one or (probably) more very good uses for EC2. But hosting your static content is not one of them. God help the first totally hosted EC2 user who gets majorly slashdotted ;).

I hope you found my uninformed public service announcement very informative. Remember to vote for me in the next election 😉

cheers
Apok

As a side note…

I’ve also subscribed to an Amazon EC2 search at PubSub, Technorati, and Feedster. I’ll be watching the three to see which yields the most A) unique and B) valuable results over time.

At the time of subscription:
PubSub: 0 New Hits, 0 Good, 0 Bad
Technorati: 20 New Hits, 15 Good, 5 Bad
Feedster: 10 New Hits, 9 Good, 1 Bad

Here are some links which, at a glance, appeared interesting:

Over at Maluke Co. the server specs are reported as being "an instance is roughly equivalent to a system with a 1.7Ghz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth (bursting to 1Gb)" — I have not verified this, but it seems to be corroborated by other blogs. That’s an interesting find.

Oh… And…

I would also expect that, to save money for their users, you’ll be seeing FreeBSD, Gentoo, Red Hat, Fedora, Ubuntu, and Debian mirrors by EC2 and for EC2 (i.e. no traffic allowed except inside the DC). And if Amazon is *SMART* they’ll provide this service to the distributions at no charge. This will, almost assuredly, spark a large amount of interest, and probably guarantee more users.

If not, the person who mirrors all these distros and lets people use them for $5.00 per month will make a buttload of cash 🙂 (hmm… that could be me! :))

Time will tell.

The internet is all abuzz about Amazon

And rightly so! Amazon has, quite simply, outdone themselves this time. However, before we go running through the streets, let’s make sure the emperor really does wear clothes.

A very common server configuration provided by hosting companies is: 1 server, 1GB RAM, 80GB HDD, 1000GB transfer. This runs $70–$80 per month for low-end hosting and $150–$200 per month for middle-of-the-road hosting. For the sake of our first comparison, let’s assume you use every last GB of your bandwidth every month, and not a drop more.

Amazon EC2


1 Server @ $0.10/hr (31*24*0.10) = $74.40/mo
80GB HDD @ $0.15/GB (80*0.15) = $12.00/mo
1000GB Xfer @ $0.20/GB (1000*0.20) = $200.00/mo


Total Cost Per Month: $286.40
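For anyone who wants to check the arithmetic, here’s the same bill recomputed in shell (everything in cents to stay integer-only):

```shell
# EC2 monthly bill for the comparison configuration, in cents.
server=$((31 * 24 * 10))         # $0.10/hr for a 31-day month
disk=$((80 * 15))                # 80GB at $0.15/GB
xfer=$((1000 * 20))              # 1000GB at $0.20/GB
total=$((server + disk + xfer))
printf '$%d.%02d/month\n' $((total / 100)) $((total % 100))
# prints: $286.40/month
```

Note how the transfer line dwarfs the other two — bandwidth is nearly 70% of the bill.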

That total is a little more costly than you’re used to, to be sure. Obviously, the less bandwidth you use, the less you pay. Where the *real* benefit of the EC2 service comes into play is in environments which are not bandwidth-intensive but CPU-intensive. How much of a benefit this provides remains to be seen. I’m unaware of the stats on these “machines”: the CPU horsepower and RAM (as well as RAM speed) available will determine how useful this service is. And if CPU and RAM aren’t directly applicable in the “MHz” and “GB” terms we’re used to dealing with, there will have to be some qualitative measure other than “feels pretty fast.”

I predict that there will soon be several strategies for limiting the amount of bandwidth these services use (as the main cost of the service is in raw bandwidth). Storing static content elsewhere will, I’m sure, be a key ingredient in those strategies.

I also predict that it will be a relatively short amount of time before we see some sort of Beowulf- or openMosix-style implementation geared directly at this service. Mark my words 🙂

More to come when I find out more — I’m *severely* interested in this… and I’m sorely disappointed that I missed the beta (probably by only a couple of hours, too…)

Bash wizardry: Command Line Switches

If you’re like me (and God help you if you are) you write a lot of bash scripts… When something comes up, bash is a VERY handy language to use because it’s a) portable (between almost all *nixes), b) lightweight, and c) flexible (thanks to the plethora of Unix commands which can be piped together). One large reason people prefer Perl (or some other language) is that it’s more flexible, and one of those cases is processing command line switches. Commonly, bash scripts are coded in a way which requires certain switches to be given as certain arguments to the script. This makes the script brittle: you CANNOT leave out switch $2 if you plan to use switch $3. Allow me to help you get around this rather nasty little inconvenience! (Note: this deals with SWITCHES ONLY, *NOT* switches with arguments!)

check_c_arg() {
  # $1 is the switch to look for; the remaining arguments are the
  # script's own arguments (pass them in with "$@").
  local needle="$1"
  shift
  for arg in "$@"
    do
      if [ "$arg" = "$needle" ]
        then
          return 1  # switch found
      fi
  done
  return 0          # switch not found
}

This beautiful little bit of code will allow you to take switches in ANY order. Simply set up a script like this:

#!/bin/bash
host="$1"

check_c_arg() {
  # $1 is the switch to look for; the remaining arguments are the
  # script's own arguments (pass them in with "$@").
  local needle="$1"
  shift
  for arg in "$@"
    do
      if [ "$arg" = "$needle" ]
        then
          return 1  # switch found
      fi
  done
  return 0          # switch not found
}

check_c_arg "-v" "$@"
cfg_verbose=$?
check_c_arg "-d" "$@"
cfg_dry_run=$?
check_c_arg "-h" "$@"
cfg_help=$?


if [ $cfg_help -eq 1 ]
  then
    echo -e "Usage: $0 <host> [-v] [-d] [-h]"
    echo -e "\t-v\tVerbose mode"
    echo -e "\t-d\tDry run (echo the command, do not run it)"
    echo -e "\t-h\tPrint this help message"
    exit 1
fi

if [ $cfg_dry_run -eq 1 ]
  then
    echo "ping -c 4 $host"
  else
    if [ $cfg_verbose -eq 1 ]
      then
        ping -c 4 "$host"
      else
        ping -c 4 "$host" >/dev/null 2>&1
    fi
fi

In the above, all of the following are valid invocations:

  • 127.0.0.1 -v -d
  • 127.0.0.1 -d -v
  • 127.0.0.1 -v
  • 127.0.0.1 -d
  • 127.0.0.1 -h
  • 127.0.0.1 -h -v -d
  • 127.0.0.1 -h -d -v
  • 127.0.0.1 -h -v
  • 127.0.0.1 -h -d
  • 127.0.0.1 -v -h -d
  • 127.0.0.1 -d -h -v
  • 127.0.0.1 -v -h
  • 127.0.0.1 -d -h
  • 127.0.0.1 -v -d -h
  • 127.0.0.1 -d -v -h

I hope this helps inspire people to take the easy (and oftentimes more correct) path when faced with a problem which requires a solution, but not necessarily a terribly complex one.

Cheers!
DK