Well as my first step towards actually *using* EC2

I’m configuring myself a VMWare virtual machine running (well, *installing* right now) CentOS-4.4. Rather than worry about what to (and not to) install I’ve just opted for the 4-CD “Everything” method. Hey… it works ;). I will be using this virtual machine to work with the EC2 tools. (Yea… go figure. I work with Linux servers and BSD servers, but I don’t have a Linux desktop. I’ve stuck, recently, with OSX and Windows.)

Yes… I know… Windows… But some things it’s just needed for. Blech.

Soon I should have detailed accounts of running Amazon’s preconfigured instances, followed by accounts of creating a custom image and sending that back to Amazon to run as an instance.

Traditional OPS VS Amazon AWS (part 3)

So now we know that we need to look at things in terms of pools or “stacks” of resources… but how? That’s a good question! And from here on out it gets technical (even though we’re going to be talking in high-level generalities).

Now let’s take one step back and examine the tools that we have at our disposal:

  • Amazon EC2 – scalable processing power, temporary storage
  • Amazon S3 – scalable permanent storage, no processing power
  • Amazon SQS – scalable queueing of “things”

First we need a control structure: a way to programmatically interface with this potential pool of machines, and a way to bring services up and down with as little hands-on work as possible. For our mechanism of communication we will use Amazon’s Simple Queue Service. According to Amazon: “Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. By using Amazon SQS, developers can simply move data between distributed application components performing different tasks, without losing messages or requiring each component to be always available.”

We’ll start large-scale and work our way down to the fine-grained detail, beginning with our global queue structure. We’ll have one global queue, which will be used by our orchestrator to communicate with machines that have no assigned role yet, and then sub-queues which each relate to a specific server.

   [ Global Queue --------------------------------------------------- ]
   [ Web1 Q ] [ Web2 Q ] [ Db1 Q ] [ Db2 Q ] [ Sp1 Q] [ Sp2 Q ] [ O Q ]

The sub-queues will be used by the orchestrator for monitoring as well as for giving orders, and the [O]rchestrator queue will be for subordinate servers to send messages back to the orchestrator.
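The layout above can be modeled with plain in-memory queues. Here’s a minimal sketch (the real thing would use SQS queues, of course; the queue names and message fields here are my own invention):

```python
from collections import deque

# One global queue for unassigned machines, plus one sub-queue per
# named server and one for messages back to the orchestrator.
queues = {"global": deque(), "orchestrator": deque()}

def create_server_queue(name):
    """Add a per-server sub-queue, e.g. 'web1' or 'db2'."""
    queues[name] = deque()

def send(queue_name, message):
    queues[queue_name].append(message)

def receive(queue_name):
    """Pop the oldest message, or None if the queue is empty."""
    q = queues[queue_name]
    return q.popleft() if q else None

for name in ("web1", "web2", "db1", "db2", "sp1", "sp2"):
    create_server_queue(name)

# The orchestrator asks any idle machine to become Web3.
send("global", {"cmd": "commission", "role": "web", "queue": "web3"})
```

Nothing about the topology changes when you swap the deques for real SQS queues; only the send/receive calls do.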

Oh, yea, him! At this point we have one machine known as the orchestrator, and it will be acting as the brain of the operation. It will be the one that actually manages the servers; ideally they will require no hands-on intervention.

This orchestrator server will be configured to maintain a baseline number of each type of server at all times. It will monitor the vitals of the servers under its command, the most important of which will be server load and server responsiveness. If the average load of a given set of servers goes above a pre-configured threshold, it will throw a signal down into the global queue asking for a new server of that type. If the average load drops below a given threshold, it will send a message down the specific queue asking that a server be decommissioned. Also, if a server is unresponsive, it will call for a new server to be commissioned and decommission the unresponsive one.
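In pseudo-Python, one pass of that monitoring loop might look like this (a sketch only; the thresholds and message formats are invented for illustration):

```python
HIGH_LOAD, LOW_LOAD = 4.0, 0.5  # illustrative thresholds

def check_role(role, loads, responsive):
    """Decide what, if anything, to ask for, given one role's vitals.

    loads      -- list of load averages, one per live server
    responsive -- list of booleans, one per live server
    Returns a list of (queue_name, message) pairs to send.
    """
    actions = []
    # Replace any unresponsive server: commission a new one via the
    # global queue, decommission the dead one via its own queue.
    for i, ok in enumerate(responsive):
        if not ok:
            actions.append(("global", {"cmd": "commission", "role": role}))
            actions.append((f"{role}{i + 1}", {"cmd": "decommission"}))
    avg = sum(loads) / len(loads)
    if avg > HIGH_LOAD:
        actions.append(("global", {"cmd": "commission", "role": role}))
    elif avg < LOW_LOAD and len(loads) > 1:
        # Stand down the last server of this role via its own queue.
        actions.append((f"{role}{len(loads)}", {"cmd": "decommission"}))
    return actions
```

The baseline-count bookkeeping is omitted here, but it would hang off the same decision function.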

The last thing that the orchestrator will be responsible for is keeping a set number of spare servers idling on the global queue. The reason for this is responsiveness.

For example: if it takes 10 minutes for an instance to start up, and it’s being started because the web servers’ load is so high that you’re seeing unacceptably slow response times, that’s 10 extra minutes that you have to wait. But if you have a server idling in “give me some orders” mode, the switch happens in just a minute.

So your orchestrator calls for a new web server. First it creates a new local queue for Web3. It then sends a request down the global queue pipe that a machine should assume a web server role and attach to the Web3 queue. This is the command to commission a new web server.

Decommissioning the web server is basically the same, only in reverse. The orchestrator sends a message down the Web3 pipe asking for the server to halt. The server responds with an OK once it’s finished its processing and is getting ready to actually shut down.
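Both halves of that handshake are tiny on the orchestrator’s side. A sketch, with a plain dict of lists standing in for SQS (function and field names are mine, not Amazon’s):

```python
def commission_web_server(queues, n):
    """Orchestrator side: create the Web-N sub-queue first, then drop
    a commission order into the global queue for any idle machine."""
    name = f"web{n}"
    queues[name] = []  # the new Web-N sub-queue
    queues["global"].append(
        {"cmd": "commission", "role": "web", "queue": name})
    return name

def decommission_web_server(queues, name):
    """Orchestrator side: halt orders go down the server's own pipe,
    not the global one."""
    queues[name].append({"cmd": "halt"})
```

The ordering matters: the sub-queue has to exist before the commission order goes out, or the new server has nothing to attach to.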

The rest of the magic happens inside the instance!

Since a server is kept at idle by the orchestrator, it’s basically sitting there monitoring the global queue for commission commands. Once per X number of minutes (or seconds) it checks the queue for such a command, and when it receives one is when the magic actually happens. In this case it’s a request for a web server on the Web3 queue. So the queue monitor runs a script designed to configure the machine to be a web server. The script grabs a copy of the proper document root, web server configuration, and any necessary packages from an external source (possibly an S3 storage container, possibly a Subversion or CVS server, possibly just rsync’ing from a known source). Once the local code and configuration have been updated, all the services required for running as a web server are started. Perhaps this is a) memcached, b) Apache. Perhaps this is a) Tomcat, b) Apache. Maybe it’s just starting Apache. But that’s not important. What’s important is that the server just changed itself, dynamically, into whatever was needed. It will then send a message up the Web3 queue that it is ready for action.
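The instance-side idle loop boils down to “poll, then hand off to a role script.” A sketch (here `run_role_script` is a stand-in for whatever actually fetches the document root and config and starts the services):

```python
def run_role_script(role):
    """Stand-in for the real work: pull code and config from S3 /
    svn / rsync, then start the role's services."""
    return f"{role} configured"

def idle_loop(poll, max_polls=None):
    """Check the global queue for a commission order; on receiving
    one, reconfigure and report which sub-queue to attach to."""
    polls = 0
    while max_polls is None or polls < max_polls:
        msg = poll()
        if msg and msg.get("cmd") == "commission":
            status = run_role_script(msg["role"])
            return {"queue": msg["queue"], "status": status}
        polls += 1  # a real instance would sleep between checks
    return None
```

`poll` is whatever “check the queue once” means in your setup; the `max_polls` cap exists only so the sketch can terminate.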

The orchestrator gets the message from Web3, perhaps registers it with a load balancer or in DNS somewhere, and then goes about its business.

On an order to decommission, Web3 waits for connections to Apache to cease (the orchestrator removed it from load balancing before the decommission request was sent), turns off Apache, turns off the supporting services, and then sends its log files out to some host somewhere (assuming you aren’t doing something more intelligent with standardized logging). Web3 puts a message in the global queue that it’s halting. And it halts.

The orchestrator gets the message. Decommissions the Web3 Queue. And doesn’t think about it again until the next time that the web server load increases.

There is a *LOT* of detail that goes into setting a process like this up. And those details will change from server role to server role, and from organization to organization. But the general principle… Ought… To work 🙂

Ought? Well I haven’t tried it… … Yet

Traditional OPS VS Amazon AWS (part 2)

Our good friends at Amazon arrive on the scene. First they offered a stream of services which were interesting, but never really got your attention. But this last announcement. That one changes everything.

And it does change everything. Unfortunately people are thinking of the service in terms of their current ideas of scaling infrastructure. And I have a sneaking suspicion that it doesn’t work that way

The most obvious difference is bandwidth. You pay a definite premium on bandwidth when you use Amazon’s web services, and EC2 is no exception. However, a more sinister, sneaky, and diabolical quirk lies in the way. You see, bandwidth is something that you may well be willing to pay a premium for, and it’s also mostly circumventable with judicious use of static content hosting, compression, caching, and the like. But permanence is something that most people take for granted, and permanence is exactly what EC2 lacks. This is the caveat which will start to change how people think about operational resources. And probably for the better.

Before I go into my current theories regarding the overall usage of Amazon’s web services to their fullest potential (which will be part 3), let’s talk about how differently Amazon views all the things you’ve been taking for granted.

First let me state how the normal data-center works: it works via units. Most OPS departments consider the data-center operational unit to be one server. But most data-centers never reach the size and complexity of those attained by Amazon, Google, etc., who have had to think outside several very limiting boxes. Consider the planet as an analogy. It wasn’t too long ago, in geological time frames, that the idea of running out of space (or any other resource) seemed absurd. Fast forward to the present day, however, and we’re looking to tackle some very difficult problems head on: dealing with finite resources and a growing population. Just as it’s a certainty that, eventually, man will outgrow Earth, so will any company outgrow its data-center, in two or three distinct phases.

Phase 1: We just need servers. At this point it’s basically a just-get-*something* game. You have *very* limited resources… and you get whatever you can. Right now is NOT the time to be worrying about how to spend money, but how to make it!

Phase 2: After several iterations of Phase 1, the diversity of our data-center is just unmanageable! At this point you realize that you need uniformity in the data-center to keep scaling… and hopefully you can afford it. Start phasing out the old patchwork network of randomly configured machines with shiny new identical machines. Enter a state of bliss for a good while.

Phase 3: We’ve just realized that we cannot depend on unified hardware for that long. Our supply of this configuration of machine just ran out! The idea of throwing out 200 servers so that you can buy 500 more (for a gain of 300 servers) for the sake of this idea of uniformity makes you sick. And purchasing a thousand so that they can sit in a warehouse for use later is ridiculous as well… you’ll be running on 2-year-old “new” servers in 2 years! That’s daft! There has to be a better way!

And there is. At this point the big boys stop thinking of a data-center in terms of machines. Break the machine into smaller units: processor gigahertz, memory megabytes, disk gigabytes. Now lump all those together into stacks, and you have ?Ghz + ?Mb + ?Gb distributed throughout your data-center. But you still don’t have the complete solution. The last piece falls into place with virtualization. What this last piece of the puzzle allows you to do is divvy out those individual resources as you see fit, and on the fly.

And that’s what Amazon has done. They’ve pooled their processor power into one giant stack. They’ve pooled their memory into another. Their hard disk space into another. And they found that in this way they can hand out slices of those stacks on demand, and that they can also reclaim those slices on demand. So if you want to take advantage of the infrastructure that Amazon is providing: stop thinking that a machine is a specific thing and only that thing.
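That pool-and-slice model can be sketched in a few lines (a toy model; the pool sizes are made up, and the slice numbers are EC2’s launch-era instance specs of 1.7GHz, 1.75GB RAM, and 160GB disk):

```python
# The data-center as three pools instead of N machines.
pool = {"cpu_ghz": 10_000.0, "ram_gb": 4_000.0, "disk_gb": 500_000}

def allocate(slice_):
    """Hand out a slice of each stack on demand; fail if exhausted."""
    if any(pool[k] < v for k, v in slice_.items()):
        raise RuntimeError("pool exhausted")
    for k, v in slice_.items():
        pool[k] -= v

def reclaim(slice_):
    """Return a slice to the stacks when its instance goes away."""
    for k, v in slice_.items():
        pool[k] += v

# One EC2-sized slice, carved out of the stacks.
small = {"cpu_ghz": 1.7, "ram_gb": 1.75, "disk_gb": 160}
allocate(small)
```

The point of the model is that allocate and reclaim are cheap and symmetric, which is exactly what a machine-shaped unit can never be.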

You’re in the world of on-demand. Let’s start thinking in terms of what that means and how to use it to your advantage!

And I’ll begin that thought process in part 3

Traditional OPS VS Amazon AWS (part 1)

Let’s take a look at your traditional web application setup (during development, or early stages with no funding). You might find something like this: a web, database, and specialty server. The specialty server is probably doing whatever makes the company unique. For simplicity’s sake we’ll say it’s just crunching data from the database.

The “Server Farm” “starting out”

+------------------------------
|-[ Web Server       ]
|-[ Database Server  ]
|-[ Specialty Server ]
+------------------------------

You bring in a little ad revenue, or a partner with a bit more money, and you expand… You double your capacity!

The “Server Farm” “expanded”

+------------------------------
|-[ Web Server       ]
|-[ Web Server       ]
|-[ Database Server  ]
|-[ Database Server  ]
|-[ Specialty Server ]
|-[ Specialty Server ]
+------------------------------

Eventually you end up with something like this:

  • 2 load balancers
  • 30 specialty servers
  • 16 web servers
  • 12 database servers

The above is a classic solution to the “we need more power” problem. And it’s worked for a long time. However:

  • It’s grossly inefficient as far as computational resource usage goes

    Because you add on 20 new servers in *anticipation* of the load you will need in 4 months. So during month 1 you’re using 10% of those servers, month 2 30%, month 3 60%, and you find out at month 4 that you’re at about 120% utilization (hey, having the problem of being successful isn’t the worst thing that could happen… but it sure is stressful!)

  • It’s grossly inefficient as far as personnel resource usage goes

    Because your OPS team (or OPS guy, as the case may be) has a twofold problem. The first part is that these machines are being purchased with months between them. And since there’s very little money in a startup they’re being purchased on sale, which means you get whatever you can afford. Which ultimately means that after a year you have 9 different machine builds with 7 different configurations, 2 processor types, 6 different network card types, many different RAM latencies, differences in CPU MHz, disk speed, chassis, power consumption, etc, etc, etc.

    Which means that “balancing the resources properly” becomes a tough juggling act. That “diagnosing problems possibly due to hardware” becomes a long, expensive, drawn-out process. Which means that installing OSes on machines (whether for repairing dead drives or for repurposing old servers) means having to deal with an entire encyclopedia of quirks/exceptions/nuances/hindrances.

    Trust me, it’s *very likely* that by the 4th batch of hardware bought you’ll start having to deal with the “oh… yea… that’s from the second batch… you have to do [whatever] before it’ll work.” By the 8th batch it’s a quagmire of re-learning all those old lessons.

  • It’s grossly inefficient in terms of resource utilization

    If you only use 20% of your web servers’ abilities… except for 2 days a month when you’re at 80% pushing 90%… you’re losing a HUGE amount of valuable CPU cycles, RAM, and hard disk space. And what you don’t want your OPS team to be doing is constantly repurposing machines. OPS people are human… eventually they’ll make a mistake. And you won’t like it. Plus they have important, time-consuming jobs aside from having to spend 2 weeks every month re-imaging servers for a 2-day utilization spike, don’t they?

Cue dramatic music

Enter the Amazon team

… to be continued

Learning a fact is easy, Learning to think is hard

Oftentimes something will happen throughout my day that sparks me to talk, yet again, about how to learn.

Most of the time when someone is considered “smart” it’s because they know a lot of things. Having a good memory is, yes, indicative of a smart person. But it’s not that uncommon to find people who can remember detail to the Nth degree yet aren’t very good thinkers.

And thinking is what makes a person smart.

Let’s be clear here: being able to read a book and then remember all of its contents is *nice*, but that does not make a person smart. What makes a person smart is being able to apply what’s in the book to varying situations. Read that last bit again. I didn’t say that remembering the book’s contents and being able to apply them was what makes a person smart; remembering wasn’t even a part of it. I also, specifically, mentioned varying situations. Being able to remember that a source (book, article, web site, etc.) touches on a subject is quite arguably more important than being able to remember what that source says about the subject. Why? I’m glad you asked.

Very few references (and most everything is a reference these days) tell you how to think about a subject. References simply give you information about a subject. I’ll refer to “knowing about a subject” as being “learned.” So a reference can make you a “learned” person, but it cannot make you an “intelligent” person. To be intelligent requires application, and that’s something that a reference simply cannot provide.

But to be able to look at one problem… let’s say… how many apples you can buy for $11… and to reach back into your “learning” and come up with the idea that cross multiplication can tell you how many: that’s “intelligent.” Even if you then have to go look up how to do it again, making that connection is the key to intelligence.

I’m sure that Mark Twain would agree with me that too many folk walk around proclaiming the virtue of intelligence when they, in fact, possess only the sin of regurgitation.

The bottom line is that if you desire to learn to be intelligent, stop trying to memorize books, and start looking for the relationships around you. How the grass relates to the rain. How the wind relates to the chimes. How the time of day relates to the temperature relates to the month. Those are intelligent thoughts to have. That the wind chimes sound in C minor is a learned, though not necessarily intelligent, thought.

Amazon EC2 – “oh shit bars”

In this article Isabel equates Amazon’s EC2 to a vehicle’s “oh shit bars,” which, I think, is a very valid use of the service. But let’s not overlook the dirty little secret of “web two point oh,” which is: there’s a *LOT* of data to be crunched.

There… I said it. See, people don’t want you to understand that it’s actually fairly hard to maintain a growing web 2.0 app. This whole idea of social networks generates a WHOLE LOT OF DATA, and storing it is less the problem than analyzing it.

You see, it’s not very good for making money to say “working with all this data is hard,” even when it should be. I think VCs (who pay most 2.0 paychecks, AFAIK) like to hear “not only is it innovative… it’s so easy I could train a monkey to do it.” Which is a load of crap, because if it were then *everyone* would be doing it, and it would be Web-17.5_release_candidate_14 by now. If you catch my drift.

Like I always say, a database at 5GB doesn’t behave the same way once it reaches 500GB, and at 5TB it’s another beast yet again. People talk about Amazon’s EC2 as a utility computing platform (which it is) and then describe it in terms of web hosting. I think that misses the boat entirely. Yes, you could use it for your web hosting needs, and I’m sure it will be good at that. But the gap that EC2 really fills isn’t that one.

Right now people are saying: “Look, it’s a new tree!”, and later on they’ll be wondering “where did this blasted forest come from?”, and further down the road still our children will be laughing at us, saying “It’s the Amazon, you old fogeys… of course it’s got trees!”

All this talk about Amazon EC2 and bandwidth

Let me share some perspective on what bandwidth numbers actually mean… because “1000GB” is a figure that doesn’t actually connect with anything in people’s minds.

1000GB/month means that in a 31-day month you are sending out (treating 1000 as 1K throughout, since that’s the prevalent convention):

  • 373KB per second
  • 22MB per minute
  • 1.3GB per hour
  • 32GB per day

or the ability to

  • broadcast a 256Kbit audio stream to 11.5 users around the clock
    (remember: 8 bits == 1 byte)
  • send out over 1,500 CD-ROM images
  • send out over 200 DVD images (single sided)
  • send out almost 286,000 MP3s (3.5MB each)

In other words… it’s a lot of bandwidth… assuming I’m not so tired that my numbers are flawed (which is very possible).
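For the skeptical, the arithmetic above is just straight division; here it is spelled out (1000-based units to match the list, with 650MB assumed for a CD-ROM):

```python
total_kb = 1000 * 1000 * 1000   # 1000GB expressed in KB (1000-based)
secs = 31 * 24 * 60 * 60        # seconds in a 31-day month

kb_per_sec = total_kb / secs    # ~373KB per second
gb_per_day = 1000 / 31          # ~32GB per day
streams = kb_per_sec * 8 / 256  # ~11.6 round-the-clock 256Kbit streams
cds = 1000 * 1000 / 650         # ~1,538 CD-ROM images at 650MB each
mp3s = 1000 * 1000 / 3.5        # ~285,714 MP3s at 3.5MB each
```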

So… at EC2, sending out 5KB/sec for the same month would cost $2.68. 50KB/sec @ $26.80. 500KB/sec @ … well, you get the idea.

One last comparison before I head off to bed… At EC2, for $160/month (comparable to a mid-level hosting provider), you get about 365GB transfer per month… or 136KB/sec for the entire month (or 4 256Kbit audio streams, 547 CDs, 54 DVDs, 101,714 MP3s, or one 365,000,000,000-letter text file ;))
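Those dollar figures follow from EC2’s launch pricing of $0.20 per GB transferred (that price is the only hard number assumed here):

```python
PRICE_PER_GB = 0.20             # EC2 data transfer price at launch, $/GB
SECS_PER_MONTH = 31 * 24 * 60 * 60

def monthly_cost(kb_per_sec):
    """Dollar cost of sustaining a given KB/sec for a 31-day month."""
    gb = kb_per_sec * SECS_PER_MONTH / 1_000_000  # KB -> GB, 1000-based
    return round(gb * PRICE_PER_GB, 2)
```

So `monthly_cost(5)` comes out at $2.68, and each factor of 10 in the rate just scales the bill by 10.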