Archive for October, 2006

Google’s new zen-like philosophy: Be the Internet

Random Thoughts | Posted by apokalyptik
Oct 31 2006

Picture the internet

Picture yourself as the internet

Now be the internet

At least that’s what I figure Google is doing. From dark fiber to databases. From word processing to spreadsheets. From blogs to email. From widgets to wikis. From web ads to web stats. From videos to version control. Google is looking to BE the internet. It makes sense too. If you have to choose between spending the time to actively acquire new information in a space growing with the fervor of the internet versus passively already having the information, which would you choose? I’d choose passive ownership!

What better way to be able to search the world’s information than to have that information already passing through the very arteries of your network of its own volition? Google turns into the world’s most advanced stateful packet sniffer. Poof. Google is now not more than one relationship away from almost every single information resource on the planet.

What would Google have to do to complete the coup? First, “finish” and “polish” its products to get a large enough userbase on all of them for the viral-style-marketing-by-necessity “advertising” (but that’s not the right word) to work. Next, buy Amazon and eBay (?) outright. Third, work on not getting “acquired” by various governments itself (otherwise we will have something like a very dangerous big brother on our hands).

Is this a far-fetched conspiracy theory? Probably. But that’s what you get with a mind that doesn’t have an off button and “downtime” in the shower. Heh. Is it likely? Maybe not, maybe. But is it an intriguing idea? Definitely!

Ruby, Rails, and brute force

Random Thoughts | Posted by apokalyptik
Oct 31 2006

I’ve started teaching myself to work with Ruby on Rails. And since I’m a hands-on guy I tend to get myself into odd situations and have to brute force myself out. For example, last night I was trying to display a number of records from a database (there were only 3 records) and all I could get it to display was

###

Well, it turns out that when using embedded Ruby (eRuby) these two blocks of code are VERY different.

## This works
<% for i in Network.find(:all) %>
  <%= h i.name %>
<% end %>

and

## This doesn’t (<%= %> prints the value of the whole for loop, which is the list of records, not each name)
<%=
  for i in Network.find(:all)
    h i.name
  end
%>

Ahh, it’s always the little things in life which give you the greatest joy and the greatest frustration!

Animated Gif Maker – Cool Indirect Ajax Tool

Random Thoughts, Software Development, Web Stuff | Posted by apokalyptik
Oct 20 2006

It’s definitely neat when someone can hack together something like the animated AJAX loading GIF generator in some spare time and release it to the world. I also think that the wake that AJAX is leaving has a lot of room for tools like this one. Very spiffy. Thanks!

BWUahahahahaha

Business, Funny Stuff, In The News, Security, Software Development, Web Stuff | Posted by apokalyptik
Oct 19 2006

Downloading a new browser: $0

Losing your old standby browser: $0

Hoping you can use your machine after the next reboot: $0

Getting to be the QA engineer for one of the richest companies ever: PRICELESS

QMAIL-TOASTER remote redelivery loop problem

Business, Linux, Personal, Security, Software Development, Web Stuff | Posted by apokalyptik
Oct 19 2006

I recently switched from my old gentoo server to a new FC5 server. I opted to go with a qmail-toaster setup because, while I’m perfectly capable of manually making my desired qmail+vpopmail setup, I just didn’t want to spend the personal time doing it. So I figured I would give the toaster project a try. And I have to say that I’m fairly impressed.

A lot of the core technological things that it did were done in basically the same way that I would have done them manually (which is bidirectionally gratifying for me) and there are some bells and whistles that are *nice* but I wouldn’t have bothered setting them up on my own (e.g. qmailmrtg graphical log analysis.)

I did (hopefully did and not still do) have one oddball problem with it. After switching over there were certain servers from which I would get the same message over and over. Everything in my logs showed a successful delivery, and it’s not as though the messages were stuck in my queue either; the remote servers would actually reconnect and deliver the message again.

Well, for a while I had better things to do with my scant time than deal with this one inconvenient (but not critical) issue. Today I finally cracked. It’s probably because I’ve gotten one particular message something on the order of 30 times now. Thinking about the problem and examining my logs, it seemed that the only time this happened was when a message was processed by simscan for viruses (clamd) and spam (spamd) at the SMTP transmission level. But that was not the complete story, because other messages from other servers did not have this problem even though they went through simscan as well.

On a hunch I figured that the sending mail server was probably only designed to wait X number of seconds (or microseconds) after the finished transmission before expecting to get a status code back from my SMTP daemon. If it takes too long then the remote sending server might just assume the connection was lost and re-queue the message for redelivery. So I disabled spam and virus scanning in simscan

# echo ":clam=no,spam=no,spam_hits=12,attach=.mp3:.src:.bat:.pif" \
  > /var/qmail/control/simcontrol
# /var/qmail/bin/simscanmk
# /var/qmail/bin/simscanmk -g
# qmailctl restart

And the problem *seems* to have gone away. I’m not worried about viruses at this point because I’m running OSX as my desktop, and Thunderbird is usually pretty good about spam… so… no loss for me there.
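The hunch is easier to see as a toy model. The sketch below only illustrates the reasoning; it is not how qmail, simscan, or any sending server is actually implemented, and the function name, the 30 and 45 second figures, and the retry count are all made up.

```python
def deliveries_seen(scan_seconds, sender_timeout, max_retries=5):
    """Count how many copies the receiver accepts in this toy model.

    The receiver always finishes scanning and accepts the message, but
    the sender only learns that if the reply arrives before its own
    timeout. Otherwise it re-queues and sends the same message again.
    """
    copies = 0
    for _attempt in range(1 + max_retries):
        copies += 1  # the receiver logs a successful delivery every time
        if scan_seconds <= sender_timeout:
            break    # the sender got its status code in time and stops
    return copies


# Hypothetical numbers: a sender that waits 30 seconds for the reply.
print(deliveries_seen(scan_seconds=5, sender_timeout=30))   # fast scan
print(deliveries_seen(scan_seconds=45, sender_timeout=30))  # slow scan
```

In this model the receiving side logs a clean delivery every time, which matches what my logs showed, while the impatient sender keeps coming back with the same message.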

I’m mainly writing this down here so that if someone were to have this problem, and were floundering while searching for an answer, they might have a better chance of finding a helpful hint. Searching for things like redelivery and mail loops on Google will yield nothing of any value at present.

Cheers
DK

Series: CRM on S3 & EC2, Part2

Amazon AWS, Business, MySQL, Random Thoughts, Security, Software Development, Web Stuff | Posted by apokalyptik
Oct 14 2006

So we’ve touched a bit on what to look for in your database. The comments made were by no means specific, and the requirements will vary from place to place. But the underlying principles are what are really important there. Now let’s move on to something a bit more specific. Backup.

There is an important caveat to this information: Nobody has done this enough to really have a set of scalable one-size-fits-all tools (or a tool chain) fit for the job… You’ll have to be OK with doing some in-house experimentation. And be OK with the idea of maybe making a couple of missteps along the way. As is the case with any new (OK, new to YOU) technology, there are some things you just have to learn as you go.

To set up a system that is fault tolerant, and to manage your risks, you have to balance acceptable against unacceptable trade-offs. Your main types of backups are as follows:

A) Simple local backup. Your old stand-by tar and his friends bzip2, gzip, and even compress. They’ve been doing backups since backups were backups (or almost anyhow) and they are good friends. In this kind of a situation they aren’t the whole solution but you can bet your butt that they’re a part of it.

B) Hard-copy backup. This isn’t what you want, but it’s worth mentioning. This kind of backup consists of hard disks, tapes, CDs, DVDs, etc., which are copied to and then physically removed from the machine. The advantage to this type of backup is that you can take them offsite in case of a local disaster, but in an EC2+S3 business there is no such thing as a local disaster. So if you, once per week/month/whatever, just copy down your latest backups from S3, that should suffice.

C) Copy elsewhere backup. This is going to be bread and butter for the bulk of the solution. It’s not the entire solution. But it’s a fairly big piece. In this case S3 is your “elsewhere”.

D) Streaming backups. Examples of streaming backups are MySQL’s replication, or pushing data into an Amazon SQS pipe for playback at a later point. Also a key player in what will surely be your ending strategy.
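Types A and C combine naturally: make the archive locally, then copy it elsewhere. Here is a minimal sketch of that pairing using only the Python standard library. The function name and the `upload` hook are my own inventions for illustration; the actual copy to S3 is deliberately left to whatever client you use.

```python
import os
import tarfile
import time


def make_backup(source_dir, backup_dir, upload=None):
    """Create a timestamped .tar.gz of source_dir inside backup_dir.

    `upload` is a hypothetical callable that takes the archive path and
    copies it "elsewhere" (for example to S3). It is left to the caller
    so this sketch depends on nothing outside the standard library.
    """
    os.makedirs(backup_dir, exist_ok=True)
    base = os.path.basename(os.path.normpath(source_dir))
    archive_path = os.path.join(backup_dir, "%s.%d.tar.gz" % (base, int(time.time())))
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps the paths inside the archive relative to the source
        archive.add(source_dir, arcname=base)
    if upload is not None:
        upload(archive_path)
    return archive_path
```

Keeping the upload step as a plain callable means the same function works for a cron job that only keeps local archives and for one that pushes every archive offsite.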

Well that was fun. Learning just enough to be dangerous but not enough to actually do anything… And certainly not enough to answer the question. So let’s get to it.

You will have two distinct areas of backup which will be important to you. You have the UI end, and the DB end. Both these sections should be approached with different goals in mind, because the usage pattern on them ends up being different.

The Front End

You’ve no doubt got a development environment set up somewhere, and as you make bug fixes to this environment, or add features, or change layouts to announce your IPO, or whatever, you need to push a snapshot to your servers *AND* any new servers you bring up need to have the new UI code and not the old UI code.

For the sake of argument, here, I’ll assume that you have an SVN or CVS server which holds your version-controlled code (you *ARE* using version control, right?). So your build process should, ideally, update the stable branch on your revision control server, and send out a message to your UI servers that an update is available. They should then download the new code from the revision control server to a temporary directory (public_html.new in the example below), and once it’s there you pull the fast-move trick:

$ mv public_html public_html.$(date +%s) && mv public_html.new public_html

At this point all of your UI servers have received the message at the same time, and updated at the same time. Any new server should have, in its startup scripts sometime after the network is brought up, a process which performs the above update before even bringing up your HTTP service.
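If the update step lives in a script rather than a shell one-liner, the same fast-move trick might look like the sketch below. The function name is hypothetical, and it assumes both directories sit on the same filesystem so that each rename is a quick metadata change rather than a copy.

```python
import os
import time


def swap_in(live_dir, new_dir):
    """Move live_dir aside under a timestamped name, then move new_dir into place.

    Returns the path the old code was moved to, so it can be kept for
    rollback or cleaned up later.
    """
    old_dir = "%s.%d" % (live_dir, int(time.time()))
    os.rename(live_dir, old_dir)   # mv public_html public_html.<epoch>
    os.rename(new_dir, live_dir)   # mv public_html.new public_html
    return old_dir
```

Just like the shell version, there is a very short window between the two renames in which the live directory does not exist, which is one more reason to run this before the HTTP service comes up on a new server.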

And that was the easy part… Now for MySQL

As for MySQL, I’ve outlined my thoughts on that here already in my article: MySQL on Amazon EC2 (my thoughts). Which options you choose here depends on a couple of things: first, the skill level of the people who will be implementing the entire procedure *AND* the skill level of the people who will be maintaining it (if those people aren’t the same people). But one very serious word of caution: whatever you do, stop thinking of an EC2 instance as 160GB of space for MySQL and start thinking of it as 60GB (70GB MAX), because backing up something that you do not have the space to copy is a difficult task which normally requires bringing things offline. Trust me on this.
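One way to arrive at numbers in that range is to insist that the disk can always hold the data plus one full copy of it. The arithmetic below is a back-of-the-envelope sketch; the reserved figures for the OS, logs, and temp files are guesses chosen for illustration, not measurements.

```python
def max_database_gb(disk_gb, reserved_gb):
    """Largest data set that still leaves room for a full copy of itself.

    disk_gb      total disk on the machine
    reserved_gb  space held back for the OS, logs and temp files
                 (the figures used below are guesses for illustration)
    """
    return (disk_gb - reserved_gb) / 2


print(max_database_gb(160, 20))  # 70.0
print(max_database_gb(160, 40))  # 60.0
```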

My gut feeling for you is that if you owned/rented one physical server to be the write server for your database setup, something roughly equal to the specs of the EC2 virtual machine except with 320GB of disk space, that would be your best bet for now. You could keep your replication logs around for the entire history of your database… for a while.

You also should keep one extra MySQL instance (on EC2 if you like) up and running for the sole purpose of being up to date. You would then periodically turn it off and copy the entire thing up to S3. So that when you had to restore a new instance you would simply copy those files down, assign the server-id, and let it suck everything new down via replication.

The gotcha here is that this won’t last forever… at least not on one database. There will come a time, if you get a “lot” of usage, when the process of downing a server, copying it, bringing it up, and waiting for replication will become infeasible. It will eventually just stop adding up. It’s at that point you will have to make a couple of careful choices. If you have properly laid out your schema you can pull your single monolithic database apart, distribute it amongst several database clusters, and carry on as you have been. If you have properly laid out your schema in a different way you will be able to assign certain users to certain clusters and simply write a migration tool for moving users and their data around between database clusters. If you have not properly laid out your data you can choose to spend time and money re-working your application to make it right, or you can spend time and money on buying big “enterprise class hardware” to give yourself time to make things right.
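The "assign certain users to certain clusters" option boils down to a lookup table plus a migration step. Here is a toy sketch of that idea; the class, its method names, and the `copy_rows` hook are all hypothetical, and a real version would keep the table somewhere durable rather than in memory.

```python
class UserDirectory:
    """Toy lookup table mapping each user to the cluster holding their data."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self.user_to_cluster = {}

    def assign(self, user_id):
        """Place a new user on whichever cluster currently has the fewest users."""
        counts = {name: 0 for name in self.clusters}
        for cluster in self.user_to_cluster.values():
            counts[cluster] += 1
        target = min(self.clusters, key=lambda name: counts[name])
        self.user_to_cluster[user_id] = target
        return target

    def cluster_for(self, user_id):
        return self.user_to_cluster[user_id]

    def migrate(self, user_id, new_cluster, copy_rows):
        """Move one user: copy their rows first, then flip the pointer.

        copy_rows(user_id, source, destination) is a hypothetical
        callable standing in for the real migration tool.
        """
        if new_cluster not in self.clusters:
            raise ValueError("unknown cluster: %r" % new_cluster)
        source = self.user_to_cluster[user_id]
        copy_rows(user_id, source, new_cluster)
        self.user_to_cluster[user_id] = new_cluster
```

This only works if every row in the schema can be traced back to exactly one user, which is the kind of layout decision that is cheap now and expensive later.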

Unless you can truly count on being able to bleed money later on, you’ll want to VERY CAREFULLY consider your schema now. It will make all the difference. And if you end up with 2+TB of data which is completely unmanageable… well, don’t say I didn’t warn you… Those kinds of optimizations may seem silly now when you’re only expecting 5-25GB of data, but they won’t be silly in 2-4 years.

Series: CRM on S3 & EC2, Part1

Amazon AWS, Business, Linux, MySQL, Random Thoughts, Software Development, Web Stuff | Posted by apokalyptik
Oct 11 2006

Danny de Wit wrote in with a request for collaboration on how to best use EC2 and S3 for his new Ruby On Rails CRM application. And I’m happy to oblige.

At this point I don’t know much about what he’s doing, so I hope to start rough, open a dialogue with him, and work through the exercise over a bit of time.

The story so far

We have a Rails front end, a database backend, EC2, and S3.

Well… that was a quick rundown…

Summary of what we will need to accomplish the task on S3 and EC2

First off, we will need to be able to think outside the traditional boxes. But I think Danny is open to that. Second, we will need to deal with the database itself. Third, we have to deal with the issue of dynamic IP addresses. Fourth, we have to deal with some interesting administrative glue (monitoring, alerting, responding). Fifth, we have to deal with backups. And finally, we have to deal with code distribution.

Now, Where do we start?

First we should start with the database. I won’t lie to you, most of the challenge in regards to using these services will be centered around the database. We need to examine how it’s laid out, how it’s accessed, and what our expectations are when it comes to size. Specifically what we need to look for are two main things: A) bottlenecks, and B) data partitioning strategies.

Bottlenecks. We have to examine where we may or may not have trouble as far as data replication goes. Because if we are making hourly backups and we have to bring up another server at the half hour mark, we’re going to have to have a strategy in place to bring the data store up to date. And the layout of the database can make this particularly prohibitive or it could make this very easy. And besides… having a bunch of servers doesn’t help if they can’t stay in sync.
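The half-hour problem is really "restore the last snapshot, then replay whatever happened since". A minimal sketch of that strategy follows. The snapshot and change-log structures are invented for the example; with MySQL the same roles would be played by a data dump and the replication log.

```python
def bring_up_to_date(snapshot, change_log):
    """Rebuild current state from an hourly snapshot plus the changes since.

    snapshot    {"position": int, "rows": dict} taken at the top of the hour
    change_log  ordered list of (position, key, value) writes; a value
                of None means the row was deleted
    """
    rows = dict(snapshot["rows"])
    position = snapshot["position"]
    for log_position, key, value in change_log:
        if log_position <= position:
            continue  # already contained in the snapshot
        if value is None:
            rows.pop(key, None)
        else:
            rows[key] = value
        position = log_position
    return {"position": position, "rows": rows}
```

The more the layout lets writes be replayed independently of each other, the cheaper this catch-up step gets, which is why the schema matters here.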

Data partitioning. It’s easy to say “later on we’ll just distribute the data between multiple servers” but unless you’ve planned for a layout which supports this you might have a particularly difficult time doing so without major reworking of your application. Data partitioning can also be your friend in the speed department. If you’re thoughtful about HOW you store your data you can use the layout itself to your advantage. For example a good schema might actively promote load balancing where a bad schema will cause excessive load on particular segments. A good schema will actually act as an implied index for your data, and a bad schema will require excessive sorting and indexing.

So what now?

So, Danny, the ball is in your court. You have my e-mail address. You have my blog address. Let’s get together and talk database before we move forward into the glue.

Random Musing: Blurring the Line Between Storage and Database?

Amazon AWS, Business, Funny Stuff, Linux, MySQL, Personal, Random Thoughts, Software Development, Web Stuff | Posted by apokalyptik
Oct 10 2006

As food for thought…

If you had a table `items`

  • itemId char(40),

  • itemName varchar(128),

Another table `tags`

  • tagId char(40),

  • tagName char(40),

And a third table `owners`

  • ownerId char(40),

  • ownerUsername char(40),
  • ownerPassword varchar(128),

It would theoretically be possible to have an S3 bucket ItemsToTags inside which you put empty objects named (ownerId)-(itemId)-(tagId), and a TagsToItems S3 bucket inside which you put empty objects named (ownerId)-(tagId)-(itemId). It would then be possible to use the Listing Keys Hierarchically using Prefix and Delimiter method of accessing your S3 buckets to quickly determine what items belong to a tag for an owner, and what tags belong to an item for an owner. You would be taking advantage of the fact that “There is no limit to the number of objects that one bucket can hold, and no impact on performance when using many buckets versus just a few buckets. You could reasonably store all of your objects in a single bucket, or organize them across several different buckets.” (Both of the above links are to quotes taken directly from the S3 API docs provided by Amazon themselves.)

Using this method it would be possible, I think, to use the S3 datastore in a VERY cheap manner and avoid having to deal with the massive cost of maintaining these kinds of indexes in an RDBMS or on your own filesystems… Interesting. And since the data could be *anything* and you have, by default, a many-to-many relationship here, you could theoretically store *anything* and sort by tags…

Granted, to find a tag related to multiple items you would have to make multiple requests and weed out the diffs. But if you’re only talking on the order of 2 or 3 tags per piece of data… it might just be feasible…
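To make the key-naming idea concrete, here is a small in-memory sketch. It does not talk to S3 at all: `FakeBucket` is a made-up stand-in whose `list_prefix` imitates a prefix listing over sorted keys, and all function names are mine. It also assumes the ids never contain a hyphen (true for something like 40-character hex ids), since the hyphen is the separator.

```python
import bisect


class FakeBucket:
    """In-memory stand-in for a bucket that holds empty, named objects."""

    def __init__(self):
        self.keys = []  # kept sorted, like a bucket listing

    def put(self, key):
        index = bisect.bisect_left(self.keys, key)
        if index == len(self.keys) or self.keys[index] != key:
            self.keys.insert(index, key)

    def list_prefix(self, prefix):
        """Return every key that starts with prefix, in sorted order."""
        start = bisect.bisect_left(self.keys, prefix)
        found = []
        for key in self.keys[start:]:
            if not key.startswith(prefix):
                break
            found.append(key)
        return found


items_to_tags = FakeBucket()
tags_to_items = FakeBucket()


def tag_item(owner_id, item_id, tag_id):
    # Each relation is written once per direction, as two empty objects.
    items_to_tags.put("%s-%s-%s" % (owner_id, item_id, tag_id))
    tags_to_items.put("%s-%s-%s" % (owner_id, tag_id, item_id))


def tags_for_item(owner_id, item_id):
    prefix = "%s-%s-" % (owner_id, item_id)
    return [key[len(prefix):] for key in items_to_tags.list_prefix(prefix)]


def items_for_tag(owner_id, tag_id):
    prefix = "%s-%s-" % (owner_id, tag_id)
    return [key[len(prefix):] for key in tags_to_items.list_prefix(prefix)]


def shared_tags(owner_id, item_ids):
    """One listing per item, then keep only the tags present in every result."""
    listings = [set(tags_for_item(owner_id, item_id)) for item_id in item_ids]
    return sorted(set.intersection(*listings)) if listings else []
```

The last function is the "multiple requests, then weed out the diffs" step: one listing per item and a set intersection on the client side. The same trick works in the other direction with `items_for_tag`.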

Now… Throw in an EC2 front end, and a SQS interface… interesting…

Makes me wonder what the cost and speed would be (if it would be an acceptable tradeoff for not having to maintain a massive database cluster)

Disclaimer: this is a random musing. I’m not advising that anybody actually do this…

How S3 Fits in Comparison to Other Storage Solutions

Random Thoughts | Posted by apokalyptik
Oct 10 2006

So, recently Nick G. asked “Since you’ve worked with S3 a good bit, I’d like to get your take on using a service like S3 compared to using a local instance (or cluster) of MogileFS?”

I’d like to interject here and mention that in this case “quite a bit” means I’ve used it in one application for data backup, at an early stage during which there was no good example (much less released interface) for using S3 with PHP code. So I wrote and distributed my own. I’m sure that it’s fallen into disuse and more active projects are likely to be favored. So that’s my “lots of experience”. Always take what I have to say with the appropriate amount of salt.

And my answer would be that each type of storage solution listed has both strengths and weaknesses, and determining which set best complements your application’s needs will tell you where you should invest. I would also throw another option into the pot: the SAN. While a SAN might not be in the range of your average garage tinkerer it *is* in the range of medium or large startups with proper funding. I do however believe the question was geared more towards an “each of these versus S3” analysis, so that’s how I’ll approach the question.

But first… Let me get this out of the way. S3 loses, by default, if you absolutely cannot live without block device access. Do not pass go, do not collect $200. It’s a weakness and you’ll have to be willing to accept it for what it is.

S3 Vs. SAN (Storage Area Network)

Your most tried and true contender in the mass storage market is probably the SAN. Those refrigerator sized boxes which sit in warehouse sized data-centers thirstily consuming vast amounts of electricity, and pushing the bits through slender delicate orange fiber-optic cables. Your basic sales pitch surrounding any SAN these days is made up of the same points in varying degree:

Expandability: No modern SAN would be complete without the promise of expandable storage. On a quick and dirty level a SAN has a bunch of disks which it slices and dices into pieces and then glues those pieces together into a “logical unit”. So many many hard drives become just one hard drive. However keep in mind that you have to use a filesystem which supports dynamic expansion, and you almost always have to dismount the volume to accomplish it to boot.

Backup: At a small cost of “twice what you were planning to pay plus $30,000 for the software” you ought to be able to, with any modern SAN, perform realtime on-the-fly backups. I would throw in negative commentary here, but I think the sales pitch bears its negative connotations in a fairly self-evident manner.

Clustering: You can have multiple machines accessing the same hard drive! Which is great as long as you can set up and use a clustering filesystem. What they fail to tell you is that using a *normal* non-cluster aware FS will get you nothing but massive data corruption. So unless you plan on using some cookie cutter type system for accessing the storage, and are planning on spending big bucks on having it built for you… the clustering is going to be less than useful. Also you cannot run multiple MySQL database instances on the same part of a shared disk like that, so get that idea out of your head too (disclaimer: I know not if allowing MySQL access to the raw partition fares any better in this case, but I somehow doubt it).

High availability/integrity: So long as you buy a bunch of extra hard drives for the machine you can expect to handle failures of individual disks gracefully. That is if the term gracefully includes running at 25% slower for a couple of hours while bits get shifted around… and then again when the broken drive is replaced… But, no, you won’t lose your data.

Speed: Yeah… SANs are fricken fast… no doubt… SANs usually function on a dedicated fiber-optic network (the aforementioned delicate orange cables) so they A) don’t saturate your IP network, and B) aren’t limited to its speed.

So how does S3 stack up against the SAN? Well, let’s see… Expandability: S3 has a SAN beat hands down with not only implied expandability but also implied constriction: with S3 you pay for what you use.

Backup: Amazon guarantees data retention, no need to pay extra. Clustering: again, covered, providing that you have built your application to play nice in some way there is no problem here.

High Availability and Integrity: Here there is more of a tradeoff, since with a SAN a write is guaranteed and immediately available, while with S3 a write is accepted once and only eventually becomes available. One of the hurdles with S3 is that it may take a while (an unknown period of time) for a file stored in S3 to become available globally, making it less than ideal to, say, host HTML generated by your CMS. That’s not to say that it’s impossible, but there may be an indeterminate period when you have a page linked to and only half your viewers can access it. (You would think you could get around this by storing the data first and then the index last, but there is no guarantee that the order in which items are sent is the order in which they will become available.)
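If your application can tolerate the delay, one way to live with it is to poll until the object shows up before linking to it. The sketch below is only that, a sketch: `is_available` is a hypothetical callable you would supply (for example one that requests the freshly stored object), and a successful check from one machine says nothing certain about what every other viewer sees.

```python
import time


def wait_until_available(is_available, attempts=6, first_delay=0.5, sleep=time.sleep):
    """Poll is_available() with a doubling delay until it returns True.

    Returns True once the object is visible, False if we gave up after
    the given number of attempts.
    """
    delay = first_delay
    for attempt in range(attempts):
        if is_available():
            return True
        if attempt < attempts - 1:
            sleep(delay)   # back off before asking again
            delay *= 2
    return False
```

The `sleep` parameter exists so the waiting can be swapped out in tests; in normal use the default is fine.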

And finally Speed: Here the SAN wins out. You pay for bandwidth to connect to Amazon’s S3 service, and you can’t pay, and wouldn’t want to pay, the bills for a sustained multi-gigabit per second connection to S3 (ouch).

Therefore: if you can handle A) a small time-to-availability, B) non-block access, and C) a speed limited by the public internet connection, then S3 is probably a better choice. But for the total package… if you have the resources… the SAN is irreplaceable.

S3 Vs. NAS (Network Attached Storage)

The NAS is like the SAN’s little brother. They tend to function much the same as a SAN, but usually A) are put on the IP network, which can cause saturation and limits speed, B) are not as robust in the H.A. and data integrity department, C) have a lower cap on their ultimate expandability, and D) cost a whole hell of a lot less than a SAN.

So the NAS has carved out a well deserved niche in small business and some home offices because it provides a bunch of local storage at a much more reasonable price. We, therefore, cannot evaluate its pros and cons on the same points as we did the SAN. NAS are often used to store large files locally in a shared manner. Many clients mount the shared volume and are able to work collaboratively on the files stored there. And for this reason S3 is not even thinking about encroaching on the NAS space. First off, working on a 100MB CAD file over a home DSL line is not feasible in the same way that it is on a NAS. It would be an awful thing to wait for 100MB to save at 12Kb/sec. Period. Also the idea of using multi-user accounting software to have two accountants in the records at the same time is basically impossible…
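To put a number on "awful", here is the arithmetic for that save. The unit is ambiguous as written, so the sketch works it out both ways; the function name is just for illustration.

```python
def transfer_hours(size_mb, rate_kilobytes_per_sec):
    """Hours needed to move size_mb megabytes at a steady rate."""
    size_kb = size_mb * 1024
    return size_kb / rate_kilobytes_per_sec / 3600


# If the rate is 12 kilobytes per second:
print(round(transfer_hours(100, 12), 1))      # 2.4
# If it is 12 kilobits per second (1.5 kilobytes per second):
print(round(transfer_hours(100, 12 / 8), 1))  # 19.0
```

Either way, that is hours per save for a file a NAS on the local network would write in seconds.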

If you’re thinking about the NAS in a data-center type environment, I’m going to consider it lumped in with either the homegrown cluster solution (small NAS) or the SAN (large NAS)

So if you need a NAS… stick with a NAS. HOWEVER consider S3 as a very convenient, effective, and affordable alternative to something like tape based backup solutions for this data.

S3 Vs. HomeGrown Cluster Storage

The home grown clustering solution is an interesting one to tackle. NFS servers, or distributed filesystems (with or without local caching), or Samba servers, or NetWare servers, all with or without some sort of redundancy built in, and all with varying levels of support attached. And that’s your biggest challenge in this space: finding support.

You will have to build your application to take into account the eccentricities of the particular storage medium (varying level of POSIX support, for example) but knowing what those quirks *are* will save you time, frustration, and gobs of money later on. Because if you’re using some random duct-taped solution that’s been all Mac’d out it will probably do the trick. But what happens if the guy who designed it (and thus knows how all the pieces fit together) leaves the company or gets hit by a bus? Well… you’re probably out of luck. But with S3 you have a very large pool of people all rallying around one solution with one (OK, or two) access methods and it simply is what it is.

There are really no surprises with S3, which is the first reason that it beats out the custom tricked out storage solution. The second reason is that there is no assembly required, except maybe figuring out which access library to use. No assembly means no administration. No administration means better code. Better code means getting sold like a hot video sharing company. Well… One can dream.

S3 Vs. Local Storage

Aside from the obvious block access and up-to-SCSI speeds that local storage provides, it loses to S3 in almost every way.

It’s not expandable very far. It’s not very fail-safe. It’s not distributed. It requires some form of backup. It requires power, cooling, room, and physical administration. My advice: if you can skip the hard drive you SHOULD.

S3 Vs. MogileFS

MogileFS is an interesting comer in this particular exercise-of-thought. It’s a kind of hybrid between the grow-your-own cluster and the local storage bit. It offers an intriguing combination of pros vs. cons, and is probably the most apples-to-apples comparison that can be made with S3 at all. Which makes me wish I’d had more of a chance to use it.

But the basic premise is that you have a distributed system, which is easily expandable, and handles data redundancy. My understanding is that you classify certain portions of the storage with a certain number of redundant copies required to be considered safe, and the data is stored on that many different nodes. When you make a request for the file you are returned a link in such a way as is meant to distribute the read load among the various servers housing the data. You also have a built in fail-safe for a node going down and shouldn’t be handed a link to a file on a downed node.
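As a way of picturing that premise, here is a toy model of class-based replication. It is emphatically not MogileFS’s real API or its placement algorithm: the class, its methods, and the `http://node/key` link format are all invented to illustrate the idea of "N copies on N different nodes, and never hand out a link to a dead one".

```python
import random


class ToyStore:
    """Toy model of replicated file placement. Not MogileFS's real API."""

    def __init__(self, nodes, copies_per_class):
        self.alive = {node: True for node in nodes}
        self.copies_per_class = dict(copies_per_class)
        self.locations = {}  # file key -> list of nodes holding a copy

    def store(self, key, file_class):
        """Record key on as many distinct live nodes as its class requires."""
        wanted = self.copies_per_class[file_class]
        live = [node for node, up in self.alive.items() if up]
        if len(live) < wanted:
            raise RuntimeError("not enough live nodes for class %r" % file_class)
        self.locations[key] = random.sample(live, wanted)
        return self.locations[key]

    def mark_down(self, node):
        self.alive[node] = False

    def get_path(self, key):
        """Return one live location for key, chosen at random to spread reads."""
        live = [node for node in self.locations[key] if self.alive[node]]
        if not live:
            raise LookupError("no live copy of %r" % key)
        return "http://%s/%s" % (random.choice(live), key)
```

Picking a random live copy on every read is the simplest possible way to spread load; a real system would presumably also re-replicate files whose copy count drops below what their class requires.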

So what does all that mean? Well if you went about trying to build yourself a non-authenticated in-house version of Amazon’s S3 service you would probably end up with something that is remarkably similar to MogileFS. I wouldn’t even be surprised to find out that S3 is modeled after Mogile. What’s more Mogile has a proven track record when it comes to serving files for web based applications.

So how do they actually compare? I would say that for a company deciding whether or not to use Mogile Versus S3 it comes down to a couple of key factors. A) source and destination of traffic, B) type of files being distributed, and C) up front investment.

As far as your traffic goes: if you’re planning on using Mogile primarily internally and data will rarely leave the LAN then you will not be paying for bandwidth costs associated with S3. That makes for a pretty simple solution. If you are distributing the files to a global audience, however, you might find that using S3 to pay for bandwidth costs along with handling local availability, delivery speed, and high availability is a win. However I’d be fairly inclined to guarantee (as I’ve covered before) that the raw bandwidth purchased from your ISP is a lot cheaper than from Amazon AWS. So as long as you already have all the necessary equipment in place for redundancy, delivery, etc., Mogile’s advantages bring it within striking distance of S3.

If you are distributing primarily small files (images, etc.) then Mogile is not going to present you with any challenges. If, however, you are serving 100MB video files or 650MB CD images Mogile might actually work against you. When I tried to use Mog for this kind of an application there was a limit on the size of an individual file that it was willing to transfer between hosts. In this respect Mog broke its own replication. DISCLAIMER: I only spent a week or so total with Mog (broken up into hour-here and hour-there sessions). This might have been a) known, or b) easily worked around, but my quick googling at the time yielded little help. The idea of having to split large files was a deal breaker at the time and other things were pressing for my attention.

And the real thing that Mog does require which S3 does not is a hardware and manpower investment. Since you’re going to have to work your application in a similar manner to house data in either S3 or in MogileFS, S3 wins out on sheer ease of setup… All you have to do is sign up for an AWS account, pop in a credit card number, and you’re on your way. That same hour. You also don’t run out of space with S3 like you can with Mog. Granted, Mog can be easily expanded, but you have to put more hardware into it. S3 is simply already as large as you need it to be.

Summary

In the end what these choices always come down to is some combination of the classic triangle: Time Vs. Money Vs. Manpower. And what storage is right for you depends on how much of each you are willing to commit. Something always has to give. The main advantage of S3 is that you’re borrowing on the fact that Amazon has already committed a lot of time and hardware resources which you can leverage if the shoe fits.

More than likely what you’ll find is that the “fit” of something like S3 will be a seasonal thing. When you start out developing your application and you don’t have resources to throw at it, using S3 for your storage will make a lot of sense because you can avoid the whole issue of capacity planning and purchasing hardware with storage in mind. Then you will probably move into a quasi-funded mode where it is starting to get, or outright gets, too expensive to use S3 versus hiring an admin and throwing a couple of servers in a data-center. And then you might just come back full circle to a point when you’re drowning in physical administration, and spending a little extra for ease of use and peace of mind comes back into style.

So which is right for you? Probably all of the above, just at different times, for different uses, and for different reasons. The key to your success will likely lie in your ability to plan for when and where each type of storage is right. And to already have a path in mind for when it’s time to migrate.

Where should AmazonAWS go next?

Amazon AWS, Business, Linux, MySQL, Random Thoughts, Software Development, Web Stuff | Posted by apokalyptik
Oct 10 2006

We have SQS, we have S3, and we have EC2, so what next from the Amazon AWS team?

There is really only one piece of the puzzle missing… And it’s a piece that has a lot of people griping. I have a strong hunch that Amazon is working on the problem, because I have a strong hunch that it is (or was) one of their major hurdles. And that problem is the database service.

How do you provide an easy to use interface to relational lookup-able storage? How do you make it universal? How do you make it secure? How do you make it FAST?

The first 3 questions are all answerable in roughly the same way: Make it a service, and let the service handle the interface, security, and universality. They’ve successfully applied the web-service to messaging, storage, and CPU power; there’s no reason that this wouldn’t be the final piece of the jigsaw puzzle. The last question carries with it the greatest problem, though. Allowing people to store data and run queries without the inevitable tanking of the server process would be a challenge, to say the least (artificial intelligence is no match for human stupidity, after all).

But that’s beside the point. If you break the data down into two components, anchors and tags (that is, something is data or something is data about the data), and provide a schema that works without collision problems and, more importantly, works both ways (finding tags related to an anchor, AND finding anchors relating to a tag), you cover probably 90% of people’s needs in one fell swoop.
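The "works both ways" requirement is the crux, so here is the smallest thing I can think of that satisfies it: an in-memory index that records every relation twice. This is only a sketch of the idea, with made-up names, and says nothing about how a hosted service would actually store it.

```python
from collections import defaultdict


class TagIndex:
    """Two-way index between anchors (data) and tags (data about the data)."""

    def __init__(self):
        self.tags_by_anchor = defaultdict(set)
        self.anchors_by_tag = defaultdict(set)

    def add(self, anchor, tag):
        # Every relation is written twice so both directions are one lookup.
        self.tags_by_anchor[anchor].add(tag)
        self.anchors_by_tag[tag].add(anchor)

    def tags_for(self, anchor):
        return sorted(self.tags_by_anchor.get(anchor, ()))

    def anchors_for(self, tag):
        return sorted(self.anchors_by_tag.get(tag, ()))
```

The price of cheap lookups in both directions is that every relation is stored twice, which is exactly the tension between easy access and keeping the size of the whole thing down.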

I’ve been thinking a lot about how to do this, lately, as I’ve been drowning in a sea of data myself which is easy to manage in one direction but difficult in the other while keeping the size of the whole thing down.

Not only would that provide Amazon with the ability to have its finger in basically every new technological cookie jar, BUT it would provide huge massive gigantic enormous amounts of data on what people really think about things. It would be an exceptional win for Amazon, I think, and could indeed be leveraged to a huge advantage in the marketplace. Because, as Netflix has shown us recently, reliably finding things which relate to other things is *big* business.