Random Musing: Blurring the Line Between Storage and Database?

As food for thought…

If you had a table `items`

  • itemId char(40),
  • itemName varchar(128),

Another table `tags`

  • tagId char(40),
  • tagName char(40),

And a third table `owners`

  • ownerId char(40),
  • ownerUsername char(40),
  • ownerPassword varchar(128),

It would theoretically be possible to have an S3 bucket ItemsToTags in which you put empty objects named (ownerId)-(itemId)-(tagId), and a TagsToItems S3 bucket in which you put empty objects named (ownerId)-(tagId)-(itemId). It would then be possible to use the "Listing Keys Hierarchically using Prefix and Delimiter" method of accessing your S3 buckets to quickly determine what items belong to a tag for an owner, and what tags belong to an item for an owner. You would be taking advantage of the fact that "There is no limit to the number of objects that one bucket can hold, and no impact on performance when using many buckets versus just a few buckets. You could reasonably store all of your objects in a single bucket, or organize them across several different buckets." (Both quoted passages are taken directly from the S3 API docs provided by Amazon themselves.)
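To make the key-naming idea concrete, here is a minimal sketch in Python. It does not talk to S3 at all: `FakeBucket` is a hypothetical in-memory stand-in that only keeps a sorted list of key names, and `tag_item`, `tags_for_item` and `items_for_tag` are names made up for illustration. It also assumes the ids never contain the `-` separator. Against the real service, the `list_prefix` step would presumably be a key listing restricted by a prefix (for example boto3's `list_objects_v2` with its `Prefix` parameter), which is not exercised here.

```python
from bisect import bisect_left


class FakeBucket:
    """Hypothetical in-memory stand-in for a bucket: a sorted list of key names."""

    def __init__(self):
        self._keys = []

    def put_empty(self, key):
        # Store an "empty object": only the key name carries information.
        i = bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def list_prefix(self, prefix):
        # Return every key that starts with `prefix`, in sorted order.
        i = bisect_left(self._keys, prefix)
        found = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            found.append(self._keys[i])
            i += 1
        return found


def tag_item(items_to_tags, tags_to_items, owner_id, item_id, tag_id):
    # Write the relationship twice, once per lookup direction.
    items_to_tags.put_empty(f"{owner_id}-{item_id}-{tag_id}")
    tags_to_items.put_empty(f"{owner_id}-{tag_id}-{item_id}")


def tags_for_item(items_to_tags, owner_id, item_id):
    prefix = f"{owner_id}-{item_id}-"
    return [key[len(prefix):] for key in items_to_tags.list_prefix(prefix)]


def items_for_tag(tags_to_items, owner_id, tag_id):
    prefix = f"{owner_id}-{tag_id}-"
    return [key[len(prefix):] for key in tags_to_items.list_prefix(prefix)]
```

Each lookup direction is a single prefix listing, which is why the relationship has to be written into both buckets.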

Using this method it would be possible, I think, to use the S3 datastore in a VERY cheap manner and avoid having to deal with the massive cost of maintaining these kinds of indexes in an RDBMS or on your own filesystems… Interesting. And since the data could be *anything*, and you have a many-to-many relationship here by default, you could theoretically store *anything* and sort by tags…

Granted, to find a tag related to multiple items you would have to make multiple requests and weed out the differences. But if you’re only talking on the order of 2 or 3 tags per piece of data… it might just be feasible.
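The "multiple requests, then weed out the differences" step could look roughly like the sketch below: one prefix listing per item, followed by a set intersection. Everything here is illustrative. `KEYS` is a made-up in-memory sample of ItemsToTags key names, `list_prefix` is a hypothetical stand-in for a real prefix listing against the bucket, and the ids are assumed not to contain the `-` separator.

```python
# Made-up sample of ItemsToTags key names: (ownerId)-(itemId)-(tagId)
KEYS = sorted([
    "o1-i1-t1", "o1-i1-t2",
    "o1-i2-t1", "o1-i2-t3",
    "o1-i3-t1", "o1-i3-t2",
])


def list_prefix(prefix):
    # Hypothetical stand-in for one listing request restricted to a prefix.
    return [key for key in KEYS if key.startswith(prefix)]


def common_tags(list_prefix, owner_id, item_ids):
    """Return the set of tag ids attached to every one of the given items."""
    common = None
    for item_id in item_ids:
        prefix = f"{owner_id}-{item_id}-"
        # One request per item; strip the prefix to leave just the tag id.
        tags = {key[len(prefix):] for key in list_prefix(prefix)}
        common = tags if common is None else common & tags
        if not common:
            break  # nothing shared so far, so later requests cannot help
    return common or set()
```

The number of listing requests grows with the number of items being compared, so this only stays cheap while that number is small.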

Now… Throw in an EC2 front end, and an SQS interface… interesting…

Makes me wonder what the cost and speed would be (and whether it would be an acceptable tradeoff for not having to maintain a massive database cluster).

Disclaimer: this is a random musing. I’m not advising that anybody actually do this…

4 thoughts on “Random Musing: Blurring the Line Between Storage and Database?”

  1. Or you could just stick your data in Google Base, and get a rich query language, plus reliable datastore for free.

    Obviously it's not going to be suitable for all data, but for some very specific use cases I'm convinced it would work.

    (Nice blog, BTW)

  2. True, BUT you'll be stuck with the gbase 250,000 item limit (assuming the max-results value times the start-index value equals the total possible number of results which it is possible to fetch on a certain criteria). This method would, theoretically, have no limit on the number of items which you could pull in a single query.
