Most people won't care…

We web 2.0 and web 3.0 people have a hard time caring about the things that normal people care about, and we have a hard time believing that people don't care about the things that we do. In short, we're a large group of very detached individuals who are, more or less, free to form ideas into substance in a vacuum of our own creation.

I often have a hard time coming to grips with this concept myself. WHAT DO YOU MEAN nobody will care about this idea?! It's great. But after a while chewing on that, I'll grudgingly admit that while it may be a great idea… almost nobody will care.

So when I saw, a few days ago, a bit of a fuss being kicked up over Google wanting your browsing history, I surprised myself by offhandedly thinking: "nobody but us cares." And I still think that. As a matter of fact, I think that in a utilitarian sense most everybody will embrace the idea.

The problem is in search. Google has taken keyword search straight to the edge, and now people are hungering for the next search. Search 4.5 beta. And that's relevancy. I'm a dog lover (I have 3 large dogs), so let me give you an example from my world.

Let's assume I just got a new puppy and she's SUPER submissive: peeing all over, shaking, just scared. If I go to Google and type "submissive bitch"… I don't get what I was looking for. Now, if Google has my browser history and sees that I frequent the Chazhound Dog Forums, it has the information necessary to determine that I'm not looking for sex, but in fact dog-related topics.

This is why most people not only won't care, but will embrace giving Google more data. Sure, I care. You care. But let's not fool ourselves into thinking that everybody else cares too 🙂

tags, items, users – lose the joins – gain the freedom.

Long-time readers of this blog will assert that I have no problem presenting an unpopular opinion, and/or sticking my foot in my mouth. Sometimes both at once! ("But wait… there's more!") So when N. Shah asks me how he should split his database (a tags table, items table, and users table), I say: the answer is in the question.

You have only one database

Let's drop the pretense, folks. Let's come back to the real world. This is the web 2.0 world. Data is growing at a seriously exponential rate. And desperate times call for desperate measures.

Joins are nice. They're pretty. They're convenient. They keep us from having to think very much. But they do NOT promote using commodity hardware for your databases. They just don't. No, really, an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster RAID… And then you move to buying SANs and you're talking about some serious cash for that cache. Or… you think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.

So, N, here's my answer: split your database once (by table), and then split your databases once more (by row).

DB:/users

DB:/items

DB:/tags

becomes

DBTags:/tags

DBUsers:/users

DBItems:/items

And then

DBUsers:/users

Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER, depending on your architecture and usage, you could easily split the users as we do the tags (not items) below.

DBItems:/

  • items_id_ending_in_0
  • items_id_ending_in_1
  • items_id_ending_in_2
  • items_id_ending_in_3
  • items_id_ending_in_4
  • items_id_ending_in_5
  • items_id_ending_in_6
  • items_id_ending_in_7
  • items_id_ending_in_8
  • items_id_ending_in_9

Again, pretty simple. You have your run-of-the-mill integer item ids; split them by the last digit of the item id, and you can reduce the footprint of any one table to 1/10th of the whole dataset size.
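In Ruby, that routing is just a modulo on the id (the function name here is mine, not from any framework):

```ruby
# Route an integer item id to its shard table by the last digit of the id.
def items_table_for(item_id)
  "items_id_ending_in_#{item_id % 10}"
end

items_table_for(1234) # => "items_id_ending_in_4"
```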

DBTags:/

  • tags_crc_ending_in_0
  • tags_crc_ending_in_1
  • tags_crc_ending_in_2
  • tags_crc_ending_in_3
  • tags_crc_ending_in_4
  • tags_crc_ending_in_5
  • tags_crc_ending_in_6
  • tags_crc_ending_in_7
  • tags_crc_ending_in_8
  • tags_crc_ending_in_9

Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word and storing it with the tag {id|tag|crc32}, you can quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.
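A sketch of that voodoo in Ruby, using the standard library's Zlib.crc32 (the helper names are mine):

```ruby
require 'zlib'

# CRC32 the tag word so we can treat words as numbers.
def tag_crc(tag)
  Zlib.crc32(tag)
end

# Route a tag to its shard table by the last digit of its CRC.
def tags_table_for(tag)
  "tags_crc_ending_in_#{tag_crc(tag) % 10}"
end
```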

You can still use your join tables items_to_users and tags_to_items; these tables, consisting of ints, take up almost _NO_ space whatsoever, and so can go where convenient (if you query items for users more than users for items, then put the join table in the users db), but you can't actually perform in-server full joins any longer. Heck, you can even keep two copies of the join data: items_to_tags in the items dbs, and tags_to_items in the tags dbs.
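To make the two-round-trip flow concrete, here's a toy in-memory version of "items for a tag" (plain hashes standing in for the separate servers; nothing here is real client code):

```ruby
require 'zlib'

# Toy stand-ins for the sharded tag dbs and the tags_to_items join table.
TAG_SHARDS    = Hash.new { |h, k| h[k] = {} } # shard digit => { crc => tag_id }
TAGS_TO_ITEMS = Hash.new { |h, k| h[k] = [] } # tag_id => item ids

def add_tagging(tag, tag_id, item_id)
  crc = Zlib.crc32(tag)
  TAG_SHARDS[crc % 10][crc] = tag_id
  TAGS_TO_ITEMS[tag_id] << item_id
end

# No SQL join: first reverse the tag word to an id on its shard,
# then hit the join table for the item ids.
def items_for_tag(tag)
  crc = Zlib.crc32(tag)
  tag_id = TAG_SHARDS[crc % 10][crc]
  tag_id ? TAGS_TO_ITEMS[tag_id] : []
end

add_tagging("dogs", 7, 42)
add_tagging("dogs", 7, 99)
items_for_tag("dogs") # => [42, 99]
```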

So, like many things in life, going cheaper meant going a bit harder. But what did we gain? Well, let's assume 10 EC2 instances…

Ec2a

  • users (w)
  • items 0-1 (w)
  • tags 0-1 (w)

Ec2b

  • items 2-3 (w)
  • tags 2-3 (w)

Ec2c

  • items 4-5 (w)
  • tags 4-5 (w)

Ec2d

  • items 6-7 (w)
  • tags 6-7 (w)

Ec2e

  • items 8-9 (w)
  • tags 8-9 (w)

Ec2f

  • items 0-1 (r)
  • tags 0-1 (r)

Ec2g

  • users (r)
  • items 2-3 (r)
  • tags 2-3 (r)

Ec2h

  • items 4-5 (r)
  • tags 4-5 (r)

Ec2i

  • items 6-7 (r)
  • tags 6-7 (r)

Ec2j

  • items 8-9 (r)
  • tags 8-9 (r)

So that's a total of about… oh… 1.6 terabytes of space, 18GB of RAM, 17GHz of processor speed, and an inherently load-balanced set of database instances. And when you need to grow? Split by the last 2 digits (16TB), 3 digits (160TB), 4 digits (1,600TB)…

So, now that you've read to the bottom: it's 1:00am, way past my bed time. Remember that, above all, when designing a database you need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (that's why it's called "your problem"), but techniques can be applied, and outlooks can be leveraged.

Disclaimer: some or all of this might be wrong, there may be better ways, don't blame me. I'm sleep-typing 😉

Whoa, talk about neglecting your weblog! Bad Form!

I know, I know, I've been silent for quite some time. Well, let me assure you that I'm quite all right! Are you less worried about me now? Oh good. (Yes, I'm a cynical bastage sometimes.)

So life has, as it tends to do, come at me pretty fast. I've left my previous employer, Ookles, and I wish them all the best in accomplishing everything that they've been working towards. So I've joined up with the very smart, very cool guys at Automattic. I have to tell you, I'm excited to be working with these guys; they're truly a great group.

I guess that means I'm… kind of… like… obligated to keep up on my blog now, eh? I'm also kind of, like, exhausted. Jumping feet-first into large projects has a tendency to do that to a guy, though. And truth be told, I wouldn't have it any other way…

😀

Cheers

DK

Hpricot <text>sometext</text> workaround

As noted by the open trouble ticket here, the most awesome Hpricot seems to have come down with a bug: it's not able to access "sometext" inside this: "<text>sometext</text>". It parses it OK (puts doc.inspect definitely shows the proper {elem}); you just can't get to it. So here's my ugly little hack/workaround for this issue until it's resolved. (I'm posting it here since I can't seem to sign up to make a comment on the bug report on the Hpricot home page… and someone might find this useful.) This hack is specifically for web documents, but would also work for strings or files with only minor tweaks.

## Begin hack
require 'open-uri'
require 'hpricot'

doc = ""
open(url) do |f|
  doc = doc + f.read
end
doc = doc.gsub(/<text>/, "<mtext>")
doc = doc.gsub(/<\/text>/, "</mtext>")
doc = Hpricot(doc)
## Should be one line:
## doc = Hpricot(open(url))
## End hack

ruby-Delicious v0.001

Since I've worked out the kinks mentioned in my last blog entry (was a problem with re-escaping already-escaped data, by the way (never debug while sick and sleep-deprived!)), I've scraped things together into a class which is a client for the API itself. It's relatively sparse right now, but good enough for use in an application, which is what the client is geared towards, by the way: specifically (and privately) tagging arbitrary data. It *can* publicly tag URLs, but that's more or less a side effect of what delicious… is… and not a direct intention while writing the API. You can visit the quickly-thrown-together ruby-Delicious page here (link also added up top).

Having a strange problem with the del.icio.us api


I'm using code referenced here: http://www.bigbold.com/snippets/posts/show/2431 to access the del.* tagging API with only limited success. I don't think it's the code, though, because the problem is reproducible in the browser, and everything *seems* to line up with the docs. The URL I use to create the item is:

https://api.del.icio.us/v1/posts/add?&url=la+la+la&description=foo&tags=foo

This works. I get a nice "foo" with the proper url "la la la" and I get a pretty <result code="done"/>. Then I try to delete the item with either of these URLs:

https://api.del.icio.us/v1/posts/delete?&url=la+la+la

https://api.del.icio.us/v1/posts/delete?&url=foo

Neither of these works. I still get a pretty <result code="done"/>, but the item is never deleted…

I saw this problem referenced on the Tucows Farm blog, but the only suggestion in the comments was: "Google for 'delicious-import.pl', it deletes bookmarks upto 100 at a time. A quick little override in the code will make it delete all bookmarks. Handy when you screw up an import. Not so handy in other situations, which is why you cant do it by default. This script will read a netscape/firefox/mozilla type bookmark file. I am re-working it to do Opera for me." Which I did. The URL built inside of the script there is http://user:[email protected]/api/posts/delete?&url=la+la+la but that's out of date. I tried it anyhow, and it redirected me to the /v1/ query above (https://api.del.icio.us/v1/posts/delete?&url=la+la+la), which still didn't work. I can't imagine that I'm the only person who's run into this problem.
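For what it's worth, here's how I'm building those query strings, with CGI.escape doing the parameter encoding; the helper function is mine, and the endpoint path is just copied from the URLs above:

```ruby
require 'cgi'

# Build a del.icio.us v1 posts URL with CGI-escaped query parameters.
def delicious_url(action, params)
  query = params.map { |k, v| "#{k}=#{CGI.escape(v)}" }.join("&")
  "https://api.del.icio.us/v1/posts/#{action}?#{query}"
end

delicious_url("delete", "url" => "la la la")
# => "https://api.del.icio.us/v1/posts/delete?url=la+la+la"
```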

Tag anything, anywhere?

I've not been able to find anything really high-profile (good Google PageRank), but is there an API which allows you to tag *anything*, anywhere? (Not just URLs, but… any piece of data?) Being able to take one arbitrary identifier, optionally a type, and add arbitrary tags to it sounds like the stuff of web 2.0, yea? But it seems people are just home-brewing their own. Now if I were able to go somewhere and hit /tags/people/demitrious or /tags/blogs/demitrious or /tags/*/demitrious or /tags/urls/apokalyptik.com or /tags/foo/bar, then we'd be getting somewhere.