tags, items, users – loose the joins – gain the freedom.

Long time readers of this blog will assert that I have no problem presenting an unpopular opinion, and/or sticking my foot in my mouth. Some times both at once! (“But wait… there’s more!”) So when N. Shah asks me how he should split his database (a tags table, items table, and users table) I say: The answer is in the question.

You have only one database

Lets drop the pretense folks. Lets come back to the real world. This is the web 2.0 world. Data is growing at a seriously exponential. And desperate times call for desperate measures.

Joins are nice. They’re pretty. They’re convenient. They keep us from having to think very much.  But they do NOT promote using commodity hardware for your databases. They just don’t. No, really, an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster raid… And then you move to buying SAN’s and you’re talking about some serious cash for that cache. Or… You think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.

So, N, here’s my answer: Split your database once, and then your databases once.

DB:/users

DB:/items

DB:/tags

becomes

DBTags:/tags

DBUsers:/users

DBItems:/items

And then

DBUsers:/users

Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER. depending on your architecture and uses you could easily split the users as we do the tags (not items) below.

DBItems:/

  • items_id_ending_in_0
  • items_id_ending_in_1
  • items_id_ending_in_2
  • items_id_ending_in_3
  • items_id_ending_in_4
  • items_id_ending_in_5
  • items_id_ending_in_6
  • items_id_ending_in_7
  • items_id_ending_in_8
  • items_id_ending_in_9

again, pretty simple. you have your run of the mill integer item id’s split them by the last digit of your item id, and you can reduce the footprint of any one table to 1/10th of the whole dataset size

DBTags:/

  • tags_crc_ending_in_0
  • tags_crc_ending_in_1
  • tags_crc_ending_in_2
  • tags_crc_ending_in_3
  • tags_crc_ending_in_4
  • tags_crc_ending_in_5
  • tags_crc_ending_in_6
  • tags_crc_ending_in_7
  • tags_crc_ending_in_8
  • tags_crc_ending_in_9

Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word, and storing it with the tag {id|tag|crc332} you can quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.

You can still use your join tables items_to_users, and tags_t_items, these tables consisting of ints take up almost _NO_ space whatsoever, and so can go where convenient (if you query items for users more than users for items, then put the join table in the users db) but you cant actually preform in-server full joins any longer. Heck you can even keep two copies of the join data, items_to_tags in the items dbs, and tags_to_items in the items dbs.

So, like many things in life, going cheaper meant going a bit harder. But what did we gain? Well lets assume 10 ec2 instances…

Ec2a

  • users (w)
  • items 0-1 (w)
  • tags 0-1 (w)

Ec2b

  • items 2-3 (w)
  • tags 2-3 (w)

Ec2c

  • items 4-5 (w)
  • tags 4-5 (w)

Ec2d

  • items 6-7 (w)
  • tags 6-7 (w)

Ec2e

  • items 8-9 (w)
  • tags 8-9 (w)

Ec2f

  • items 0-1 (r)
  • tags 0-1 (r)

Ec2g

  • users (r)
  • items 2-3 (r)
  • tags 2-3 (r)

Ec2h

  • items 4-5 (r)
  • tags 4-5 (r)

Ec2i

  • items 6-7 (r)
  • tags 6-7 (r)

Ec2j

  • items 8-9 (r)
  • tags 8-9 (r)

So thats a total of about… oh… 1.6 terrabytes of space… 18gb of RAM, 17Ghz of processor speed, and an inherently load balanced set of database instances. And when you need to grow? split by the last 2 (16TB) digits, 3(160Tb) digits, 4(1,600TB) digits…

So, now that you’ve read to the bottom. It’s 1:00am, way past my bed time. Remember that when designing a database you — above all — need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (thats why its called “your problem”) but techniques can be applied, and outlooks can be leveraged.

Disclaimer: some or all of this might be wrong, there may be better ways, dont blame me. I’m sleep-typing 😉

Dear every site

With the hoops I had to jump through to make a darned account on your site to be able to do anything at all, even only for a moment… One might get the impression that you actually don’t want people to sign up and use your service.  What a horrendous chain of data entry, waiting, verifying.  Not to mention that I’ve already done this… over… and over… and over… ad over… and over… and over again at every other site on the planet. Times are a changing, and in a year or two this process will loose you customers. It might already be.  Just throwing in my $o.02

Tell you what… I’ll commit with a checkbox, to clicking on an advertisement during each of my visits.  And if I dont then you can require all that information from me. I don’t mind you making money by my being here, but I do mind having to regurgitate this information yet again

DK

Actually… iTerm *and* negatives in speech :)

Marc – I did manage to get it behaving MOSTLY the way I want it to… I’ll post more on that later tonight or tomorrow ( hopefully 🙂 )

Bon – you’re right I *did* mean that I wouldn’t have it any other way.  I consider English my 4th or 5th language behind things like bash, php, and ruby :)  Such is the way of these things, I suppose, eh? 😀

I give up on OSX’s terminal.app

Lets face it. If you rely on vim locally, vim remotely on non debian based systems, and vim remotely on Debian based systems. ESPECIALLY in an environment where you don’t get to put in your own .vimrc (such as when logged in as root, and other people log in as root as well.) You’re pretty much screwed. Now I tried all the hacks. The custom strings in preferences. I tried replacement terminals (terminator, iterm, and one or two others) And I could get things to work locally but not remotely, or remotely but not locally… So… I… Well… I gave up. I:

  • Installed X11 from my Install DVD
  • Ran X11Update2006.pkg
  • Launched X11
  • Selected Applications –> Customize Menu
  • Changed my “Terminal” command to “xterm -fa Monaco -fs 12 -bg black -fg grey -sb -sl 99999 -cr green -bc”
  • symlinked my .profile to .bashrc

And… “the dishes are done, dude” This takes care of *basically* everything.. with two minor exceptions… forward delete doesn’t work well on the command line, and pasting is somewhat awkward. but everything else… home, end, pgup, pgdn, l,r,u,d, all work everywhere I’ve tried so far. I can put up with some things to work around… as long as they are consistent things to ALWAYS work around… not things that work one way here, another over there, and differently in a third place… That bites.

And in case you’re curious… here’s my .profile (.bashrc)… without all the personal aliases…

function tranquil_thoughts_prompt
{
local WHITE="\[\033[1;37m\]"
local BRIGHTGREEN="\[\033[1;32m\]"
local GREEN="\[\033[0;32m\]"
local CYAN="\[\033[0;36m\]"
local GRAY="\[\033[0;37m\]"
local BRIGHTCYAN="\[\033[1;36m\]"
export PS1="${GRAY}.o${WHITE}O(${BRIGHTCYAN}\u${GRAY}@${BRIGHTGREEN}\H ${CYAN}\w${WHITE})${GRAY} "
}

export EDITOR="vim";
export PROMPT_COMMAND='echo -ne "\033]0;${USER}@localhost - ${PWD}\007"'
tranquil_thoughts_prompt
export PATH=/opt/local/bin:/opt/local/sbin:$PATH:$EC2_HOME/bin:~/bin

(The export PS1 line above is all one line from export to } “

Whoa, talk about neglecting your weblog! Bad Form!

I know, I know, I’ve been silent for quite some time. Well Let me assure you that I’m quite all right! Are you less worried about me now? Oh good. (Yes I’m a cynical bastage sometimes.)

So life has, as it tends to do, come at me pretty fast. I’ve left my previous employer, Ookles, and I wish them all the best in accomplishing everything that they’ve been working towards. So I’ve Joined up with the very smart, very cool guys at Automattic. I have to tell you I’m excited to be working with these guys, they’re truly a great group.

I guess that means I’m… kind of… like… obligated to keep up on my blog now, eh?  I’m also kind of, like, ehausted. Jumping feet first into large projects has a tendency to do that to a guy though.  And truth be told I would have it any other way…

😀

Cheers

DK