DivShare, Day 1 (raw commentary)

I began looking at DivShare a few days ago as a way to store, save, and share my personal photo collection. The idea of auto-galleries, unlimited space, flash video, and possible FTP access was… enticing. But it's tough to tell how something like this is going to work on a large scale…

So… after messing around with a free DivShare account for a while, I decided it was more worth my while to pay ten bucks for a pro account and get FTP access than to try to use mechanize (or something similar) to hack out my own makeshift API. Now I have about… oh… 8,000 files I want to upload… so uploading them 10 at a time was just _NOT_ going to happen…

After paying for a pro account I was *immediately* granted FTP access, no waiting, and for that I was grateful. Since I shoot photos at 6MP, and that's WAY too large for most online uses, I have a shell script which automagically creates thumbnails at 5%, 10%, and 25% of original size. This meant I had an expansive set of files I could upload in only a couple of hours (the 5% thumbs end up totaling less than 200MB). This, I thought, would be an excellent test of their interfaces.
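(For the curious: mine is a plain shell script wrapped around ImageMagick, but the whole idea fits in a few lines. Here's roughly the same thing sketched in Ruby; it assumes ImageMagick's convert is on your PATH, and the originals/ directory name is made up for illustration.)

```ruby
#!/usr/bin/env ruby
# Rough sketch of the thumbnailing idea. Assumes ImageMagick's
# `convert` is installed; source dir and sizes are illustrative.
require 'fileutils'

SIZES = [5, 10, 25] # percent of original

Dir.glob('originals/*.jpg').each do |src|
  SIZES.each do |pct|
    dir = format('%03d', pct) # "005", "010", "025"
    FileUtils.mkdir_p(dir)
    # -resize with a percentage keeps the aspect ratio
    system('convert', src, '-resize', "#{pct}%", File.join(dir, File.basename(src)))
  end
end
```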

So a-FTPing I go. I upload all my files into a subdirectory (005). Visit the dash: nothing. Visit the FTP-upload page to recheck… maybe I did something wrong. AND WHAM! An 8,000-checkbox form to accept FTP-uploaded files… ugh. Thankfully they're all checked by default. I let it load (for a long while) and hit submit… and wait… and wait… and wait. Then the server-side connection times out a while later. Fair enough. Check my dash… about 1,500 of the 8,000 photos were imported… I'm going to have to do this 6 times. Annoying, but doable. I hit the second submit and pop open another browser to look at my dash. And DivShare did *nothing* with my folder name… it wasn't translated to a "virtual" folder at all. Tsk tsk.

So I need to put about 1,500 photos, manually, into an 005 folder… and then I realize… I have to do this 20 files at a time… with no way to show just the files that are not currently in a folder.

… uh no …

Ok, so I open up one of the photos that I DID put into the 005 folder, and it did, in fact, make them into a "gallery" of sorts. It made a thumbnail and displayed all 3,000 photos side by side in something similar to an iframe… no rows. Just one row… 3,000 columns… and waiting as my browser requests each… and every… thumb… from DivShare. Wonderful. The gallery controls are simple enough: an iframe with a scrollbar at the bottom, a next-photo link, and a previous-photo link. And all 3 controls make you lose your place in the iframe when you use them…

Now don't get me wrong. You get what you pay for. But hey… I did pay this time ;) The service is excellent for what it does, and my use case was a bit extreme. Still, I hope that they address the issues I've pointed out. I'd really like to continue using them, and if they can make my photo process easier I'll gladly keep paying them $10/mo.

That's:

  1. Don't ignore what pro users are telling you when they upload
  2. Process large accepts in the background, and let me know I need to come back later
  3. Negative searching (folder == nil)
  4. Mass file controls (either items/page, or all-items-in-view (folder == nil))
  5. Give me a gallery a non-broadband user can use (1,500 thumbs in one sitting tastes bad, more filling)
  6. Don't undo what I've done in the gallery on every click. Finding your place among 8,000 photos is tedious enough to do once

And I know I sound like I'm just complaining. And I am. But this is web 2.0 feedback, baby. Ignore my grouchiness and (if I'm lucky) take my suggestions and run with them asap. The photo/files market is very, very far from cornered!

Most people won't care…

We web 2.0 and web 3.0 people have a hard time caring about the things that normal people care about. And we have a hard time believing that people don't care about the things that we do. In short, we're a large group of very detached individuals who are, more or less, free to form ideas into substance in the vacuum of our own creation.

I often have a hard time coming to grips with this concept myself. WHAT DO YOU MEAN nobody will care about this idea?! It’s great.  But after a while chewing on that, I’ll grudgingly admit that while it may be a great idea… Almost nobody will care.

So when I saw, a few days ago, a bit of a fuss being kicked up over Google wanting your browsing history, I surprised myself by offhandedly thinking: "nobody but us cares." And I still think that. As a matter of fact, I think that in a utilitarian sense most everybody will embrace the idea.

The problem is in search. Google has taken keyword search straight to the edge. And now people are hungering for the next search. Search 4.5 beta. And that's relevancy. I'm a dog lover (I have 3 large dogs), so let me give you an example from my world.

Let's assume I just got a new puppy and she's SUPER submissive. Peeing all over, shakes, just scared. If I go to Google and type "submissive bitch"… I don't get what I was looking for. Now if Google has my browser history and sees that I frequent the Chazhound Dog Forums, it has the information necessary to determine that I'm not looking for sex but, in fact, for dog-related topics.

This is why most people not only won't care, but will embrace giving Google more data. Sure, I care. You care. But let's not fool ourselves into thinking that everybody else cares too 🙂

tags, items, users – lose the joins – gain the freedom.

Long-time readers of this blog will assert that I have no problem presenting an unpopular opinion, and/or sticking my foot in my mouth. Sometimes both at once! ("But wait… there's more!") So when N. Shah asks me how he should split his database (a tags table, items table, and users table) I say: the answer is in the question.

You have only one database

Let's drop the pretense, folks. Let's come back to the real world. This is the web 2.0 world. Data is growing at a seriously exponential rate. And desperate times call for desperate measures.

Joins are nice. They're pretty. They're convenient. They keep us from having to think very much. But they do NOT promote using commodity hardware for your databases. They just don't. No, really, an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster RAID… and then you move to buying SANs, and you're talking about some serious cash for that cache. Or… you think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.

So, N, here’s my answer: Split your database once, and then your databases once.

DB:/users

DB:/items

DB:/tags

becomes

DBTags:/tags

DBUsers:/users

DBItems:/items

And then

DBUsers:/users

Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER, depending on your architecture and uses, you could easily split the users as we do the tags (not items) below.

DBItems:/

  • items_id_ending_in_0
  • items_id_ending_in_1
  • items_id_ending_in_2
  • items_id_ending_in_3
  • items_id_ending_in_4
  • items_id_ending_in_5
  • items_id_ending_in_6
  • items_id_ending_in_7
  • items_id_ending_in_8
  • items_id_ending_in_9

Again, pretty simple: you have your run-of-the-mill integer item IDs. Split them by the last digit of the item id, and you reduce the footprint of any one table to 1/10th of the whole dataset size.
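In code, that routing is a one-liner. A minimal sketch (the table names follow the layout above):

```ruby
# Pick the shard table for an item by the last digit of its integer id.
def item_table(item_id)
  "items_id_ending_in_#{item_id % 10}"
end

item_table(48213) # => "items_id_ending_in_3"
```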

DBTags:/

  • tags_crc_ending_in_0
  • tags_crc_ending_in_1
  • tags_crc_ending_in_2
  • tags_crc_ending_in_3
  • tags_crc_ending_in_4
  • tags_crc_ending_in_5
  • tags_crc_ending_in_6
  • tags_crc_ending_in_7
  • tags_crc_ending_in_8
  • tags_crc_ending_in_9

Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word, and storing it with the tag {id|tag|crc32}, you can quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.
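In Ruby, Zlib gives you the CRC32 for free. A minimal sketch, with table names following the layout above:

```ruby
require 'zlib'

# Hash the tag word, then route by the last digit of the CRC so the
# tag tables split by powers of 10, just like the item tables do.
def tag_crc(tag)
  Zlib.crc32(tag)
end

def tag_table(tag)
  "tags_crc_ending_in_#{tag_crc(tag) % 10}"
end

# tag_table("puppy") # => "tags_crc_ending_in_<last digit of the crc>"
```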

You can still use your join tables items_to_users and tags_to_items; these tables, consisting of ints, take up almost _NO_ space whatsoever, and so can go wherever convenient (if you query items for users more than users for items, then put the join table in the users db). But you can't actually perform in-server full joins any longer. Heck, you can even keep two copies of the join data: items_to_tags in the items dbs, and tags_to_items in the tags dbs.
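Concretely, the "join" becomes two or three round trips in application code. A sketch, where db_for and query are hypothetical stand-ins for whatever connection handling you actually use (assume query returns an array of integers or rows):

```ruby
require 'zlib'

# A cross-database "join" done in application code. `db_for` and
# `query` are hypothetical helpers, not a real library API.
def items_for_tag(tag)
  crc   = Zlib.crc32(tag)
  shard = crc % 10

  # 1. Reverse the tag word to its id on the right tag shard.
  #    (Compare the tag text too, in case of CRC collisions.)
  tag_id = db_for(:tags, shard).query(
    "SELECT id FROM tags_crc_ending_in_#{shard} WHERE crc = #{crc} AND tag = '#{tag}'"
  ).first
  return [] unless tag_id

  # 2. Pull item ids from the join table (kept on the tags side here).
  item_ids = db_for(:tags, shard).query(
    "SELECT item_id FROM tags_to_items WHERE tag_id = #{tag_id}"
  )

  # 3. Fan out to the item shards, grouping ids by their last digit.
  item_ids.group_by { |id| id % 10 }.flat_map do |digit, ids|
    db_for(:items, digit).query(
      "SELECT * FROM items_id_ending_in_#{digit} WHERE id IN (#{ids.join(',')})"
    )
  end
end
```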

So, like many things in life, going cheaper meant going a bit harder. But what did we gain? Well, let's assume 10 EC2 instances…

Ec2a

  • users (w)
  • items 0-1 (w)
  • tags 0-1 (w)

Ec2b

  • items 2-3 (w)
  • tags 2-3 (w)

Ec2c

  • items 4-5 (w)
  • tags 4-5 (w)

Ec2d

  • items 6-7 (w)
  • tags 6-7 (w)

Ec2e

  • items 8-9 (w)
  • tags 8-9 (w)

Ec2f

  • items 0-1 (r)
  • tags 0-1 (r)

Ec2g

  • users (r)
  • items 2-3 (r)
  • tags 2-3 (r)

Ec2h

  • items 4-5 (r)
  • tags 4-5 (r)

Ec2i

  • items 6-7 (r)
  • tags 6-7 (r)

Ec2j

  • items 8-9 (r)
  • tags 8-9 (r)

So that's a total of about… oh… 1.6 terabytes of space… 18GB of RAM, 17GHz of processor speed, and an inherently load-balanced set of database instances. And when you need to grow? Each extra digit multiplies your shard count by 10: split by the last 2 digits (16TB), 3 digits (160TB), 4 digits (1,600TB)…

So, now that you've read to the bottom: it's 1:00am, way past my bedtime. Remember that when designing a database you, above all, need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (that's why it's called "your problem") but techniques can be applied, and outlooks can be leveraged.

Disclaimer: some or all of this might be wrong, there may be better ways, don't blame me. I'm sleep-typing 😉

Whoa, talk about neglecting your weblog! Bad Form!

I know, I know, I've been silent for quite some time. Well, let me assure you that I'm quite all right! Are you less worried about me now? Oh good. (Yes, I'm a cynical bastage sometimes.)

So life has, as it tends to do, come at me pretty fast. I've left my previous employer, Ookles, and I wish them all the best in accomplishing everything that they've been working towards. I've joined up with the very smart, very cool guys at Automattic. I have to tell you I'm excited to be working with these guys; they're truly a great group.

I guess that means I'm… kind of… like… obligated to keep up on my blog now, eh? I'm also kind of, like, exhausted. Jumping feet first into large projects has a tendency to do that to a guy, though. And truth be told I wouldn't have it any other way…

😀

Cheers

DK

ruby-Delicious v0.001

Since I've worked out the kinks mentioned in my last blog entry (it was a problem with re-escaping already-escaped data, by the way (never debug while sick and sleep-deprived!)) I've scraped things together into a class which is a client for the API itself. It's relatively sparse right now, but good enough for use in an application. Which is what the client is geared towards, by the way. Specifically (and privately) tagging arbitrary data. It *can* publicly tag URLs, but that's more or less a side effect of what delicious… is… and not a direct intention while writing the API. You can visit the quickly-thrown-together ruby-Delicious page here (link also added up top).
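For the record, the bug was the classic double-escaping trap. A tiny illustration (not the actual ruby-Delicious code):

```ruby
require 'cgi'

raw   = 'design & usability'
once  = CGI.escape(raw)  # => "design+%26+usability"
twice = CGI.escape(once) # => "design%2B%2526%2Busability" -- escaped twice!

CGI.unescape(twice) # => "design+%26+usability" -- not the original string
```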

Kudos to the openfount guys

I'm really very impressed with the speed at which the Openfount guys responded to my last post. I definitely give kudos to Bill for being on top of things! I'm running out the door, so I'll keep this short and sweet.

He's right, I did generalize databases into InnoDB, but that's because it's what I use. So my apologies for that.
I definitely had no intention of badmouthing the Openfount guys (if that's what it sounded like, I apologize); I was just reporting what I saw, and my impressions.

And Bill – I would have used either

  • apokalyptik
  • at
  • apokalyptik
  • dot
  • com

or

  • consult
  • at
  • apokalyptik
  • dot
  • com

or

  • demitrious
  • dot
  • kelly
  • at
  • gmail
  • dot
  • com

Infinidisk Update

I mentioned a while back that I was going to be playing with the S3 Infinidisk product. What I found in my testing was that this product is not prime-time ready. There was a nasty bug which caused data to be lost if the mv command was used. The scripts themselves were unintuitive. They required fancy-pants nohupping or screening to use long-term. Oh, and a database definitely will not work on top of this FS. It seems obvious in retrospect, but I wanted to be sure. InnoDB won't even build its initial files, much less operate, on the FS. To top it all off, my pre-sales support question was never even so much as acknowledged.

No, I think I’ll be leaving this product alone for now and sticking with clever uses of s3sync and s3cmd, thanks.

Consolidated update of no real importance

I've been working very hard (and very constantly) lately, and as is the case with most technical creators, this means that the blog has suffered a lack of posts recently. This is an attempt not to make up for that, but to fill in the gap with a bit of noise. I like to think of my blog as a high "signal to noise" ratio kinda place, but desperate times call for desperate measures, right? I don't want you all to think I've forgotten you. Actually, the first thing I've got is solely about you, "the reader" (whoever you are).

Does anybody know of a WP plugin which allows users to suggest topics? I love writing, and thinking about problems, but oftentimes I'm far too busy solving lots of problems that I can't talk about to come up with all-new problems to talk about 🙂 (damn NDAs.) I would like to encourage a steady stream of ideas from readers, and potential readers. That doesn't mean I would actually tackle everything suggested, but it would definitely help me out for those times when my "muse" has left me.

It strikes me that a shoutbox might be exactly what the doctor ordered here. I'll try to remember later on to look for one (preferably with some sort of history, or even an e-mail notification function). If those features aren't extant, building them in would be pretty easy; I'll try to pick a decent SB and contribute a patch back.

Another reason for my lack of signal lately is that I've finally managed to contract one of the contagious illnesses that my wife brings home. It's not her fault, mind you; she's an education professional working with small children, so the random viruses flying around are a part of the package. On the flip side, she gets to put up with me being awake at all hours of the night when I get inspiration on how to solve some problem she doesn't care about in the slightest. So it's a two-way street here :D But this one has really sunk its teeth into me. Normally I can work through colds and the like (I've even been known to work through flus without so much as a complaint), but this time the combination with other pressures in my life gave it an opening. It took full advantage. Phew!