Archive for the ‘Personal’ Category:
When simple plans attack!
Well, at the zoo we lost a pair of glasses, so off to LensCrafters… no eye exam in 3 years… so eye exam… 2 hours for glasses… and WHAM, a day that’s supposed to end at 4:00pm ends at 9:00pm… sigh…
we now return you to your regularly scheduled blogging.
nasty regex
I’m putting this here for documentation purposes… because getting it right was a very frustrating ordeal (I’d never had to match both positively and negatively in the same regex before).
/^(?(?!.+\.php)^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)|^$)/s
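What this is, essentially, saying is “true if the string doesn’t match ^.+\.php and the string matches ^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png)”. The (?(condition)yes|no) construct is a conditional subpattern: the negative lookahead is the condition, the image match is the “yes” branch, and ^$ is the “no” branch. That last bit, “|^$”, never returns true in my case, because we’re matching on URIs, which are always at least one character long (“/”).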
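For comparison, the same rule can be spelled out as two plain tests instead of one conditional pattern. This is only a sketch, not the code that was actually deployed: it swaps the single PCRE for two POSIX extended regexes run through grep, the function name is_static_image is made up for illustration, and it assumes the URI is a single line of text.

```bash
# Hypothetical helper: succeed when a URI looks like a static image
# and does not contain ".php" after its first character.
is_static_image() {
    local uri=$1
    # Negative half, mirroring (?!.+\.php): reject anything with ".php" in it.
    if printf '%s\n' "$uri" | grep -Eq '.+\.php'; then
        return 1
    fi
    # Positive half, mirroring ^.*(\.jpg|\.jpeg|\.gif|\.ico|\.png).
    printf '%s\n' "$uri" | grep -Eq '\.(jpg|jpeg|gif|ico|png)'
}
```

Two tests are easier to read than the conditional, at the cost of a second pass over the string. Where only a single pattern is accepted, the one-regex form is still the way to go.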
All things being equal, the simplest solution tends to be the best one.
Occam’s razor strikes again!
Tonight we ran into an interesting problem. A web service, with a very simple time-elapsed check, started reporting negatives… Racking our brains and poring over the code produced nothing. It was as if the clock were jumping around randomly! No! On a whim Barry checked it and the clock was, indeed, jumping around…
# while [ 1 ]; do date; sleep 1; done Wed May 30 04:37:52 UTC 2007 Wed May 30 04:37:53 UTC 2007 Wed May 30 04:37:54 UTC 2007 Wed May 30 04:37:55 UTC 2007 Wed May 30 04:37:56 UTC 2007 Wed May 30 04:37:57 UTC 2007 Wed May 30 04:37:58 UTC 2007 Wed May 30 04:37:59 UTC 2007 Wed May 30 04:38:00 UTC 2007 Wed May 30 04:38:01 UTC 2007 Wed May 30 04:38:02 UTC 2007 Wed May 30 04:38:19 UTC 2007 Wed May 30 04:38:21 UTC 2007 Wed May 30 04:38:22 UTC 2007 Wed May 30 04:38:23 UTC 2007 Wed May 30 04:38:24 UTC 2007 Wed May 30 04:38:08 UTC 2007 Wed May 30 04:38:09 UTC 2007 Wed May 30 04:38:10 UTC 2007 Wed May 30 04:38:28 UTC 2007 Wed May 30 04:38:12 UTC 2007 Wed May 30 04:38:30 UTC 2007 Wed May 30 04:38:31 UTC 2007 Wed May 30 04:38:32 UTC 2007 Wed May 30 04:38:33 UTC 2007 Wed May 30 04:38:34 UTC 2007 Wed May 30 04:38:35 UTC 2007 Wed May 30 04:38:19 UTC 2007 Wed May 30 04:38:20 UTC 2007 Wed May 30 04:38:21 UTC 2007 Wed May 30 04:38:22 UTC 2007 Wed May 30 04:38:40 UTC 2007 Wed May 30 04:38:41 UTC 2007 Wed May 30 04:38:42 UTC 2007 Wed May 30 04:38:43 UTC 2007 Wed May 30 04:38:44 UTC 2007
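Eyeballing that output works, but the same check can be automated. The sketch below is an illustration only: both function names are invented, and it assumes a date command that understands +%s (epoch seconds).

```bash
# Hypothetical sampler: print the current epoch time once per second, N times.
sample_clock() {
    local n=${1:-60}
    while [ "$n" -gt 0 ]; do
        date +%s
        sleep 1
        n=$((n - 1))
    done
}

# Hypothetical checker: read one epoch timestamp per line and report
# every reading that is earlier than the one before it.
report_backward_jumps() {
    local prev="" now
    while read -r now; do
        if [ -n "$prev" ] && [ "$now" -lt "$prev" ]; then
            echo "clock went backwards: $prev -> $now ($((prev - now))s)"
        fi
        prev=$now
    done
}
```

Running sample_clock 60 | report_backward_jumps on a healthy machine should print nothing. For the elapsed-time check itself, a monotonic clock (where the platform offers one) sidesteps the problem, because wall-clock time is allowed to move backwards.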
PHP CLI Status Indicator
Most times when people write command line scripts they just let the output flow down the screen as a status indicator, or just figure “it’s done when it’s done.” But sometimes it would be nice to have a simple, clean status indicator, allowing you to monitor progress and gauge time-to-completion. This is actually very easy to accomplish. Simply use \r instead of \r\n in your output: a bare carriage return sends the cursor back to the start of the current line, so the next write overwrites the previous one. Obviously the example below is very simplified, and this can be applied in a much more sophisticated fashion. But it works.
$row_count = get_total_rows_for_processing();
$limit = 10000;
// Start on a fresh line; %3d pads the number so the bracket width never changes
printf( "\r\n[%3d%%]", 0 );
for ( $i = 0; $i < $row_count; $i += $limit ) {
    $query = "SELECT * FROM table LIMIT {$limit} OFFSET {$i}";
    // do whatever
    // Rows handled so far, capped at 100 because the last batch may be short
    $pct = min( 100, (int) round( ( ( $i + $limit ) / $row_count ) * 100 ) );
    // \r returns to the start of the line so this write replaces the last one
    printf( "\r[%3d%%]", $pct );
}
echo "\r[100%]\r\n";
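The same trick carries over to shell scripts. What follows is a minimal sketch, not part of the original script: show_progress is a made-up name, and it leans on printf’s %3d to keep the field a fixed width.

```bash
# Hypothetical helper: redraw a fixed-width percentage on the current line.
show_progress() {
    local done_count=$1 total=$2 pct=100
    if [ "$total" -gt 0 ]; then
        pct=$((done_count * 100 / total))
    fi
    if [ "$pct" -gt 100 ]; then
        pct=100
    fi
    # \r returns to column 0; %3d right-aligns so the brackets never shift.
    printf '\r[%3d%%]' "$pct"
}
```

Call it once per batch, for example show_progress "$rows_done" "$row_count", and finish with a plain printf '\n' so the prompt lands on its own line.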
Backgrounding Chained Commands in Bash
Sometimes it’s desirable to have a chain of commands backgrounded so that a multi-step process can be run in parallel. And oftentimes it’s not desirable to have yet another script made to do a simple task that doesn’t warrant the added complexity. An example of this would be running backups in parallel. The script snippet below allows up to 4 tar backups to run at once, recording the start and stop times of each individually, and then waits for all the tar processes to finish before exiting.
max_tar_count=4
for i in 1 3 5 7 2 4 6 8
do
    # Count the tar processes currently running
    cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
    # Hold off while we are at the limit
    while [ $cur_tar_count -ge $max_tar_count ]
    do
        sleep 60
        cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
    done
    # The whole chain runs as one backgrounded subshell
    ( date > /backups/$i.start &&
      tar -cf /backups/$i.tar /data/$i &&
      date > /backups/$i.stop ) &
done
# Wait for every remaining tar to finish before exiting
cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
while [ $cur_tar_count -gt 0 ]
do
    sleep 60
    cur_tar_count=$(ps wauxxx | grep -v grep | grep tar | wc -l)
done
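The real magic above is the parenthesized chain ending in “)&”: the parentheses group the three commands into one subshell, and the & backgrounds the whole group. You DO want that last loop in there to make the script wait until all the backups are really done before exiting.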
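Counting tar processes with ps also counts any unrelated tar running on the box. A possible alternative is to let the shell track its own children with $! and wait. The sketch below is one way to do that under a few assumptions: BACKUP_DIR and DATA_DIR default to the paths used above, and backup_one and run_limited are names invented for this example.

```bash
# Assumed locations, taken from the paths in the script above; override as needed.
: "${BACKUP_DIR:=/backups}"
: "${DATA_DIR:=/data}"

# One backup chain: start stamp, archive, stop stamp.
backup_one() {
    date > "$BACKUP_DIR/$1.start" &&
    tar -cf "$BACKUP_DIR/$1.tar" -C "$DATA_DIR" "$1" &&
    date > "$BACKUP_DIR/$1.stop"
}

# Hypothetical helper: run backup_one for each name, at most $1 at a time.
run_limited() {
    local max=$1 running=0 pids="" oldest name
    shift
    for name in "$@"; do
        if [ "$running" -ge "$max" ]; then
            oldest=${pids%% *}   # PID at the front of the queue
            pids=${pids#* }      # drop it from the queue
            wait "$oldest"
            running=$((running - 1))
        fi
        backup_one "$name" &
        pids="$pids$! "
        running=$((running + 1))
    done
    wait   # block until every remaining background job has exited
}
```

Usage would be run_limited 4 1 3 5 7 2 4 6 8. This version waits for the oldest job rather than for whichever finishes first, so it can briefly run fewer than the maximum. On bash 4.3 or newer, wait -n would close that gap.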
You’re only ever done debugging for now.
I’m the kinda guy who owns up to my mistakes. I also strive to be the kinda guy who learns from them. So I figured I would pass this on as some good advice from a guy who’s “screwed that pooch.”
There was a project on which I was working, and that project sent me e-mail messages with possible problem alerts. All was going well, and at some point I turned off those alerts. I don’t remember when. And I don’t remember why. Which means I was probably “cleaning up” the code. It was, after all, running well (I guess). But along comes a bug introduced with new functionality (ironically, from somewhere WAAAAAAY up the process chain from my project). And WHAM, errors up the wazoo. But no e-mails. Oops. Needless to say the cleanup process was long and tedious… especially for something that was avoidable.
I’ve since put the alerting code back into the application, and have my happy little helpers in place fixing the last of the resulting issues.
The lesson to be taken from this is that you’re only ever done debugging for now. Because tomorrow that code, which is working perfectly now, won’t be working perfectly anymore. And the sources of entropy are, indeed, endless.
DivShare, Day 1 (raw commentary)
I began looking at DivShare a few days ago as a way to store, save, and share my personal photo collection. The idea of auto-galleries, unlimited space, flash video, and possible FTP access was… enticing. But it’s tough to tell how something like this is going to work on a large scale…
So… after messing around with a free DivShare account for a while I decided it was more worth my while to pay 10 bucks for a pro account and get FTP access than to try and use mechanize (or something similar) to hack out my own makeshift API. Now I have about… Oh… 8,000 files I want to upload… So… doing that 10 at a time was just _NOT_ going to happen…
After paying for a pro account I was *immediately* granted FTP access, no waiting. And for that I was grateful. Since I take photos at 6MP, and that’s WAY too large for most online uses, I have a shell script which automagically creates thumbnails at 5%, 10%, and 25% of the original size. This meant that I had an expansive set of files I could upload and only take a couple of hours doing it (the 5% thumbs end up being less than 200MB). This, I thought, would be an excellent test of their interfaces.
So an-ftping-i-a-go. Upload all my files into a sub directory (005). Visit the dash. Nothing. Visit the ftp-upload-page to recheck… maybe I did something wrong. AND WHAM! An 8,000 check box form to accept ftp uploaded files… ugh. Thankfully they’re all checked by default. I let it load (for a long while) and hit submit… and wait… and wait… and wait. Then the server side connection times out a while later. Fair enough. Check my dash… about 1500 of the 8,000 photos were imported… I’m going to have to do this 6 times. Annoying, but doable. Hit the second submit, and pop open another browser to look at my dash. And DivShare did *nothing* with my folder name… that wasn’t translated to a “virtual” folder at all. Tsk tsk.
So I need to put about 1500 photos, manually, into an 005 folder… and then I realize… I have to do this 20 files at a time… with no way to just show files that are not currently in a folder.
… uh no …
Ok, so I open up one of the photos that I DID put into the 005 folder, and it did, in fact, make them into a “gallery” of sorts. It made a thumbnail, and displayed all 3,000 photos side by side in something similar to an iframe… no rows. Just one row… 3,000 columns… and waiting as my browser requests each… and every… thumb… from DivShare. Wonderful. The gallery controls are simple enough: an iframe with a scrollbar at the bottom, a next photo link, and a previous photo link. And all 3 controls make you lose your place in the iframe when you use them…
Now don’t get me wrong. You get what you pay for. But hey… I did pay this time ;) The service is excellent for what it does. And my use case was a bit extreme. Still, I hope that they address the issues that I’ve pointed out. I’d really like to continue using them, and if they can make my photo process easier I’ll gladly keep paying them $10/mo.
That’s:
- Don’t ignore what pro users are telling you when they upload
- Process large-accepts in the background, let me know I need to come back later
- Negative searching (folder == nil)
- Mass file controls (Either items/page, or all-items-in-view (folder == nil))
- Give me a gallery a non-broadband user can use (1500 thumbs in one sitting tastes bad, more filling)
- Don’t undo what I’ve done in the gallery every click. Finding your place among 8,000 photos is tedious to do once
And I know I sound like I’m just complaining. And I am. But this is web 2.0 feedback baby. Ignore my grouchiness, and (If I’m lucky) take my suggestions and run with them asap. The photo/files market is very very far from cornered!
Most people wont care…
Us web 2.0 and web 3.0 people have a hard time caring about the things that normal people care about. And we have a hard time believing that people don’t care about the things that we do. In short we’re a large group of very detached individuals who are, more or less, free to form ideas into substance in the vacuum of our own creation.
I often have a hard time coming to grips with this concept myself. WHAT DO YOU MEAN nobody will care about this idea?! It’s great. But after a while chewing on that, I’ll grudgingly admit that while it may be a great idea… Almost nobody will care.
So when I saw, a few days ago, a bit of a fuss being kicked up over Google wanting your browsing history, I surprised myself by offhandedly thinking: “nobody but us cares.” And I still think that. As a matter of fact, I think that in a utilitarian sense most everybody will embrace the idea.
The problem is in search. Google has taken keyword search straight to the edge. And now people are hungering for the next search. Search 4.5 beta. And that’s relevancy. I’m a dog lover (I have 3 large dogs) so let me give you an example from my world.
Let’s assume I just got a new puppy and she’s SUPER submissive. Peeing all over, shakes, just scared. If I go to Google and type “submissive bitch”… I don’t get what I was looking for. Now if Google has my browser history and sees that I frequent the Chazhound Dog Forums, Google has the information necessary to determine that I’m not looking for sex but for dog-related topics.
This is why most people will not only not care but will embrace giving Google more data. Sure, I care. You care. But let’s not fool ourselves into thinking that everybody else cares too.
tags, items, users - lose the joins - gain the freedom.
Long time readers of this blog will attest that I have no problem presenting an unpopular opinion and/or sticking my foot in my mouth. Sometimes both at once! (“But wait… there’s more!”) So when N. Shah asks me how he should split his database (a tags table, items table, and users table) I say: the answer is in the question.
You have only one database
Let’s drop the pretense, folks. Let’s come back to the real world. This is the web 2.0 world. Data is growing at a seriously exponential rate. And desperate times call for desperate measures.
Joins are nice. They’re pretty. They’re convenient. They keep us from having to think very much. But they do NOT promote using commodity hardware for your databases. They just don’t. No, really, an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster raid… And then you move to buying SANs and you’re talking about some serious cash for that cache. Or… You think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.
So, N, here’s my answer: Split your database once, and then your databases once.
DB:/users
DB:/items
DB:/tags
becomes
DBTags:/tags
DBUsers:/users
DBItems:/items
And then
DBUsers:/users
Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER, depending on your architecture and uses, you could easily split the users as we do the tags (not items) below.
DBItems:/
- items_id_ending_in_0
- items_id_ending_in_1
- items_id_ending_in_2
- items_id_ending_in_3
- items_id_ending_in_4
- items_id_ending_in_5
- items_id_ending_in_6
- items_id_ending_in_7
- items_id_ending_in_8
- items_id_ending_in_9
Again, pretty simple. You have your run-of-the-mill integer item ids; split them by the last digit of the id and you reduce the footprint of any one table to 1/10th of the whole dataset size.
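To make the routing concrete, here is a minimal sketch of picking the table for an item. The function name is made up, and it assumes ids are plain decimal integers without leading zeros.

```bash
# Hypothetical helper: map an integer item id to the table that holds it.
item_table_for() {
    local id=$1
    # The last decimal digit is simply the id modulo 10.
    echo "items_id_ending_in_$((id % 10))"
}
```

Every read or write for an item goes through a lookup like this first, so the application always knows which table (and therefore which server) to talk to.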
DBTags:/
- tags_crc_ending_in_0
- tags_crc_ending_in_1
- tags_crc_ending_in_2
- tags_crc_ending_in_3
- tags_crc_ending_in_4
- tags_crc_ending_in_5
- tags_crc_ending_in_6
- tags_crc_ending_in_7
- tags_crc_ending_in_8
- tags_crc_ending_in_9
Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word and storing it with the tag {id|tag|crc32}, you can pick the right table from the word alone, quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.
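A rough sketch of the tag side follows. One loud assumption: it uses the POSIX cksum utility, which computes a 32-bit CRC but not the same variant as PHP’s crc32(), so the numbers will differ from what a PHP application would store. Whatever function is chosen has to be used consistently everywhere. Both function names are invented for this example.

```bash
# Hypothetical helper: 32-bit checksum of a tag, via POSIX cksum.
tag_crc() {
    printf '%s' "$1" | cksum | cut -d ' ' -f 1
}

# Hypothetical helper: map a tag word to the table that holds it.
tag_table_for() {
    local crc
    crc=$(tag_crc "$1")
    echo "tags_crc_ending_in_$((crc % 10))"
}
```

Because the checksum depends only on the word, any server can work out where a tag lives without asking a central lookup table first.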
You can still use your join tables items_to_users and tags_to_items. These tables, consisting of ints, take up almost _NO_ space whatsoever, and so can go where convenient (if you query items for users more than users for items, then put the join table in the users db), but you can’t actually perform in-server full joins any longer. Heck, you can even keep two copies of the join data: items_to_tags in the items dbs, and tags_to_items in the tags dbs.
So, like many things in life, going cheaper meant going a bit harder. But what did we gain? Well, let’s assume 10 EC2 instances…
Ec2a
- users (w)
- items 0-1 (w)
- tags 0-1 (w)
Ec2b
- items 2-3 (w)
- tags 2-3 (w)
Ec2c
- items 4-5 (w)
- tags 4-5 (w)
Ec2d
- items 6-7 (w)
- tags 6-7 (w)
Ec2e
- items 8-9 (w)
- tags 8-9 (w)
Ec2f
- items 0-1 (r)
- tags 0-1 (r)
Ec2g
- users (r)
- items 2-3 (r)
- tags 2-3 (r)
Ec2h
- items 4-5 (r)
- tags 4-5 (r)
Ec2i
- items 6-7 (r)
- tags 6-7 (r)
Ec2j
- items 8-9 (r)
- tags 8-9 (r)
So that’s a total of about… oh… 1.6 terabytes of space… 18GB of RAM, 17GHz of processor speed, and an inherently load balanced set of database instances. And when you need to grow? Split by the last 2 digits (16TB), 3 digits (160TB), 4 digits (1,600TB)…
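With the layout listed above, finding the right instance is a small lookup. The sketch below only illustrates that mapping for items and tags. The function name is made up, the users database (written on Ec2a, read on Ec2g) is left out, and a real setup would presumably keep this in configuration rather than in code.

```bash
# Hypothetical helper: pick the instance for a shard digit (0-9).
# mode is "w" for the writable copy, anything else for the read copy.
host_for_shard() {
    local digit=$1 mode=$2 hosts
    if [ "$mode" = w ]; then
        hosts="a b c d e"
    else
        hosts="f g h i j"
    fi
    # Two shard digits per host: 0-1 -> field 1, 2-3 -> field 2, and so on.
    echo "Ec2$(echo "$hosts" | cut -d ' ' -f "$((digit / 2 + 1))")"
}
```

For example, host_for_shard 3 r names the read copy for items and tags ending in 3.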
So, now that you’ve read to the bottom: it’s 1:00am, way past my bed time. Remember that when designing a database you, above all, need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (that’s why it’s called “your problem”) but techniques can be applied, and outlooks can be leveraged.
Disclaimer: some or all of this might be wrong, there may be better ways, don’t blame me. I’m sleep-typing.
Whoa, talk about neglecting your weblog! Bad Form!
I know, I know, I’ve been silent for quite some time. Well, let me assure you that I’m quite all right! Are you less worried about me now? Oh good. (Yes I’m a cynical bastage sometimes.)
So life has, as it tends to do, come at me pretty fast. I’ve left my previous employer, Ookles, and I wish them all the best in accomplishing everything that they’ve been working towards. And I’ve joined up with the very smart, very cool guys at Automattic. I have to tell you I’m excited to be working with these guys, they’re truly a great group.
I guess that means I’m… kind of… like… obligated to keep up on my blog now, eh? I’m also kind of, like, exhausted. Jumping feet first into large projects has a tendency to do that to a guy though. And truth be told I wouldn’t have it any other way…
Cheers
DK