PHP CLI Status Indicator

Most times when people write command-line scripts they just let the output flow down the screen as a status indicator, or just figure "it's done when it's done." But sometimes it would be nice to have a simple, clean status indicator, allowing you to monitor progress and gauge time-to-completion. This is actually very easy to accomplish. Simply use "\r" (a carriage return, which moves the cursor back to the start of the line) instead of "\n" (a newline) in your output. Obviously the example below is very simplified, and this can be applied in a much more sophisticated fashion. But it works.

$row_count = get_total_rows_for_processing();
$limit = 10000;
echo "\r\n[  0%]";
for ( $i = 0; $i <= $row_count; $i = $i + $limit ) {
  $query = "SELECT * FROM table LIMIT {$limit} OFFSET {$i}";
  // do whatever
  // rows processed so far is $i + $limit, capped at the total
  $pct = min(100, round((($i + $limit) / $row_count) * 100));
  // pad so the indicator is always the same width, then overwrite the line with \r
  if ( $pct < 10 ) {
    echo "\r[  $pct%]";
  } elseif ( $pct < 100 ) {
    echo "\r[ $pct%]";
  } else {
    echo "\r[$pct%]";
  }
}
echo "\r[100%]\r\n";

Backgrounding Chained Commands in Bash

Sometimes it's desirable to have a chain of commands backgrounded so that a multi-step process can be run in parallel. And often it's not desirable to write yet another script to do a simple task that doesn't warrant the added complexity. An example of this would be running backups in parallel. The script snippet below would allow up to 4 simultaneous tar backups to run at once, recording the start and stop times of each individually, and then wait for all the tar processes to finish before exiting.

max_tar_count=4
for i in 1 3 5 7 2 4 6 8
do
  # if we're already at the limit, wait for a slot to free up
  cur_tar_count=$(ps auxww | grep -v grep | grep tar | wc -l)
  while [ $cur_tar_count -ge $max_tar_count ]
  do
    sleep 60
    cur_tar_count=$(ps auxww | grep -v grep | grep tar | wc -l)
  done
  # background the whole chain so the next iteration can start immediately
  ( date > /backups/$i.start &&
      tar cf /backups/$i.tar /data/$i &&
      date > /backups/$i.stop ) &
done
# wait for every remaining tar to finish before the script exits
cur_tar_count=$(ps auxww | grep -v grep | grep tar | wc -l)
while [ $cur_tar_count -gt 0 ]
do
  sleep 60
  cur_tar_count=$(ps auxww | grep -v grep | grep tar | wc -l)
done

The real magic above is that final while loop. You DO want it in there so the script waits until all the backups are really done before exiting.

tags, items, users – lose the joins – gain the freedom.

Long-time readers of this blog will assert that I have no problem presenting an unpopular opinion, and/or sticking my foot in my mouth. Sometimes both at once! ("But wait… there's more!") So when N. Shah asks me how he should split his database (a tags table, items table, and users table) I say: the answer is in the question.

You have only one database

Let's drop the pretense, folks. Let's come back to the real world. This is the web 2.0 world. Data is growing at a seriously exponential rate. And desperate times call for desperate measures.

Joins are nice. They're pretty. They're convenient. They keep us from having to think very much. But they do NOT promote using commodity hardware for your databases. They just don't. No, really, an in-database join chains you to an in-database solution. You *could* keep upgrading and upgrading… faster processors… larger disks… faster RAID… And then you move to buying SANs and you're talking about some serious cash for that cache. Or… you think about things differently. You put in a little work up front. And you break the mold. Because one database ties you to one server. And that, my friends, is the problem.

So, N, here's my answer: split your database once, and then split your databases once more.

DB:/users

DB:/items

DB:/tags

becomes

DBTags:/tags

DBUsers:/users

DBItems:/items

And then

DBUsers:/users

Pretty simple… users tend to be a small table, and keeping them in one place makes a lot of sense here. HOWEVER, depending on your architecture and usage you could easily split the users as we do the tags (not the items) below.

DBItems:/

  • items_id_ending_in_0
  • items_id_ending_in_1
  • items_id_ending_in_2
  • items_id_ending_in_3
  • items_id_ending_in_4
  • items_id_ending_in_5
  • items_id_ending_in_6
  • items_id_ending_in_7
  • items_id_ending_in_8
  • items_id_ending_in_9

Again, pretty simple. You have your run-of-the-mill integer item ids: split them by the last digit of the id, and you reduce the footprint of any one table to 1/10th of the whole dataset size.
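For what it's worth, a minimal sketch of routing an item lookup by that last digit might look like this (PHP, with made-up table and connection names, using the old mysql_* functions for brevity):

// Route an item lookup to its table by the last digit of the id.
// $db_items is assumed to be an open connection to the DBItems server.
function get_item($db_items, $item_id) {
  $table = "items_id_ending_in_" . ($item_id % 10); // last digit picks the table
  $sql   = "SELECT * FROM {$table} WHERE id = " . (int) $item_id;
  return mysql_fetch_assoc(mysql_query($sql, $db_items));
}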

DBTags:/

  • tags_crc_ending_in_0
  • tags_crc_ending_in_1
  • tags_crc_ending_in_2
  • tags_crc_ending_in_3
  • tags_crc_ending_in_4
  • tags_crc_ending_in_5
  • tags_crc_ending_in_6
  • tags_crc_ending_in_7
  • tags_crc_ending_in_8
  • tags_crc_ending_in_9

Now here is a little bit of voodoo. You have these tags, and tags are words. And I like numbers. Numbers make life easy. So by creating a CRC32 hash of the word and storing it with the tag {id|tag|crc32}, you can quickly reverse the tag to an id, and then go find items with that tag id associated, while still retaining the ability to split the db by powers of 10.
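As a rough sketch (again PHP, names made up), resolving a tag to its shard table and id might go something like:

// Turn the tag word into a number with crc32(), then use the last digit of
// that number to pick the tags table. sprintf('%u', ...) keeps the crc unsigned.
function get_tag_id($db_tags, $tag) {
  $crc   = sprintf('%u', crc32($tag));
  $table = "tags_crc_ending_in_" . ($crc % 10); // last digit of the crc picks the table
  $sql   = "SELECT id FROM {$table} " .
           "WHERE crc = {$crc} AND tag = '" . mysql_real_escape_string($tag, $db_tags) . "'";
  $row   = mysql_fetch_assoc(mysql_query($sql, $db_tags));
  return $row ? (int) $row['id'] : false;
}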

You can still use your join tables items_to_users and tags_to_items; these tables, consisting only of ints, take up almost _NO_ space whatsoever, and so can go wherever convenient (if you query items for users more than users for items, then put the join table in the users db). But you can't actually perform in-server full joins any longer. Heck, you can even keep two copies of the join data: items_to_tags in the items dbs, and tags_to_items in the tags dbs. A sketch of what the application-level "join" looks like follows below.
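To be concrete, the "join" simply moves into application code. A sketch building on the hypothetical helpers above (and glossing over the fact that each shard may live on a different server, so you'd really pick a connection per shard):

// Application-level "join": tag -> tag id -> item ids -> items.
function get_items_for_tag($db_tags, $db_items, $tag) {
  $tag_id = get_tag_id($db_tags, $tag);
  if (!$tag_id) return array();
  $items = array();
  $res = mysql_query("SELECT item_id FROM tags_to_items WHERE tag_id = {$tag_id}", $db_tags);
  while ($row = mysql_fetch_assoc($res)) {
    // each item id routes itself to the right items_id_ending_in_N table
    $items[] = get_item($db_items, (int) $row['item_id']);
  }
  return $items;
}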

So, like many things in life, going cheaper meant going a bit harder. But what did we gain? Well, let's assume 10 EC2 instances…

Ec2a

  • users (w)
  • items 0-1 (w)
  • tags 0-1 (w)

Ec2b

  • items 2-3 (w)
  • tags 2-3 (w)

Ec2c

  • items 4-5 (w)
  • tags 4-5 (w)

Ec2d

  • items 6-7 (w)
  • tags 6-7 (w)

Ec2e

  • items 8-9 (w)
  • tags 8-9 (w)

Ec2f

  • items 0-1 (r)
  • tags 0-1 (r)

Ec2g

  • users (r)
  • items 2-3 (r)
  • tags 2-3 (r)

Ec2h

  • items 4-5 (r)
  • tags 4-5 (r)

Ec2i

  • items 6-7 (r)
  • tags 6-7 (r)

Ec2j

  • items 8-9 (r)
  • tags 8-9 (r)

So that's a total of about… oh… 1.6 terabytes of space, 18GB of RAM, 17GHz of processor speed, and an inherently load-balanced set of database instances. And when you need to grow? Split by the last 2 digits (16TB), 3 digits (160TB), 4 digits (1,600TB)…

So, now that you've read to the bottom: it's 1:00am, way past my bed time. Remember that when designing a database you, above all, need to listen to your data. Nobody will come up with a solution that perfectly fits your problem (that's why it's called "your problem"), but techniques can be applied, and outlooks can be leveraged.

Disclaimer: some or all of this might be wrong, there may be better ways, don't blame me. I'm sleep-typing 😉

CryoPID

Now this is cool: CryoPID, a process freezer for Linux.

“CryoPID allows you to capture the state of a running process in Linux and save it to a file. This file can then be used to resume the process later on, either after a reboot or even on another machine.

CryoPID was spawned out of a discussion on the Software suspend mailing list about the complexities of suspending and resuming individual processes.

CryoPID consists of a program called freeze that captures the state of a running process and writes it into a file. The file is self-executing and self-extracting, so to resume a process, you simply run that file. See the table below for more details on what is supported.”

I find myself wondering: Could this be a new way of distributing interpreted language desktop apps as binary files without releasing the source?

Infinidisk Update

I mentioned a while back that I was going to be playing with the S3 Infinidisk product. What I found in my testing was that this product is not prime-time ready. There was a nasty bug which caused data to be lost if the mv command was used. The scripts themselves were unintuitive. They required fancy-pants nohupping or screening to use long term. Oh, and a database definitely will not work on top of this FS. It seems obvious in retrospect, but I wanted to be sure. InnoDB won't even build its initial files, much less operate, on the FS. To top it all off, my pre-sales support question was never even so much as acknowledged.

No, I think I’ll be leaving this product alone for now and sticking with clever uses of s3sync and s3cmd, thanks.

Google & Microsoft Working Towards the Perfect Datacenter

We all knew that this would happen: Google and Microsoft vying to build the biggest field of silicon trees. But what does this mean, and does it tie in with Amazon's latest services?! I think that undoubtedly it does.

There’s talk about a last man standing game when it comes to internet bandwidth.  And I can imagine a time when we might see the internet behaving like the freeways in L.A. at rush hour.  But this is more, I think.

I've mentioned before that the whole goal here is to "be the internet." I don't think that goal has changed recently. Google has shown the world two things: first, that there's a vast amount of power to be wielded by being "the internet" to the average Tom, Dick, and Harry; and second, that the title is *always* up for grabs. A while back Yahoo! was the internet, before that AOL was the internet, before that newsgroups were the internet. Need I say more? And each of those companies wielded an extreme sway over the comings and goings of the internet.

But now the internet means a lot more than it used to. Now the internet is sales, it's revenue, it's marketing; people are watching, people are reading, people are listening, and – most importantly – people are being influenced by this "newfangled internet thing." "Oh, you mean Google?"

So there's now a lot more riding on who gets to "be the internet" these days. The one thing that ginormous corporate entities can't seem to get a hold of is the fickle way in which the internet is backwards from real-world businesses. In the real world it's all too common for a newcomer to storm into a market, take hold of it with a genuinely better product, and then let all that slip away into mediocrity and poor quality. And the kicker is that people will *still* pay for it if it's crap… as long as it's tangible. But the internet is fickle. It's sort of tangible but more or less ethereal.

I think for the first time people outside the scientific communities are getting wind of a crazy idea: insubstantial value. That is, something that didn't have value a minute ago, won't have value a minute from now, but at the moment is extremely valuable. Which, inherently, means that this thing has the constant need to justify itself. I'm no economics guy, and I'm certainly not in touch with the "average Joe" (who would almost certainly not follow me through more than two or three blog posts), but I think the difference here is that there's no physical reality to intimidate us.

We don't have to grow particularly attached to anything on the internet because it's not "in our lives"; we're in its life. It doesn't take up space in our house, we take up space in its house. For once in our lives we find that we aren't the ones who are at the mercy of demand, but are – in fact – in demand. It's a feeling of empowerment that is slowly but surely changing the world. Mark my words: children in classrooms 100 years from now will be studying the historical impact of all of the events which are happening before our eyes at this very moment, in this place that's not a place.

I think I've become sidetracked. Oh yes: consumers being in demand, corporations unable to handle the discrepancies between the actions of the same people online and offline, and… ah yes… the underdog.

Why do you think it is that in this virtual world it's so often the couple of guys who met in college coding outside a cafe, or the dude in his mom's basement, or a couple of people who tried to do one thing but failed fantastically into doing something else completely right? Because people of talent are, all of a sudden, relinquished of the necessity to offer anything physical… People with a talent for the ethereal, all of a sudden, have a place in which the ethereal acquires value.

And, as in any underdog story, these small (sometimes rare) meteoric rises to the top will carry others with them.  And these are the kind of people who remember the hands that helped them up.

So, sure, bandwidth and all that. But the people who make it easiest for those suited to developing the intangible will have everything to gain in the long run. Amazon sees this, and is doing an amazing job with it. Their recent successes with S3, SQS, and EC2 are testimony to their understanding of this new ecosystem. But they ought not to think that Google and Microsoft haven't noticed where the young blood is heading.

Make no mistake, Amazon has made extremely agile, grassroots moves to "be the internet" from the bottom up… but there will soon be a clash of services as G and M do the same from the "top down" and "sideways in" respectively.

I will say this: The first company to crack the database problem will have a distinct advantage in the struggles to come.

Disclaimer: Everything I just said is more than likely to be complete nonsense, as I just kind of rambled it out "stream of consciousness" style.

Amazon EC2 Cookbook: Startup Flexibility

Disclaimer: these code segments have not been really “tested” verbatim. I assume anyone who is successfully bundling EC2 images (that run) will know enough about copying shell scripts off blogs to test for typos, etc! Oh, and, sorry for lack of indentation on these… I’m just not taking the time 🙂

I've been searching for a way of using EC2 in a production environment: one that keeps things as simple as possible, but also eliminates the need for unnecessary (and extremely tedious) image building both during and after the development process (development of the AMI, not the service). This is what I've come up with.

Step 1: our repository
Create a Subversion repository which is web-accessible and password-protected (of course), laid out like so:

  • ami/trunk/init.sh
  • ami/trunk/files/
  • ami/tags/bootstrap.sh
  • ami/tags/

ami/tags/bootstrap.sh would read:

#!/bin/bash

BootLocation="ami/trunk"
BootHost="svnhost.com"
BootUser="username"
BootPass="password"
BootProtocol="http"

## Prepare the bootstrap directory
echo -en "\tPreparing... "
if [ -d /mnt/ami ]
then
  rm -rf /mnt/ami
fi
mkdir -p /mnt/ami/
rc=$?
if [ $rc -ne 0 ]; then exit $rc; else echo "OK"; fi
## Populate the bootstrap directory from the repository
echo -en "\tPopulating... "
svn export --force \
  --username $BootUser \
  --password $BootPass \
  $BootProtocol://$BootHost/$BootLocation/ \
  /mnt/ami/ 1>/dev/null 2>/dev/null
rc=$?
if [ $rc -ne 0 ]; then exit $rc; else echo "OK"; fi
chmod a+x /mnt/ami/init.sh
## Hand off to the exported init script
echo -e "\tHanding off to init script..."
/mnt/ami/init.sh
exit $?

ami/trunk/init.sh would read something like:

#!/bin/bash
## Filesystem Additions/Changes
echo -en "\t\tSynchronizing System Files... "
cd /mnt/ami/files/
# recreate the directory tree under /
for i in $(find . -type d)
do
  mkdir -p "/$i"
done
echo -en "d"
# copy every file into place
for i in $(find . -type f)
do
  cp -f "$i" "/$i"
done
echo -en "f"
echo " OK"
## Any Commands Go Here
## All Done!
exit 0

Step 2: configure your AMI

  • create /etc/init.d/servicename
  • chkconfig --add servicename
  • chkconfig --level 345 servicename on
  • /etc/init.d/servicename should look something like:

    #!/bin/sh
    #
    # chkconfig: - 85 15
    # description: EC2 Bootstrapping Process
    #
    RETVAL=0
    case "$1" in
      start)
        # fetch the latest bootstrap script from the repository and run it
        /usr/bin/wget \
          -o /dev/null -O /mnt/bootstrap.sh \
          http://user:pass@svnhost/ami/tags/bootstrap.sh
        /bin/bash /mnt/bootstrap.sh
        RETVAL=$?
        ;;
      stop)
        exit 0
        ;;
      restart)
        $0 start
        RETVAL=$?
        ;;
      *)
        echo "Usage: $0 {start|stop|restart}"
        exit 1
        ;;
    esac
    exit $RETVAL

And now when the AMI boots itself up we hit 85 during runlevel 3 bootup (well after network initialization), servicename starts, and the bootstrapping begins. We're then able, with our shell scripts, to make a great deal of changes to the system after the fact. These changes might be bugfixes, or they might be setup processes to reconstitute a database and download the latest code from a source control repository located elsewhere… They might be registration via a DNS API… anything at all.

The point is that some flexibility is needed, and this is one way to build that in!

EC2 S3 PGSQL WAL PITR Infinidisk: The backend stack that just might change web services forever!

I have written mostly about MySQL here in the past. The reason for this is simple: MySQL is what I know. I have always been a die-hard "everything in its place and a place for everything" fanatic. I'll bash Microsoft with the best of them, but I still recognize their place in the market. And now it's time for me to examine the idea of PostgreSQL, and this blog entry about Amazon web services is the reason. I don't claim to exactly agree with everything said there… as a matter of fact I tend to disagree with a lot of it… but I saw "PS: Postgresql seems to win hands down over MySQL in this respect; WAL is trivial to implement with Postgresql)" and thought to myself: "hmm, what's that?"

I found the answer in the PostgreSQL documentation on Write Ahead Logging (WAL), and it all made sense! The specific end goal here is Continuous Archiving and Point-In-Time Recovery (PITR). This plus the S3 Infinidisk certainly makes for an interesting concept, one that I am eager to try out! I imagine that the community version of Infinidisk would suffice here since we're not depending on random access… that ought to make for some chewy goodness!
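For reference, the continuous-archiving piece mostly boils down to pointing PostgreSQL's archive_command at the S3-backed mount. A sketch (the /mnt/s3 path is made up; see the PostgreSQL PITR docs for the full recipe):

# postgresql.conf -- ship each completed WAL segment to the S3-backed disk
archive_command = 'cp %p /mnt/s3/wal_archive/%f'

# recovery.conf -- pull segments back during a point-in-time recovery
restore_command = 'cp /mnt/s3/wal_archive/%f %p'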