Bigger Aint Always Better

Recently I was troubleshooting some inefficiencies with the jobs systems locking and fetching queries at work. Like a good little boy I, originally, came up with one index which satisfied all the queries that I needed to run against this particular critical table.

 (`completed`,`heartbeat`,`priority`,`datacenter`,`worker`) 
-- datetime,datetime,tinyint,varchar(16),varchar(255)

the query looked like this:

SELECT `id`,GET_LOCK(CONCAT('foo_',`id`),0) as mylock FROM `foo_jobs` 
WHERE `completed` = '0000-00-00 00:00:00' AND `worker` = '' AND `dirty` = 0 
AND IS_FREE_LOCK(CONCAT('foo_',`id`)) = 1 AND `priority` = '0' LIMIT 1

This worked really well as long as the number of jobs in the table remained small.. But there was an event horizon of sorts… where after a certain number of rows (say 10k-25k) the query above took so long to get rows that the system could no longer catch up, but would fall invariably behind… Apparently, the problem was that there was too much data in the key. Because the requirements for these queries had changed slightly (due to other scaling improvements) I was able to simplify and break the indexes into parts which vastly outperform the previous index.

KEY `get_a_job` (`priority`,`workerpid`)
-- tinyint, int
KEY `janitor` (`completed`,`heartbeat`,`worker`,`dirty`)
-- datetime,datetime,varchar(255),tinyint

This works because workerpid is enough to tell us whether a job definition is in process (or was, and has not been cleaned yet.) and makes the following query run against a super small index…

SELECT `id`,GET_LOCK(CONCAT('foo_',`id`),0) as mylock FROM `foo_jobs`
WHERE `priority` = '0' AND `workerpid` IS NULL 
AND IS_FREE_LOCK(CONCAT('foo_,`id`)) = 1 LIMIT 1

If you’re interested in what dark monstrosity could possibly require queries like these (and are a brave PHP developer) I invite you to check out the jobs system


Posted on : Jun 15 2009
Posted under MySQL, PHP, Software Development |

Just what you need to know to write a CouchDB reduce function

Lets say you have the CouchDB classes (located here) all compiled together and included into your test.php script. Lets also say that you have created a database with the built-in web ui called “testing”. Finally let us say that your test.php has the following code in it, which would add a record to the db every time it is run. (i know that the data in the document serves no useful purpose… but really I just want to figure out this map/reduce thing so that I can make awesome views… so this suffices sufficiently.)

require_once dirname( dirname( __FILE__ ) ) . '/includes/couchdb.php';
$couchdb = new CouchDB('testing', 'localhost', 5984);
$key = microtime();
$result = $couchdb->send(
    '/'.md5($key),
    'put',
    json_encode(
        array(
            "_id" => md5($key),
            "time" => $key,
            'md5' => md5($key),
            'sha1' => sha1($key),
            'crc' => crc32($key)
        )
    )
);
print_r($result->getBody(true));

After running the code a bunch of times you would end up with a bunch of documents which look more or less like this:

picture-1(click for full size)

Now lets say you want to write a view that told you what the first characters of the _id were and how many documents share that first letter. This is analogous to the following in MySQL

SELECT LEFT(md5, 1) AS `lchar`, count(md5) FROM `md5table` GROUP BY `lchar`

Your map function is easy, because you dont have any selection criteria, so we process all rows

function(doc){ emit(doc._id,doc); }

The reduce function is where the actual programming comes in… And it seems there aren’t many well explained examples of exactly how to do this (I just brute forced it by trial and error)

function(key, values, rereduce) { 
    var output = {};
    if ( rereduce ) { 
        // key is null, and values are values returned by previous calls
	//
	// see http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
	//
	// essentially we are taking the previously reduced view, and the 
	// reduced view for new records, and we are reducing those two things
	// together.  Summarizing two summaries, essentially
        for ( var i in values ) {
	    // here we have multiple prebuilt output objects and we're simply combining them
  	    // just like below we have an array with a numeric id and an output object
	    // 
	    // retrieve a summary
            var vals = values[i];
            for ( var key in vals ) {
		// debugging
                // log(key);
		// 
		// store in or increment our new output object 
                if ( output[key] == undefined )
                    output[key] = vals[key];
                else
                    output[key] = output[key] + vals[key];
            }
        }
    } else {
        // key is an array, which we dont care about, and values are the 
	// values returned by the map
	//
	// see http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
	//
	// we are taking each document and processing that, reducing it down
	// to a summary object (output) for each of the rows passed
        for ( var i in values ) {
	    // we have an array, values, with numeric ids and a document objects
	    //
	    // retrieve a document
            var doc = values[i];
	    // get what we want from it, the first char of the md5
            var key = doc._id.substr(0, 1);
	    // debugging
            // log( key + " :: " + doc._id );
	    //
	    // store or increment the output object
            if ( output[key] !== undefined )
                output[key] = output[key] + 1;
            else
                output[key] = 1;
        }
    }
    // done
    return output;
}

and in code, using a temporary view, ( if you used this view all the time you would want to make it permanent… but this is about how to lay out a reduce function, nothing more ) so request code that looks like this

$view = array(
    'map' => 'function(doc){ emit(doc._id,doc); }',
    'reduce' => '
        function(key, values, rereduce) { 
            var output = {};
            if ( rereduce ) { 
                // key is null, and values are values returned by previous calls
                for ( var i in values ) {
                    var vals = values[i];
                    for ( var key in vals ) {
                        // log(key);
                        if ( output[key] == undefined )
                            output[key] = vals[key];
                        else
                            output[key] = output[key] + vals[key];
                    }
                }
            } else {
                // key is an array, which we dont care about, and values are the values returneb by the map
                for ( var i in values ) {
                    var doc = values[i];
                    var key = doc._id.substr(0, 1);
                    // log( key + " :: " + doc._id );
                    if ( output[key] !== undefined )
                        output[key] = output[key] + 1;
                    else
                        output[key] = 1;
                }
            }
            return output;
        }
    '
    );
$result = $couchdb->send('/_temp_view', 'POST', json_encode($view) );
print_r($result->getBody(true));

would give you output that looks like this:

stdClass Object
(
    [rows] => Array
        (
            [0] => stdClass Object
                (
                    [key] => 
                    [value] => stdClass Object
                        (
                            [0] => 15
                            [1] => 17
                            [2] => 16
                            [3] => 13
                            [4] => 27
                            [5] => 18
                            [6] => 26
                            [7] => 15
                            [8] => 18
                            [9] => 21
                            [a] => 12
                            [b] => 23
                            [c] => 20
                            [d] => 27
                            [e] => 28
                            [f] => 26
                        )
 
                )
 
        )
 
)

I hope this helps somebody out.


Posted on : Feb 18 2009
Posted under API, Business, CLI, MySQL, PHP, Random Thoughts, Software Development, Web Stuff |

random php… a multi-channel chat rooom class using memcached for persistence

why? i dunno… just because… just a toy…

no sql, no flat file, no write permissions required anywhere, no fuss

class mc_chat {
 
        var $chan = null;
        var $mc = null;
        var $ret = 5;
 
        function __construct($memcached, $channel, $retention=5) {
                $this->mc = $memcached;
                $this->chan = $channel;
                $this->ret = $retention;
        }
 
        function messages( $from=0 ) {
                $max = (int)$this->mc->get("$this->chan:max:posted");
                $min = (int)$this->mc->get("$this->chan:min:posted");
                $messages = array();
                for ( $i=$min; $i< =$max; $i++ ) {
                        if ( $i < $from )
                                continue;
                        $m = $this->get($i);
                        if ( $m['user'] && $m['message'] )
                                $messages[$i] = $m;
                }
                return $messages;
        }
 
        function get($id) {
                return array(
                        'user' =>(string)$this->mc->get("$this->chan:msg:$id:user"),
                        'message' => (string)$this->mc->get("$this->chan:msg:$id"),
                );
        }
 
        function add($user, $message) {
                $id = (int)$this->mc->increment("$this->chan:max:posted");
                if ( !$id ) {
                        $id=1;
                        $this->mc->set("$this->chan:max:posted", 1);
                }
                $this->mc->set("$this->chan:msg:$id:user", (string)$user);
                $this->mc->set("$this->chan:msg:$id", (string)$message);
                if ( $id >= $this->ret ) {
                        if ( !$this->mc->increment("$this->chan:min:posted") )
                                $this->mc->set("$this->chan:min:posted", 1);
                }
 
        }
 
}
 
$mc = new Memcache;
$mc->connect('localhost', 11211);
$keep_messages = 10;
$chatter_id = 1;
$chat = new mc_chat($mc, 'chat-room-id', $keep_messages);
$chat->add($chatter_id, date("r").": $chatter_id : foo");
$chat->messages(37); // messages only above id=37
$chat->messages(); // all the latest messages

Posted on : Feb 16 2009
Posted under PHP, Random Thoughts |

a dumbed down version of wpdb for sqlite

I’ve been working, gradually, on a project using an sqlite3 database (for its convenience) and found myself missing the clean elegance of wpdb… so I implemented it. It was actually really easy to do, and I figured I would throw it up here for anyone else wishing to use it. The functionality that I build this around is obtainable here: http://php-sqlite3.sourceforge.net/pmwiki/pmwiki.php (don’t freak… its in apt…)

With this I can focus on the sql, which is different enough, and not fumble over function names and such… $db = new sqlite_wpdb($dbfile, 3); var_dump($db->get_results(”SELECT * FROM `mytable` LIMIT 5″));

the code is below… and hopefully not too mangled…

< ?php
 
class sqlite_wpdb {
 
        var $version = null;
        var $db = null;
        var $result = null;
        var $error = null;
 
        function sqwpdb($file, $version=3) { 
                return $this->__construct($file, $version); 
        }
 
        function __construct($file, $version=3) {
                $function = "sqlite{$version}_open";
                if ( !function_exists($function) )
                        return false;
                if ( !file_exists($file) )
                        return false;
                if ( !$this->db = @$function($file) )
                        return false;
                $this->version = $version;
                $this->fquery = "sqlite{$this->version}_query";
                $this->ferror = "sqlite{$this->version}_error";
                $this->farray = "sqlite{$this->version}_fetch_array";
                return $this;
        }
 
        function escape($string) {
                return str_replace("'", "''", $string);
        }
 
        function query($query) {
                if ( $this->result = call_user_func($this->fquery, $this->db, $query) )
                        return $this->result;
                $this->error = call_user_func($this->ferror, $this->db);
                return false;
        }
 
        function array_to_object($array) {
                if ( ! is_array($array) )
                        return $array;
 
                $object = new stdClass();
                foreach ( $array as $idx => $val ) {
                        $object->$idx = $val;
                }
                return $object;
        }
 
        function get_results($query) {
                if ( !$this->query($query) )
                        return false;
                $rval = array();
                while ( $row = $this->array_to_object(call_user_func($this->farray, $this->result)) ) {
                        $rval[] = $row;
                }
                return $rval;
        }
 
        function get_row($query) {
                if ( ! $results = $this->get_results($query) )
                        return false;
                return array_shift($results);
        }
 
        function get_var($query) {
                return $this->get_val($query);
        }
 
        function get_val($query) {
                if ( !$row = $this->get_row($query) )
                        return false;
                $row = get_object_vars($row);
                if ( !count($row) )
                        return false;
                return array_shift($row);
        }
 
        function get_col($query) {
                if ( !$results = $this->get_results($query) )
                        return false;
                $column = array();
                foreach ( $results as $row ) {
                        $row = get_object_vars($row);
                        if ( !count($row) )
                                continue;
                        $column[] = array_shift($row);
                }
                return $column;
        }
 
}
 
?>

Posted on : Nov 18 2008
Posted under API, Business, CLI, Linux, MySQL, PHP, Personal, Random Thoughts, Software Development, Web Stuff |

Writing your own shell in php

I’ve always wanted to write my own simple shell in php.  Call me a glutin for punishment, but it seems like something that a lot of people could use to be able to do… If your web app had a command line interface for various things… like looking up stats, or users, or suspending naughty accounts, or whatever…. wouldnt that be cool and useful?  Talk about geek porn.  Anyways this this morning I got around to tinkering with the idea, and here is what i came up with… It’s rough, and empty, but its REALLY easy to extend and plug into any php application.

apokalyptik:~/phpshell$ ./shell.php

/home/apokalyptik/phpshell > hello

hi there

/home/apokalyptik/phpshell > hello world

hi there world

/home/apokalyptik/phpshell > cd ..

/home/apokalyptik/ > cd phpshell

/home/apokalyptik/phpshell > ls

shell.php

/home/apokalyptik//phpshell > exit

apokalyptik:~/phpshell$ ./shell.php

See the source here: shell.phps


Posted on : Aug 03 2008
Posted under API, Business, CLI, Linux, MySQL, PHP, Personal, Random Thoughts, Software Development, Web Stuff |

Bash Coding Convention ../

We use dirname() a lot in php to make relative paths work from multiple locations like so. The advantages are many:

require dirname( dirname( __FILE__ ) ) . '/required-file.php';
$data = file_get_contents( dirname(__FILE__).'/data/info.dat');

But in bash we often dont do the same thing, we settle for the old standby “../”. Which is a shame because unless your directory structure is set up exactly right, and you have proper permissions, and you run the command from the right spot, it doesnt work as planned. I think part of the reason is that its not obvious how to reliably get a full path to the script from inside itself. Another reason is that ../ is shorter to type and easier to remember. Finally there’s always one time scripts for which this methodology is overkill. But if you’re planning to write a script which other people will (or might) be using, I think it’s good practice to do it right. Googling for things you’d think to search for on this subject does not yeild very informative results, or incomplete (incorrect) methods… so… here’s how to do the above php in bash:

source $(dirname $(dirname $(readlink -f $0)))/required-file.sh 
data=$(cat $(dirname $(readlink -f $0))/data/info.dat)

Hope this helps someone :)

As a side note, the OSX readlink binary functions differently. You’ll want to use a package manager to install gnu coreutils, and iether use greadlink, or link greadlink to a higher precedence location on your $PATH (I have /opt/local/bin:/opt/local/sbin: at the beginning of my $PATH)


Posted on : Jun 19 2008
Posted under Bash, Business, CLI, Linux, OS X, PHP, Software Development |

This deserves some link love

Andy bogged a piece of advice that I have him which I got from Barry… and if you want to know how to get the true absolute path to the real location of the current script is from inside of it (like phps realpath and __FILE__) I suggest you check it out


Posted on : May 09 2008
Posted under Business, CLI, Linux, PHP, Software Development, Web Stuff |

Logging post data

Lets say you have a relatively complex php web application, like wordpress. You have it running under apache (which is common.) You have good control of your site via .htaccess (which is common — permalinks and all.) And something happens to your blog (e.g. someone is exploiting some unknown vulnerability to compromise your content), which you want to track down. So you want to log, for instance, HTTP POST data. Your first instinct might be to add some logging code to index.php (mine was) But there are a lot of possible places which might be directly accessed, especially in the admin. So The trick I use (and this is probably the only time I’ve ever condoned this) is to use PHPs auto_prepend_file functionality.

Make a /home/user/postlog/ directory, then a /home/user/postlog/logs/ directory (and chmod a+rw that one.) Next make a simple /home/user/postlog/postlog.php file with the following contents:

&lt;?php 
if ( isset($_POST) &amp;&amp; is_array($_POST) &amp;&amp; count($_POST) &gt; 0 ) { 
  $log_dir = dirname( __FILE__ ) . '/logs/'; 
  $log_name = "posts-" . $_SERVER['REMOTE_ADDR'] . "-" . date("Y-m-d-H") . ".log"; 
  $log_entry = gmdate('r') . "\t" . $_SERVER['REQUEST_URI'] . "\r\n" . serialize($_POST) . "\r\n\r\n"; 
  $fp=fopen( $log_dir . $log_name, 'a' ); 
  fputs($fp, $log_entry); 
  fclose($fp); } 
?&gt;

Finally add this line to the top of your .htaccess file:

php_value auto_prepend_file /home/user/postlog/postlog.php

If all went well this should start logging any request to any php file with any post data into the /home/user/postlog/logs/ direcory (with a unique log per ip per day)


Posted on : Mar 25 2008
Posted under Blogging, PHP, Web Stuff |

It’s good for the server. It’s good for the soul.

ack (http://petdance.com/ack/), love it (thanks nikolay)


colorizing php cli scripts

It’s pretty common in most scripting languages which center around the command line (bash, perl, etc) to find information on colorizing your shell script output, mainly because those languages are tied very tightly to command line use. It can be difficult, however, to find information about adding this same nice feature to a php cli script. The reason is simple: most people dont use php for cli applications; most cli programmers use something else. It’s not difficult to adapt the same techniques listed in most bash howtos (generally in the section reserved for colorizing your command prompt) for generating colored terminal output for php.

echo "\033[31mred\033[37m\r\n";
 
echo "\033[32mgreen\033[37m\r\n";
 
echo "\033[41;30mblack on red\033[40;37m\r\n";

Simple, functional, useful (even if a bit complicated.) I leave it to you to lookup a bash prompt colorization howto to hunt down your own list of escape color codes (call it homework.)

Cheers


Posted on : Nov 20 2007
Posted under CLI, Linux, PHP, Software Development |