Just because you build it doesn’t mean they will come

Here’s a small bit of advice for all you would-be “cloud storage providers.” Just because you have a buttload of disks doesn’t mean people will be falling over themselves to use your software. If I have to spend *any* of my time worrying about your load, storage, or other internal algorithms (or unnecessary limitations for that matter) then YOU . HAVE . FAILED.

If I have to take the time to shard my data into 4096 different containers because you couldn’t be bothered to think “hey, what if a service with a lot of users that create a lot of stuff decides to use us as a store?” then you’re obviously not in it to win it (so to speak).

Give us ABSTRACTED storage. Non-abstracted storage we can do on our own, thank you.


Posted on : Mar 19 2009
Posted under API, Business, Random Thoughts, Software Development, Web Stuff

Just what you need to know to write a CouchDB reduce function

Let’s say you have the CouchDB classes (located here) all compiled together and included in your test.php script. Let’s also say that you have used the built-in web UI to create a database called “testing”. Finally, let’s say that your test.php has the following code in it, which adds a record to the db every time it is run. (I know that the data in the document serves no useful purpose, but really I just want to figure out this map/reduce thing so that I can make awesome views, so this suffices.)

require_once dirname( dirname( __FILE__ ) ) . '/includes/couchdb.php';
$couchdb = new CouchDB('testing', 'localhost', 5984);
$key = microtime();
$result = $couchdb->send(
    '/'.md5($key),
    'put',
    json_encode(
        array(
            "_id" => md5($key),
            "time" => $key,
            'md5' => md5($key),
            'sha1' => sha1($key),
            'crc' => crc32($key)
        )
    )
);
print_r($result->getBody(true));

After running the code a bunch of times you would end up with a bunch of documents which look more or less like this:

[screenshot: picture-1]

Now let’s say you want to write a view that tells you what the first character of each _id is and how many documents share that first character. This is analogous to the following in MySQL:

SELECT LEFT(md5, 1) AS `lchar`, count(md5) FROM `md5table` GROUP BY `lchar`

Your map function is easy: because you don’t have any selection criteria, we process all rows.

function(doc){ emit(doc._id,doc); }

The reduce function is where the actual programming comes in, and it seems there aren’t many well-explained examples of exactly how to do this (I just brute-forced it by trial and error):

function(key, values, rereduce) {
    var output = {};
    if ( rereduce ) {
        // key is null, and values are values returned by previous calls
        //
        // see http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
        //
        // essentially we are taking the previously reduced view, and the
        // reduced view for new records, and we are reducing those two things
        // together.  Summarizing two summaries, essentially
        for ( var i in values ) {
            // here we have multiple prebuilt output objects and we're simply combining them
            // just like below we have an array with a numeric id and an output object
            //
            // retrieve a summary
            var vals = values[i];
            for ( var key in vals ) {
                // debugging
                // log(key);
                //
                // store in or increment our new output object
                if ( output[key] == undefined )
                    output[key] = vals[key];
                else
                    output[key] = output[key] + vals[key];
            }
        }
    } else {
        // key is an array, which we don't care about, and values are the
        // values returned by the map
        //
        // see http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
        //
        // we are taking each document and processing that, reducing it down
        // to a summary object (output) for each of the rows passed
        for ( var i in values ) {
            // we have an array, values, with numeric ids and document objects
            //
            // retrieve a document
            var doc = values[i];
            // get what we want from it, the first char of the md5
            var key = doc._id.substr(0, 1);
            // debugging
            // log( key + " :: " + doc._id );
            //
            // store or increment the output object
            if ( output[key] !== undefined )
                output[key] = output[key] + 1;
            else
                output[key] = 1;
        }
    }
    // done
    return output;
}
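
The two branches can be easier to see outside of CouchDB. Here is a minimal bash sketch of the same idea, not CouchDB code: reduce_ids and rereduce_counts are hypothetical helper names, and the ids are made up. The first helper counts ids by first character (the non-rereduce branch), and the second sums several of those summaries (the rereduce branch).

```bash
#!/bin/bash
# reduce step: read ids on stdin, print one "char count" pair per line
reduce_ids() {
    cut -c1 | sort | uniq -c | awk '{ print $2, $1 }'
}

# rereduce step: read "char count" pairs from earlier reduce calls and sum them
rereduce_counts() {
    awk '{ total[$1] += $2 } END { for (c in total) print c, total[c] }' | sort
}

# two separate batches of (made-up) ids, reduced independently
chunk1=$(printf '%s\n' a1f3 a9c0 b771 | reduce_ids)
chunk2=$(printf '%s\n' a000 c123 | reduce_ids)

# summarizing two summaries
printf '%s\n%s\n' "$chunk1" "$chunk2" | rereduce_counts
```

The final command prints "a 3", "b 1" and "c 1", one per line: the same totals you would get by reducing all five ids in one pass, which is the property a reduce function has to preserve.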

And in code, using a temporary view (if you used this view all the time you would want to make it permanent, but this is about how to lay out a reduce function, nothing more), request code that looks like this

$view = array(
    'map' => 'function(doc){ emit(doc._id,doc); }',
    'reduce' => '
        function(key, values, rereduce) { 
            var output = {};
            if ( rereduce ) { 
                // key is null, and values are values returned by previous calls
                for ( var i in values ) {
                    var vals = values[i];
                    for ( var key in vals ) {
                        // log(key);
                        if ( output[key] == undefined )
                            output[key] = vals[key];
                        else
                            output[key] = output[key] + vals[key];
                    }
                }
            } else {
                // key is an array, which we do not care about, and values are the values returned by the map
                for ( var i in values ) {
                    var doc = values[i];
                    var key = doc._id.substr(0, 1);
                    // log( key + " :: " + doc._id );
                    if ( output[key] !== undefined )
                        output[key] = output[key] + 1;
                    else
                        output[key] = 1;
                }
            }
            return output;
        }
    '
    );
$result = $couchdb->send('/_temp_view', 'POST', json_encode($view) );
print_r($result->getBody(true));

would give you output that looks like this:

stdClass Object
(
    [rows] => Array
        (
            [0] => stdClass Object
                (
                    [key] => 
                    [value] => stdClass Object
                        (
                            [0] => 15
                            [1] => 17
                            [2] => 16
                            [3] => 13
                            [4] => 27
                            [5] => 18
                            [6] => 26
                            [7] => 15
                            [8] => 18
                            [9] => 21
                            [a] => 12
                            [b] => 23
                            [c] => 20
                            [d] => 27
                            [e] => 28
                            [f] => 26
                        )
 
                )
 
        )
 
)

I hope this helps somebody out.


Posted on : Feb 18 2009
Posted under API, Business, CLI, MySQL, PHP, Random Thoughts, Software Development, Web Stuff

making munin-graph take advantage of multiple cpus/cores

I do a lot of things for Automattic, and many of the things I do are quite esoteric (for a PHP developer, anyways). Perl is not my language of choice, but I’ve never balked at a challenge… just… did it have to be Perl? Anyways. We have more than a thousand machines that we track with munin, which means a TON of graphs. munin-update is efficient, taking advantage of all CPUs and finishing as fast as possible, but munin-graph started taking so long as to be useless (and munin-cgi-graph takes almost a minute to fully render a server’s day/week summary page, which is completely unacceptable when we’re trying to troubleshoot a sudden, urgent problem). So I got to dive in and make it faster…

Step 1: add in this function (which I borrowed from somewhere else)

sub afork (\@$&) {
  my ($data, $max, $code) = @_;
  my $c = 0;
  foreach my $data (@$data) {
    wait unless ++ $c <= $max;
    die "Fork failed: $!\n" unless defined (my $pid = fork);
    exit $code -> ($data) unless $pid;
  }
  1 until -1 == wait;
}

Step 2: replace this

for my $service (@$work_array) {
    process_service ($service);
}

with this

afork(@$work_array, 16, \&process_service);

I also have munin-html and munin-graph running side-by-side

# start both in the background and remember each PID (note: no leading $ when assigning)
( [ -x /usr/local/munin/lib/munin-graph ] &&
    nice /usr/local/munin/lib/munin-graph --cron "$@" 2>&1 |
    fgrep -v "*** attempt to put segment in horiz list twice" )& waitgraph=$!
( [ -x /usr/local/munin/lib/munin-html ] && nice /usr/local/munin/lib/munin-html "$@"; )& waithtml=$!
# block until both have finished
wait "$waitgraph"
wait "$waithtml"

I did several other, more complicated hacks as well, such as not generating month and year graphs via cron and letting those render on demand with munin-cgi-graph.

All said, we’re doing in under 2.5 minutes what was taking 7 or 8 minutes previously.


Posted on : Jan 23 2009
Posted under Business, CLI, Linux, Software Development

Using wait, $!, and () for threading in bash

This is a simple use of the pattern that I wrote about in my last post to wait on multiple commands in bash. In essence I have a script which runs a command (like uptime or restarting a daemon) on a whole bunch of servers (think pssh). Anyways… this is how I modified the script to run the command on multiple hosts in parallel. It works in batches: it runs, say, 10 parallel ssh commands and then waits for all 10 to complete before starting the next 10. I’m very confident that someone could easily adapt this to run at a constant concurrency level of $threads, but I didn’t need it just then so I didn’t go that far… As a side note, this is possibly the first time I’ve ever *needed* an array in a bash script… hah…

# $1 is the command to run on the remote hosts
# $2 is used for something not important for this script
# $3 is the (optional) number of concurrent connections to use
 
if [ ! "$3" == "" ]
then
    threads=$3
else
    threads=1
fi
 
cthreads=0;
stack=()
for s in $servers
  do
    if [ $cthreads -eq $threads ]; then
        for job in ${stack[@]}; do
              wait $job
        done
        stack=()
        cthreads=0
    fi
    (
        for i in $(ssh root@$s "$1" )
            do
                echo -e "$s:\t$i"
        done
    )& stack[$cthreads]=$!
    let cthreads=$cthreads+1
done
for job in ${stack[@]}; do
    wait $job
done
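
For the constant-concurrency variant mentioned above, here is a minimal sketch. It assumes bash 4.3 or newer for `wait -n`; run_limited and fake_remote are hypothetical names, with fake_remote standing in for the ssh call so the example runs anywhere.

```bash
#!/bin/bash
# run_limited MAX CMD ARG... : run CMD once per ARG, at most MAX at a time
run_limited() {
    local max=$1 cmd=$2 running=0 arg
    shift 2
    for arg in "$@"; do
        if [ "$running" -ge "$max" ]; then
            wait -n                  # bash 4.3+: block until any one job exits
            running=$((running - 1))
        fi
        "$cmd" "$arg" &              # start the next job in the background
        running=$((running + 1))
    done
    wait                             # collect whatever is still running
}

# hypothetical stand-in for: ssh root@$s "$1"
fake_remote() { sleep 0.2; echo "$1: done"; }

run_limited 2 fake_remote host1 host2 host3 host4 host5
```

Unlike the batch approach, a new job starts as soon as any slot frees up, so one slow host no longer holds up the other nine.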

Posted on : Dec 11 2008
Posted under Bash, Business, CLI, Linux, Random Thoughts, Software Development, Web Stuff

bash – collecting the return value of backgrounded processes

You know that you can run something in the background in a bash script with ( command )&, but a coworker recently wanted to run multiple commands, wait for all of them to complete, collect their return values, and decide what to do based on them… this proved much trickier. Luckily there is an answer:

#!/bin/bash
 
(sleep 3; exit 1)& p1=$!
(sleep 2; exit 2)& p2=$!
(sleep 1; exit 3)& p3=$!
 
wait "$p1"; r1=$?
wait "$p2"; r2=$?
wait "$p3"; r3=$?
 
echo "$p1:$r1 $p2:$r2 $p3:$r3"
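
The same pattern scales to any number of jobs if the PIDs go into an array. A minimal sketch of the "decide what to do" part follows; the sleep/exit subshells are placeholders for real commands, and the pids and failed variable names are my own.

```bash
#!/bin/bash
pids=()
(sleep 0.3; exit 0)& pids+=($!)
(sleep 0.2; exit 2)& pids+=($!)
(sleep 0.1; exit 0)& pids+=($!)

failed=0
for pid in "${pids[@]}"; do
    # wait PID returns that job's exit status, even if it already finished
    wait "$pid" || failed=$((failed + 1))
done

echo "$failed job(s) failed"
```

Here exactly one placeholder job exits non-zero, so the script reports one failure; a real script could exit non-zero itself or retry at that point.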

Posted on : Dec 05 2008
Posted under Bash, Business, CLI, Linux, Random Thoughts, Software Development

a dumbed down version of wpdb for sqlite

I’ve been working, gradually, on a project using an sqlite3 database (for its convenience) and found myself missing the clean elegance of wpdb… so I implemented it. It was actually really easy to do, and I figured I would throw it up here for anyone else wishing to use it. The functionality that I built this around is obtainable here: http://php-sqlite3.sourceforge.net/pmwiki/pmwiki.php (don’t freak… it’s in apt…)

With this I can focus on the SQL, which is different enough, and not fumble over function names and such…

$db = new sqlite_wpdb($dbfile, 3);
var_dump($db->get_results("SELECT * FROM `mytable` LIMIT 5"));

the code is below… and hopefully not too mangled…

<?php
 
class sqlite_wpdb {
 
        var $version = null;
        var $db = null;
        var $result = null;
        var $error = null;
 
        // PHP 4 style constructor: must carry the class name to be picked up
        function sqlite_wpdb($file, $version=3) {
                return $this->__construct($file, $version); 
        }
 
        function __construct($file, $version=3) {
                $function = "sqlite{$version}_open";
                if ( !function_exists($function) )
                        return false;
                if ( !file_exists($file) )
                        return false;
                if ( !$this->db = @$function($file) )
                        return false;
                $this->version = $version;
                $this->fquery = "sqlite{$this->version}_query";
                $this->ferror = "sqlite{$this->version}_error";
                $this->farray = "sqlite{$this->version}_fetch_array";
                return $this;
        }
 
        function escape($string) {
                return str_replace("'", "''", $string);
        }
 
        function query($query) {
                if ( $this->result = call_user_func($this->fquery, $this->db, $query) )
                        return $this->result;
                $this->error = call_user_func($this->ferror, $this->db);
                return false;
        }
 
        function array_to_object($array) {
                if ( ! is_array($array) )
                        return $array;
 
                $object = new stdClass();
                foreach ( $array as $idx => $val ) {
                        $object->$idx = $val;
                }
                return $object;
        }
 
        function get_results($query) {
                if ( !$this->query($query) )
                        return false;
                $rval = array();
                while ( $row = $this->array_to_object(call_user_func($this->farray, $this->result)) ) {
                        $rval[] = $row;
                }
                return $rval;
        }
 
        function get_row($query) {
                if ( ! $results = $this->get_results($query) )
                        return false;
                return array_shift($results);
        }
 
        function get_var($query) {
                return $this->get_val($query);
        }
 
        function get_val($query) {
                if ( !$row = $this->get_row($query) )
                        return false;
                $row = get_object_vars($row);
                if ( !count($row) )
                        return false;
                return array_shift($row);
        }
 
        function get_col($query) {
                if ( !$results = $this->get_results($query) )
                        return false;
                $column = array();
                foreach ( $results as $row ) {
                        $row = get_object_vars($row);
                        if ( !count($row) )
                                continue;
                        $column[] = array_shift($row);
                }
                return $column;
        }
 
}
 
?>

Posted on : Nov 18 2008
Posted under API, Business, CLI, Linux, MySQL, PHP, Personal, Random Thoughts, Software Development, Web Stuff

Postfix, DKIMproxy, Spamc

If you’re running any moderately busy mail server, you’re probably using SpamAssassin’s spamc/spamd to check for spam, because it’s tons more efficient than piping the mail through the spamassassin CLI. Assuming that you do, and that you plan on adding DKIMproxy to the mix to verify and sign emails, you need to put things in the right order. To save you some headache, here’s what I did:

  1. smtp|smtps => -o smtpd_proxy_filter=127.0.0.1:10035 # outgoing dkim verify port
  2. 127.0.0.1:10036 => -o content_filter=spamassassin
  3. spamassassin =>  pipe user=nobody argv=/usr/bin/spamc -f -e /usr/sbin/sendmail -oi -f ${sender} ${recipient} # this delivers to the “pickup” service
  4. pickup => -o content_filter=dksign:127.0.0.1:10037 # outgoing dkim signing port
  5. 127.0.0.1:10038 => -o content_filter= # the buck stops here

If you aren’t careful with these (which I wasn’t) you’ll end up causing an infinite loop between your filters (which I did). Thus concludes our public service announcement.


Posted on : Nov 14 2008
Posted under Business, CLI, Funny Stuff, Linux, Security, Software Development, Web Stuff

Writing your own shell in php

I’ve always wanted to write my own simple shell in PHP. Call me a glutton for punishment, but it seems like something that a lot of people could use… If your web app had a command line interface for various things, like looking up stats, or users, or suspending naughty accounts, or whatever… wouldn’t that be cool and useful? Talk about geek porn. Anyways, this morning I got around to tinkering with the idea, and here is what I came up with… It’s rough, and empty, but it’s REALLY easy to extend and plug into any PHP application.

apokalyptik:~/phpshell$ ./shell.php

/home/apokalyptik/phpshell > hello

hi there

/home/apokalyptik/phpshell > hello world

hi there world

/home/apokalyptik/phpshell > cd ..

/home/apokalyptik/ > cd phpshell

/home/apokalyptik/phpshell > ls

shell.php

/home/apokalyptik//phpshell > exit

apokalyptik:~/phpshell$ ./shell.php

See the source here: shell.phps
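
The general shape of such a shell is a read loop plus a lookup from command name to handler. The sketch below shows that shape in bash; it is not the shell.php source, and mini_shell, cmd_hello and cmd_cd are hypothetical names chosen for illustration.

```bash
#!/bin/bash
# each command is just a function named cmd_<name>; add more to extend the shell
cmd_hello() { echo "hi there${1:+ $1}"; }
cmd_cd()    { cd "${1:-$HOME}" || echo "cd: cannot enter $1"; }

mini_shell() {
    local words name
    # print the prompt on stderr, then read one line split into words
    while printf '%s > ' "$PWD" >&2; read -r -a words; do
        [ "${#words[@]}" -eq 0 ] && continue
        name=${words[0]}
        [ "$name" = "exit" ] && break
        if declare -F "cmd_$name" >/dev/null; then
            "cmd_$name" "${words[@]:1}"     # dispatch with the remaining words
        else
            echo "unknown command: $name"
        fi
    done
}
```

Calling mini_shell with no arguments starts the prompt. Typing "hello world" prints "hi there world", and plugging in a new command only means defining another cmd_ function.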


Posted on : Aug 03 2008
Posted under API, Business, CLI, Linux, MySQL, PHP, Personal, Random Thoughts, Software Development, Web Stuff

Internally Caching Longer Than Externally Caching

We use varnish for a lot of our file caching needs, and recently we figured out how to do something rather important through a combination of technologies. Imagine you have backend servers generating dynamic content based on user input. So your users do something that fits the following categories:

  • is expensive to generate dynamically, and should be served from cache
  • many requests come in for the same objects, bandwidth should be conserved
  • doesn’t change very often
  • once changed needs to take effect quickly

Now with varnish we’ve been using the Expires header for a long time with great success, but for this we were having no luck. If we set the Expires header to 3 weeks, then clients also cache the content for 3 weeks (violating requirement #4). We can kill the Expires header in varnish at vcl_deliver, but then clients don’t cache at all (#2). We can add Cache-Control, overwrite the Age (otherwise the reported Age: will be greater than max-age), and kill the Expires header in the same place, but this isn’t pretty, and seems like a cheap hack. Ideally we could rewrite the Expires header in varnish, but that doesn’t seem doable.

So what we ended up doing was header rewriting at the load balancer (nginx). Inside our location block we added the following:

proxy_hide_header Age;
proxy_hide_header Expires;
proxy_hide_header Cache-Control;
add_header Source-Age $upstream_http_Age;
expires  300s;

Now nginx sets proper Cache-Control: and Expires: headers for us, disregarding what varnish serves out. Web clients don’t check back for 5 minutes (reusing the old object), and varnish can cache until judgment day because we get wildcard invalidation.

Isn’t technology fun?!


Posted on : Jul 16 2008
Posted under Business, Linux, Random Thoughts, Web Stuff

Command line arguments in bash scripts

This is something that has always annoyed me about bash scripts… The fact that it’s difficult to run

/path/to/script.sh --foo=bar -v -n 10 blah -one='last arg'

So I decided to write up a bash function that let me easily (once the function was complete) access this type of information. And because I like sharing, here it is:

#!/bin/bash
function getopt() {
  var=""
  wantarg=0
  for (( i=1; i<=$#; i+=1 )); do
    lastvar=$var
    var=${!i}
    if [ "$var" = "" ]; then 
        continue 
    fi
    echo \ $var | grep -q -- '='
    if [ $? -eq 0 ]; then
      ## -*param=value
      var=$(echo \ $var | sed -r s/'^[ ]*-*'/''/)
      myvar=${var%=*}
      myval=${var#*=}
      eval "${myvar}"="'$myval'"
    else
      echo \ $var | grep -E -q -- '^[ ]*-'
      if [ $? -eq 0 ]; then
        # -*param$
        var=$(echo \ $var | sed -r s/'^[ ]*-*'/''/)
        eval "${var}"=1
        wantarg=1
      else
        echo \ $var | grep -E -- '^[ ]*-'
        if [ $? -eq 0 ]; then
          # the current one has a dash, so cannot be
          # the argument to the last parameter
          wantarg=0
        fi
        if [ $wantarg -eq 1 ]; then
          # parameter argument
          val=$var
          var=$lastvar
          eval "${var}"="'${val}'"
          wantarg=0
        else
          # parameter
          if [ "${!var}" = "" ]; then
            eval "${var}"=1
          fi
          wantarg=0
        fi
      fi
    fi
  done
}
 
OIFS=$IFS; IFS=$(echo -e "\n"); getopt $@; IFS=$OIFS

Now at this point (assuming the above command line parameters and script) I should have access to the following variables: $foo ("bar"), $v (1), $n (10), $blah (1), and $one ("last arg"), like so:

OIFS=$IFS; IFS=$(echo -e "\n"); getopt $@; IFS=$OIFS
 
echo -e "
foo:\t$foo
v:\t$v
n:\t$n
blah:\t$blah
one:\t$one
"

You might be curious about this line:

OIFS=$IFS; IFS=$(echo -e "\n"); getopt $@; IFS=$OIFS

IFS is the variable that tells bash how strings are separated (and mastering its use will go a long way towards enhancing your bash scripting skills). By default IFS is space, tab, and newline, which normally is OK, but in our case we don’t want "last arg" to be two separate strings, but one. I cannot put the IFS assignment inside the function because by that point bash has already split the arguments; it needs to be done at a level of the script in which $@ has not been touched yet. So I store the current IFS value in $OIFS (Old IFS) and change IFS so that spaces no longer split. (Strictly speaking, command substitution strips trailing newlines, so $(echo -e "\n") leaves IFS empty rather than set to a newline; an empty IFS turns word splitting off altogether, which is why the arguments survive intact.) After running the function we reassign IFS to what it was beforehand. This is because I don’t know what you might be doing with your IFS. There are lots of reasons you might have already assigned it to something else, and I wouldn’t want to break your flow. So we do the polite thing.
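
A quick way to see the splitting for yourself is the minimal sketch below. count_args and demo are hypothetical helpers, not part of getopt.sh, and the script assumes IFS still has its default value.

```bash
#!/bin/bash
# print how many arguments were received
count_args() { echo "$#"; }

demo() {
    count_args $@       # unquoted: bash re-splits every argument on IFS
    count_args "$@"     # quoted: each original argument stays one word
}

demo --foo=bar -one='last arg'
```

This prints 3 and then 2: unquoted, "last arg" falls apart into two words, while the quoted form keeps it whole. Calling the function as getopt "$@" would therefore be an alternative to touching IFS at all.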

And in case the above gets munged for some reason you can see the plain text version here: bash-getopt/getopt.sh

Anyways, hope this helps someone out. If not it's still here for me when *I* need it ;)


Posted on : Jul 07 2008
Posted under Bash, Business, CLI, Linux, Software Development