Toying with the idea of a podcast

I can't say that I like hearing my own voice… but maybe with some appropriate filters I can sound like the bad guy in a poorly made kidnapping film, and that would suit me. I really get going once I start physically talking (just ask my wife), so it might be a better medium for me than text delivery.

On the other hand, I don't have a commute at present. And I don't have an iPod. And when I'm in front of my PC I'm usually concentrating very intensely… so I never listen to anyone else's podcasts. Which would make my producing one a possibly sweet (but more likely fairly bitter) irony.

But, what the hell, it's worth a shot, right? What are your thoughts?

Taking Requests

It's the end of a long, hard day and I'm thinking about writing a blog entry. But… I usually only write when something is happening somewhere that gets me sparked, because I'm at a loss when trying to think about what would or wouldn't be interesting to other people. So I figure I'll take requests. I have maybe a reader base of 10 one-time readers and 2 full-time readers (unless I don't count, then it's probably just 1… :D), but surely you guys must have some topics that you'd like some input on? So think of this as a text-based radio station, and it's the all-request lunch hour.

Or, if no one responds, I'll just have to wait till the muse strikes…

A 90-second introduction to multidimensional arrays in PHP

If a word is a variable, then a sentence is an array. A paragraph is an array of sentences, a chapter is an array of paragraphs, a book is an array of chapters, and a library is an array of books. This would look, in PHP, like this:

  $library = array (
    'book1' => array (
      'chapter1' => array (
        'sentence1' => array (
          'word1' => "In",
          'word2' => "the",
          'word3' => "beginning",
        ),
      ),
    ),
  ); 

Therefore $library['book1']['chapter1']['sentence1']['word2'] is "the". And $library['book1']['chapter1']['sentence1'] is equal to array('word1' => "In", 'word2' => "the", 'word3' => "beginning").
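You can check those two lookups for yourself; here's the $library array from above as a small runnable script:

```php
<?php
// The nested "library" array from above.
$library = array(
  'book1' => array(
    'chapter1' => array(
      'sentence1' => array(
        'word1' => "In",
        'word2' => "the",
        'word3' => "beginning",
      ),
    ),
  ),
);

// Drilling all the way down gives a single word.
echo $library['book1']['chapter1']['sentence1']['word2'], "\n"; // the

// Stopping one level up gives the whole sentence array.
$sentence = $library['book1']['chapter1']['sentence1'];
echo implode(' ', $sentence), "\n"; // In the beginning
```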

And that's an array. Thus closes our discussion on arrays in PHP… huh? What's that? Oh… you need more? Well, sure, there are a zillion uses for arrays, and learning to think in arrays often takes running into a situation where using anything else becomes less than viable. But for the sake of argument let's pretend we're keeping simple track of deposits, withdrawals, and a balance. In this app every transaction invariably has a few pieces of information: a transaction date, a second party, an amount, and a type (deposit or withdrawal).

array (
  'date'   => $$,
  'type'   => $$,
  'party'  => $$,
  'amount' => $$,
)

Our balance sheet is simply an array of those arrays:

$sheet = array (
  '0' => array (
    'date' => 'monday',
    'type' => 'd',
    'party' => 'employer',
    'amount' => 1234.56,
  ),
  '1' => array (
    'date' => 'tuesday',
    'type' => 'w',
    'party' => 'rent',
    'amount' => 500,
  ),
  '2' => array (
    'date' => 'wednesday',
    'type' => 'w',
    'party' => 'computer store',
    'amount' => 712.59,
  ),
);

This, while fictitious, should give a good example of how a multidimensional array works. We can get a balance with a very simple loop using PHP's foreach() control structure.

$balance=0;
foreach ( $sheet as $transaction_id => $details ) {
  switch ( $details['type'] ) {
    case 'w':
      $balance=$balance - $details['amount'];
      break;
    case 'd':
      $balance=$balance + $details['amount'];
      break;
  }
  echo "[{$details['type']}]\t "
        ."{$details['party']}\t "
        ."Amount: {$details['amount']}\t "
        ."Balance: {$balance}\n";
}

That is basically everything you need to know to start working with multidimensional arrays (of COURSE there's more to learn), except for one thing. When you're faced with working with somebody else's data structures you will need to get information about how they are laying out their arrays. The slow, painful way of doing this is examining the code. The quick, happy way is to use either var_dump() or print_r(). I prefer print_r() for most jobs; just remember to wrap the output of print_r() in <pre></pre> tags if you're doing this debugging in a browser… trust me, it'll help a lot.
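For example, a minimal sketch of that debugging trick (using a one-row version of the balance sheet above; print_r()'s second argument makes it return the dump as a string instead of printing it):

```php
<?php
// A cut-down version of the balance sheet, just to have something to dump.
$sheet = array(
  array('date' => 'monday', 'type' => 'd', 'party' => 'employer', 'amount' => 1234.56),
);

// In a browser, wrap the dump in <pre> tags so the indentation survives
// the HTML rendering and you can actually read the structure.
echo "<pre>\n" . print_r($sheet, true) . "</pre>\n";
```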

Rules of thumb for high availability systems (Infrastructure)


Never be at more than ½ capacity

If you're planning a truly highly available system then you have to be aware that a serious percentage of your hardware can be forcefully cut from your organization's torso at any moment. You are not exempt from this rule on holidays, weekends, or vacations. Losing power equipment, losing networking gear, the help tripping over cables, acts of God. If you aren't prepared to have a random half of your organization's hardware disconnected at any moment then you aren't H.A. yet.


If you don't have 2 spares then you aren't yet ready

Murphy was an optimist. If you've never replaced a dying (or dead) hard drive with a new hard drive which… doesn't work (or RAM, or a CPU), then you haven't been in ops long enough. Sometimes your backup plan needs a backup plan, and you have to have it. There's no excuse for being offline, so you need not just one but two (or more) possible replacements for a point of failure.


Disaster Recovery is an ongoing process

The tricky thing about highly available systems is that you have to keep working… while you're recovering. Any time you're planning your HA setup and you work around a point of failure, stop and think a moment about what it will take to replace that failed point. If it requires bringing things down again… that's no good.


Growth planning should always be done in exponents

Never again are you to talk (or think) of doubling growth. You shall from this point forward think in squares, and cubes, and the like. In the age of information you tend to gather data at an alarming rate; don't let it overtake you!


If you depend on a backup, it’s not HA

"What's that? The primary server is offline? Do we have a spare? No, but we have a backup. How long? Oh… 36 hours… What? No, I can't speed it up." Let's face it: if you're restoring your live system from backup you've screwed the pooch. Backup is NOT high availability, but it is good practice, and when it comes down to it 36 hours is marginally better than never.


Self healing requires more thought than you’ve given it

The simple fact of life in the data center is that all services are an interlocking tapestry, and if the threads break down the tassels fall off. Self healing is not only about detection and removal; it's also about rerouting data. If the database server that you normally write to has gone down, you can detect it, but can you instantly rewire the 8 different internal services which feed into the database to write to a different server? And then back again?


DNS is harder than you think, and it matters more than ever

The one piece of infrastructure that people rely on most, and know the least about, is DNS. DNS might as well be a piece of hardware, because if your users can't type in www.blah.com to get to you, there's absolutely zero chance they'll have your IP address handy. Worse yet, DNS is the number one thing that I see administrators screw up all the time. Talking zone files with (sometimes veteran) administrators is like talking Klingon to a 2-year-old. It usually doesn't work too well.


Rules of thumb for high availability systems (Databases)


Replicating data takes longer than you think

In this brave new world of terabytes per week there's a nasty truth: replicating that much data across a large number of nodes is a headache, and it's usually not as fast as you want it to be. Instantaneous replication is nice, but generally speaking you're writing to one server and reading from X number of others. Your read servers, therefore, not only bear the same load as the write server (having to replicate everything that goes into the write server) but have to bear the additional load of supporting the read requests. A frequent mistake admins make is putting the best hardware into the write server and using lesser machines for the read servers. But if you're truly processing large amounts of data this creates a dangerous situation where, if a read server stops for a while, it might take days or weeks to catch up. Bad juju.


Less is more, and then more is more, and then less is more again

In the beginning you had data optimization. Everything pointed to something, and your masterfully crafted database schema duplicated absolutely no piece of information. Then you increased your size and volume to the point that this approach became too cumbersome to sustain your access time. You moved over to a new schema where you could select all the data you need in one statement, but data is duplicated everywhere. And finally this monolithic approach has locked you into multi-million dollar pieces of hardware, so you need to re-normalize your data so that it can be partitioned onto multiple clusters. Expect this, plan for it, and be prepared for the hard truth: this is a truly painful process!


Spend the money here, if nowhere else

If you deal in information, you absolutely have to spend real money here. This is not the place to skimp. If you do… you’ll be sorry.


Rules of thumb for high availability systems (Employees and Departments)


False positives breed contempt

If you routinely get SMS alerts for no reason at 3:00am when you're sound asleep, and it always ends up being a false alarm, there will come a time when you just opt to ignore the pager. And this time not only will wolf have been cried, the flock will truly be under attack. Always, always work to reduce false positives, and set reasonable alerting thresholds. Is something an emergency worth getting up for at 3:00am, or isn't it? Sure, a web server went down and was removed, but there are 13 others all functioning. You can sleep. But if you lost half of them… something's probably up!


No department is an island

Contrary to popular belief, it takes more than the ops department to design a truly HA system. For example, your admins aren't allowed to just start monkeying with the database schema when they feel like it. Sure, it's more highly available now, but the application can't use it any more. Just as no man is an island, neither is the ops department. You can work with them (good) or you can work against them (bad), but choose wisely.


If operations warns that the sky is going to fall, take them seriously

Let's face it: if your auto mechanic says your alternator will die very soon, you replace it. If your inspector says you've got the beginnings of a termite problem, you address it. If your weatherman tells you it might rain today, you grab your umbrella on your way out the door. So when your ops team comes into your office telling you that you have exactly 90 days until your database server becomes a very heavy, very hot, very expensive paperweight, why would you ignore that? Usually when ops says the sky is about to fall it's because they were up in the clouds fixing the slightly off-color shade of silver you were complaining about and saw the cracks forming. Ignore them at your own risk, but don't say they didn't warn you.


If you don’t spend the money on ops, nothing runs.

Without your engine your car doesn't run. Without your heart you die. And without giving the ops department the necessary resources, the application that you've invested so heavily in will not run, because there will be nothing to run it on. Or worse yet: it'll run, but something will break every third day. You cannot skimp here. Well, you can, but you don't get high availability as well as a low price tag. It's a pain in the ass… but when you bought the Saturn you had no right to expect NASCAR results.

The RDBMS Misconception That Less is More

It's commonly held that normalization is a good thing. And it is. But like all good, or more to the point TRUE, things, there are circumstances in which the opposite holds true.

The "proper" way to lay out a database schema is something as ever-changing as the tides. Rather like the US justice system, we find that things which once held true no longer do, or that things which were once absolute do, actually, have extenuating circumstances under which they aren't, exactly, absolute.

The proper way to lay out an RDBMS system is to look at a very simple ratio: space vs. speed. The less duplication of data in your database, the more efficient (in terms of disk space used) it is. In exchange for that disk space savings you incur the cost of additional disk seeks.

For example, if you're keeping track of your users' information (e.g. who's registered and who hasn't) you might use a table like this:

Users: |  userId | firstName | lastName | eMail | cryptPasswd |

But in all likelihood you're going to have a lot of users with a common first and last name! Normalization to the rescue (or so it seems, at first):

Users: | userId | firstNameId | lastNameId | eMail | cryptPasswd |
FirstNames: | firstNameId | firstName |
LastNames: | lastNameId | lastName |

Now, instead of storing the string "John" a thousand times for the thousand users with the first name of John, you store the string once, and you have an integer field which relates (the R in RDBMS) to a normalized list of names.

But… the cost is that now any time you want to pull a name from the table it requires 3 lookups:

select firstNameId,lastNameId from Users where userId = 1
select firstName from FirstNames where firstNameId=x
select lastName from LastNames where lastNameId=y

Whereas the same would have been done with the following query before normalization:

select firstName, lastName from Users where userId=1

It gets worse when you're computing values based on information stored in your tables. For example, suppose you are looking for the number of times a user has visited a certain page, so that you can show them that information on the page they are viewing (or perhaps to do some checking on that value each time they visit to prevent, for example, site mirroring). You might already be storing what people are doing on the site in a table called UserActionLog for debugging, tracking, or statistical purposes, and you use the data in that table to run reports on a, say, weekly basis.

You COULD use something like this to gather the information about the user each time they visit a page:

select count(pageId) from UserActionLog where userId=x and pageId=y

But you will probably find that duplicating this data is a much more CPU-effective, though disk-inefficient, way of solving the problem. Storing something like this in a new table would yield a much faster result for something which will be accessed continuously:

PageVisitsByUser: | pageId | userId | totalVisits | lastVisit |
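To make the trade concrete, here's a sketch of the same idea in plain PHP, with arrays standing in for the two tables (the names and structure mirror the tables above, but this is an illustration, not database code): counting visits out of the action log costs a scan on every read, while the denormalized counter is a single lookup paid for with one extra write per visit.

```php
<?php
// Stand-in for the UserActionLog table: one row per page view.
$userActionLog = array();

// Stand-in for the PageVisitsByUser table, keyed "userId:pageId".
$pageVisitsByUser = array();

function recordVisit(&$log, &$counters, $userId, $pageId) {
  // The normalized write: append a row to the log.
  $log[] = array('userId' => $userId, 'pageId' => $pageId);
  // The denormalized write: bump the pre-computed counter.
  $key = "$userId:$pageId";
  if (!isset($counters[$key])) {
    $counters[$key] = 0;
  }
  $counters[$key]++;
}

recordVisit($userActionLog, $pageVisitsByUser, 1, 42);
recordVisit($userActionLog, $pageVisitsByUser, 1, 42);
recordVisit($userActionLog, $pageVisitsByUser, 2, 42);

// The count(pageId) way: scan every log row (cheap here, painful at scale).
$slow = 0;
foreach ($userActionLog as $row) {
  if ($row['userId'] == 1 && $row['pageId'] == 42) {
    $slow++;
  }
}

// The denormalized way: one lookup.
$fast = $pageVisitsByUser["1:42"];

echo "scan: $slow, counter: $fast\n"; // both report 2 visits
```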

Now, is this always going to hold true? Well, no. The places you'll find where it doesn't matter are the places in which you have WAY more resources than your needs require. For example, you only have 100 users, and you rarely get hits on pages which require database access. Applications like this don't need optimization, because the advancing state of computing hardware *IS* the optimization that they need.

However, as you process more and more volume you'll find time and time again that a 1/1000-second-per-hit advantage is an 11.5 DAY (1,000,000 seconds) savings over 1 billion hits… and even with only a million hits a day that's a 16-minute-per-day savings. You can see how the savings stack up when you start adding in powers of 10.
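The arithmetic behind those numbers, for the skeptical:

```php
<?php
$savedPerHit = 1 / 1000; // one millisecond saved per hit, in seconds

// A billion hits at a millisecond each...
$billionHitSavings = 1000000000 * $savedPerHit; // 1,000,000 seconds
echo $billionHitSavings / 86400, " days\n";     // roughly 11.5 days

// ...and a mere million hits per day.
$millionHitSavings = 1000000 * $savedPerHit;    // 1,000 seconds
echo $millionHitSavings / 60, " minutes\n";     // roughly 16.7 minutes
```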

That's the real challenge of the Web 2.0 movement: finding the sweet spot between the amount of data you have and your need to use that data. What can we do with what we've got that people want? I'd argue that as warfare in the 20th century was defined by gunpowder, Web 2.0 is a battle defined by its data schema.

Myth: Linux doesn't need updates out of the box

I've just installed a fresh (from the DVD) Fedora Core 5 install. I checked all packages available to me in the installer (except the languages, because I'm monolingual) and "$ yum update" is now downloading 389 updates (that's almost 1GB).

So while I still think that the *nix OSes are *WAY* better than the MS OSes… the idea that Linux doesn't need as many security updates out of the box as Windows is clearly a myth.

Unless: you installed the Linux release as soon as it came out (i.e. during the initial mirroring process), *OR* you built your OS from scratch. Even then, over the course of your install's lifetime you'll be applying a *LOT* of patches (or upgrades, if you wish).

As a side note: a low number of security updates would be, in my mind, a bad thing. You *WANT* your OS people to be conscious of the fact that there are other people smarter than they are 🙂

DK

Learning a fact is easy, learning to think is hard

Oftentimes something will happen throughout my day, and it'll spark in me the urge to talk, yet again, about how to learn.

Most of the time when someone is considered "smart" it's because they know a lot of things. Having a good memory, yes, is indicative of a smart person. But it's not that uncommon to find people who can remember detail to the Nth degree who aren't very good thinkers.

And thinking is what makes a person smart.

Let's be clear here: being able to read a book and then remember all of its contents is *nice*, but that does not make a person smart. What makes a person smart is being able to apply what's in the book to varying situations. Read that last bit again. I didn't say that remembering the book's contents and being able to apply them was what makes a person smart; remembering wasn't even a part of it. I also, specifically, mentioned varying situations. Being able to remember that a source (book, article, web site, etc.) touches on a subject is quite arguably more important than being able to remember what that source says about the subject. Why? I'm glad you asked.

Very few references (and most everything is a reference these days) tell you how to think about a subject. References simply give you information about a subject. I'll refer to "knowing about a subject" as being "learned." So a reference can make you a "learned" person, but it cannot make you an "intelligent" person. To be intelligent requires application, and that's something that a reference simply cannot provide.

But to be able to look at one problem… let's say… how many apples you can buy for $11… and to reach back into your "learning" and come up with the idea that cross multiplication can tell you how many? That's "intelligent." Even if you then have to go look up how to do it again, making that connection is the key to intelligence.

I'm sure that Mark Twain would agree with me that too many folk walk around proclaiming the virtue of intelligence when they, in fact, possess only the sin of regurgitation.

The bottom line is that if you desire to be intelligent, stop trying to memorize books and start looking for the relationships around you. How the grass relates to the rain. How the wind relates to the chimes. How the time of day relates to the temperature relates to the month. Those are intelligent thoughts to have. That the wind chimes sound in C minor is a learned, though not necessarily intelligent, thought.

Amazon EC2 – “oh shit bars”

In this article Isabel equates Amazon's EC2 to a vehicle's "oh shit bars." Which, I think, is a very valid use of the service. But let's not overlook the dirty little secret of "web two point oh," which is: there's a *LOT* of data to be crunched.

There… I said it. See, people don't want you to understand that it's actually fairly hard to maintain a growing Web 2.0 app. This whole idea of social networks generates a WHOLE LOT OF DATA. And storing it is less the problem than analyzing it.

You see, it's not very good for making money to say "working with all this data is hard," even when it should be. I think VCs (who pay most 2.0 paychecks, AFAIK) like to hear "not only is it innovative… it's so easy I could train a monkey to do it." Which is a load of crap, because if it were then *everyone* would be doing it, and it would be Web-17.5_release_candidate_14, if you catch my drift.

Like I always say, a database at 5GB doesn't behave the same way once it reaches 500GB, and at 5TB it's another beast yet again. People talk about Amazon's EC2 as a utility computing platform (which it is) and then describe it in terms of web hosting. I think that misses the boat entirely. Yes, you could use it for your web hosting needs, and I'm sure it will be good at that. But that isn't the gap that EC2 really fills.

Right now people are saying "Look, it's a new tree!", and later on they'll be wondering "where did this blasted forest come from?", and further down the road still our children will be laughing at us, saying "It's the Amazon, you old fogeys… of course it's got trees!"