Since now seems to be the time for making predictions…

Let me make a few of my own “The internet during 2007” predictions.

  • 2007 will be the year of the format wars. Differing schemas of XML will battle it out for the spot as the predominant method of sharing data in 2007. Because one size will never fit all we’ll probably end up with 2 or 3 schema layouts, varying in complexity and power. Almost nobody will end up using 2 of the 3 😀
  • 2007 will usher in the client side scripting wars. Will we still be using a J for AJAX after 2007? Probably, but I bet there will be some headway in finding a more modern, less quirky-by-vendor web scripting language. Something will do for client side scripting what PHP and Ruby have done for server side scripting.
  • We’ll see real action in the database-as-a-service mindshare. I’d expect Amazon and Google to weigh in on this action. Microsoft will likely sit out, though that would be a very stupid idea. If Microsoft provided a RESTable and SOAPable database service at a decent cost, they’d soon find themselves up to their ears in just the kind of data that an internet presence should covet! Specifically, though, we’ll see work in 2 areas: easy databasing (think simple one-to-one and one-to-many relationships, like tags, terms, definitions, etc.) and relational databasing (think many-to-many, foreign keys, transactions).
  • 2007 will see more wasted bits and bytes than gas: with all the uncompressed data interchange formats, and spam, flying around, we’ll be wasting vast amounts of resources on the parts of consumable data that aren’t really consumable. We probably won’t see an answer to this in 2007, and if we do it won’t be realized for some time to come.
  • Someone will attempt to retrofit e-mail. They won’t succeed, even though everybody is pulling for them to.
  • New phones will be developed and released which have better web 2.0 support. Because after 2006 ends it’s not web2.0 anymore… it’s the web! (at least I’m hoping this is true)
  • We’ll see a large number of “old dog” programmers moving from the “hot and hip” web space to the mobile space. There’s such a generation gap between modern browsers and mobile browsers that the progression will be pretty natural for those who don’t feel like learning “new tricks”.
  • People will continue to play with new ideas for making the internet social. That won’t likely fall off, but what I suspect we’ll see are ways of making the internet more manifested: either in facilitating physical meetings or actions, or in making a web presence manifest as a physical presence.
  • More real world data will be mapped into databases available for processing next year than ever before, from surveys to spatial analysis to trendy places to hang out. We’ll be moving closer and closer to making virtual space analogous to physical space. You won’t have to walk to the corner store’s web site (where would the romance in THAT be?) but you can bet we’ll be coming closer and closer to your next door neighbor’s kid’s lemonade stand having a web presence.
  • Last but not least, privacy will slip farther and farther towards unattainable. With so many vectors, so many reasons, so many locations in which one telling piece of information is being stored online, being invisible will be nearly impossible, and staying that way doubly so. But as we slip into a mode where identity is completely fabricatable… what, then, does the theft of that identity mean?

So I’m trying to solve a rather difficult problem…

I have X number of data objects with somewhere between roughly 45,000 and 50,000 possible values associated with each object. I know… don’t ask (couldn’t tell you anyways). Now, doing this in MySQL is… well… possible… but absurd. I’m thinking of trying out the approach I’ve mused about here. It could be a really great way to manage finding commonalities across tens of thousands of objects with a total of hundreds of millions of values. Or it could be a massive time sink.

It would also put some of the things that Amazon has said about their S3 service to the test 🙂 I doubt anyone’s really stored a hundred million objects in an S3 bucket and been concerned enough with seek time to be critical about it 🙂

Or am I missing some magic bullet here? Is there a (free) DBMS I’m not thinking of which handles 50,000 columns in a table with a fast comparative lookup? (select pk from table where v42000 = (select v42000 from table where pk = referencePk))… I’d love for someone to pop in and say “HEY, STUPID, DB XXX CAN DO THAT!” 🙂 assuming DB XXX isn’t expensive 🙂

Hmm… Off to ponder…

mod_fcgid and the fcgi gem for rails with apache2 on fedora core 6

Because it’s not readily apparent when you’re concentrating on learning rails, I’m posting this here. While trying to get rails running on Fedora Core 6 (FC6) I kept running into not being able to compile the ruby fcgi gem.

What gives, right? mod_fcgid (what FC6 comes with) is supposed to be binary compatible with mod_fastcgi. And there *IS* no mod_fastcgi or fastcgi in yum! Well it turns out that

mod_fcgid and mod_fastcgi both connect to fastcgi. That bears repeating: mod_fcgid is roughly equivalent to mod_fastcgi, but neither is equivalent to fastcgi itself (which in retrospect seems obvious, as so many things do when you’re searching for answers). The fcgi gem needs the fastcgi library and headers, which neither Apache module provides.

So, on FC6: install mod_fcgid, then download and build fastcgi itself (not mod_fastcgi) from http://www.fastcgi.com/dist/ and then

gem install fcgi --source \
  http://rubyforge.planetargon.com/gems.rubyforge.org/ \
  -- \
  --with-fcgi-include=/usr/local/include \
  --with-fcgi-lib=/usr/local/lib

will work, and everything will be peachy-keen.
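
In case it helps, the fastcgi part of that looked roughly like this for me (the version number is an assumption; grab whatever is current from the dist page):

yum install mod_fcgid
cd /usr/local/src
wget http://www.fastcgi.com/dist/fcgi-2.4.0.tar.gz
tar xzf fcgi-2.4.0.tar.gz
cd fcgi-2.4.0
# the default prefix is /usr/local, which is exactly where the gem flags
# above expect to find the headers and the library
./configure && make && make install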

Since a lot of people will be searching about this in a specific scope for Fedora Core 6, and since nobody seems to directly state this issue in plain terms, here it is. Google, eat me up! 😉

EC2 Economics: People just don’t seem to get it

Should you or shouldn’t you use Amazon’s EC2 service? If you believe everything you read without bothering to send it through the hype filter, the answer is invariably YES! But in reality (where most of the rest of us live) the answer actually depends on a lot of different factors.

The theoretically most straightforward of those factors is raw cash. Let’s say you were considering purchasing a low end server from, say, Servpath. You’re looking at $1999 per year for a server with a better CPU, but less RAM and HDD space, versus (365 * 24 * $0.10 =) $876 per year for an EC2 instance. Easy, right?

AHH, but the devil is in the details. You get 0 (zero, none, nada, zilch, less than any) bandwidth with your EC2 instance. That’s extra. Let’s assume you use all 1500GB per month you get with your low end server: the price at Servpath is still $1999 for the year, while the price at EC2 goes up by (1500 * 12 * $0.20 =) $3600, for a total of $4476. Even if you pay monthly instead of yearly at Servpath it’s only ($199 * 12 =) $2388.

Of course if you use significantly less bandwidth… say… 300GB per month…

Servpath: $1999
Amazon: $876 + (300 * 12 * $0.20 =) $720 = $1596

But the real complexity comes when you worry about scaling, or handling peak versus off times. And further complexity when you start talking high availability, and load balancing. And yet further if you host static and dynamic content in different places, etc.

The bottom line is that the choice between traditional and new-age hosting is not as simple as all of these blog posts make it out to be. Before you jump on the buzzword bandwagon you should really make sure that you’re investing in what makes *sense* for your business and not in what made sense for someone else.

Right about now you’re scratching your head and wondering whether I’m telling you TO use EC2 or NOT TO use EC2. And you’re reading this whole thing wrong (if that’s what you’re wondering). What I am telling you is that EC2 is an extremely flexible and versatile tool which, in a huge number of scenarios, fills a void previously delegated to the realm of “there’s no solution other than to spend more money, or not. Period.” You use the sed utility when it makes sense, right? You wouldn’t, for example, attempt to use it to accept HTTP uploads. Of course not; that doesn’t even make sense! Well, not all situations make sense for EC2 either.

EC2 is a lot like the OSS movement. The upside is that it gives people the power of flexibility and choice. But much like the “Is linux desktop ready?” debate that’s been raging on for years, you have to deal with the downside, which is that it gives people the power of flexibility and choice. Double-edged blade indeed. But well worth the risks… if you can learn to wield it properly!

Thoroughly confused yet? That isn’t even the half of it! 🙂

So my little brother got fired…

So my little brother had a job working for a storage company. He had only held the job for about 2 weeks, and last Wed he was fired (no reason given). They failed to have his check ready that day; furthermore they failed to get it to him today. They heartily promise to have it ready for him tomorrow morning. My wife and I are going to go down there with him to pick it up. Seems the California labor code was broken in several respects.

Apparently, according to sections 201 and 227.3 of the labor code, they were legally obligated to have his last check on hand and give it to him before he walked out the door. Which they failed to do.

Furthermore, they’ve not given him his required last paycheck for a full 7 days (Nov 15 – 21). So we looked into the labor code and found this little gem: “An employer who willfully fails to pay any wages due a terminated employee (discharged or quit) in the prescribed time frame may be assessed a waiting time penalty. The waiting time penalty is an amount equal to the employee’s daily rate of pay for each day the wages remain unpaid, up to a maximum of 30 calendar days.”

“Thanks” to the great state of California for making this information so clear and concisely available here:

http://www.dir.ca.gov/dlse/FAQ_Paydays.htm

So, 2 weeks owed wages at about 30 hours per week at about $8.50 per hour?

$433 after taxes

Failing to pay the above amount for 7 days?

$476

The “new girl” breaking the law,
Costing your company 2 times what you should have had to pay,
And realizing that this particular ex employee and friends are, unfortunately, not to be the idiots you’ve been hoping they would be?

Priceless

Once more into the spam filter

I was getting the following error when running “/usr/bin/sudo -u vpopmail -H /usr/bin/spamassassin -D bayes --lint”, per the instructions:

[6932] info: rules: meta test DIGEST_MULTIPLE 
    has undefined dependency 'RAZOR2_CHECK'
[6932] info: rules: meta test DIGEST_MULTIPLE 
    has undefined dependency 'DCC_CHECK'

So, after some digging around I wandered over to /usr/share/spamassassin/20_net_tests.cf and set the “meta DIGEST_MULTIPLE” rule to “PYZOR_CHECK > 1”, which seems to have fixed the problem. (Knowing my luck, and the amount of effort I put into this, which is somewhere near zero, it’ll come back someday to bite me in the butt; thus the blog post.) I was then able to run the following

/usr/bin/sudo -u vpopmail -H \
    /usr/bin/spamassassin -D bayes --lint
/usr/bin/sudo -u vpopmail -H \
    /usr/bin/sa-learn -u vpopmail --force-expire

And added the following to root’s crontab

0 0 * * * /usr/bin/sudo -u vpopmail -H \
    /usr/bin/sa-learn -u vpopmail --force-expire 1>/dev/null
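
For reference, the rule I ended up with in 20_net_tests.cf looks roughly like this (reconstructed from memory, so treat the exact spacing as an assumption; the stock rule combines RAZOR2_CHECK, DCC_CHECK, and PYZOR_CHECK):

# show the line I changed in the rules file
grep "^meta DIGEST_MULTIPLE" /usr/share/spamassassin/20_net_tests.cf
# meta DIGEST_MULTIPLE    PYZOR_CHECK > 1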

I then reversed what I did here: http://blog.apokalyptik.com/?p=145 (well the spam part, I still have clam=no)

Here’s to hoping.

And regardless of whether it works or not, thanks to Shubes for the comment!

HA EC2 Part #2: Load Balancing the Load Balancer

Let’s first address the problem of the dynamic IP address on the load balancer, because it doesn’t matter how good your EC2-side setup is if your clients can no longer reach your load balancer after a reboot. It’s also complicated by the fact that normally you want two load balancers acting as a fail-over pair in case one of them pops for some reason. Which means that we not only need to have the load balancers register with something, somewhere; we also need a method of de-registering them if, for some reason, they fail. And since downed machines usually don’t do a good job of anything useful, we cannot count on them de-registering themselves unless we’re shutting them down manually. Which we don’t really plan on doing, now, do we?!

So here’s the long and short of the situation: some piece of it, some starting point, has to be outside the cloud. Now I know what you’re thinking: “but he just said we weren’t going to be talking about outside the cloud” but no, no, I did not say that; I said that we weren’t going to be talking about running a full proxy outside the cloud. I read that the EC2 team are working on a better solution for all of this, but for right now it’s in a roll-your-own state, so let’s roll our own, shall we?

The basic building block of any web request is DNS. When you type in www.amazonaws.com your machine automagically checks with DNS servers somewhere, somehow, and eventually gets back an IP address like this: 72.21.206.80. Now there can be multiple steps in this process; for example, when we looked up www.amazonaws.com it *actually* pointed to rewrite.amazon.com, and finally rewrite.amazon.com pointed to 72.21.206.80. And this is a process we’re going to take advantage of. But first, some discussion on the possible ramifications of doing this:

DNS (discussed above) is a basic building block of how the internet works, and as such has had a dramatic amount of code written concerning it over the years. The one type of code which may cause us grief at this stage is the caching DNS server. Normally when you look up a name you’re asking your ISP’s DNS servers to look the name up for you, and since they don’t know, they ask one of the primary name servers which server on the internet handles naming for that domain. Once they find that out they ask, a lot like this: “excuse me pdns1.ultradns.net, what is the address for rewrite.amazon.com?” to which your ISP gets a reply a lot like “The address for rewrite.amazon.com is 72.21.206.80, but that’s only valid for 5 minutes.” So for 5 minutes the DNS server is allowed to remember that information: after 4 minutes, when you ask again, it doesn’t go to the source, it simply spouts off what it found out before. After 5 minutes, though, it’s supposed to check again… But some DNS servers ignore that amount of time (called a Time To Live, or TTL) and cache the reply for however long they feel like (hours, days, weeks?!). And when that happens a client might not get the right IP address if there has been a change and a naughty caching DNS server refuses to look it up again for another week.
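
If you want to see this chain and the TTLs for yourself, dig will show you both (the numbers below are illustrative; you’ll get whatever the records happen to be set to when you run it):

# +noall +answer trims the output down to just the answer section;
# the second column is the TTL, in seconds
dig +noall +answer www.amazonaws.com
# www.amazonaws.com.   300  IN  CNAME  rewrite.amazon.com.
# rewrite.amazon.com.   60  IN  A      72.21.206.80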

Alas, there is nothing we can do to fix that. I only mention it so that people don’t come knocking down my door yelling at me about a critical design flaw when it comes to edge cases. And to caution you: when your instance is a load balancer, it’s *ONLY* a load balancer. Don’t use it to run cron jobs; I don’t care if it’s got extra space and RAM, just leave it be. Because the fewer things happening with your load balancer, the fewer chances of something going wrong, the lower the chance of ending up with a new IP address, and the lower the chance of running into the above problem, right? Right!

So when you choose a DNS service, choose one which meets the following criteria:

  • API, you need scriptable access to your DNS service
  • Low (1-2 minutes) TTL
    (so that when something changes you only have 60 or 120 seconds to wait)

Ideally you will have two load balancer images, LB1 and LB2 (for the sake of me not having to type long names every time). You can also do this dynamically (i.e. X number of load balancers off the same image), and if you’re a good enough scripter to be able to do it, then HOW to do it should be fairly obvious.

When LB1 starts up it will automatically register itself as lb1.example.com via your DNS provider’s API. It will then check for the existence of lb.example.com; if that’s not set, it will create it as pointing to itself. If lb.example.com was previously set, it will perform a check (an HTTP GET, or even a ping) to make sure that LB2 (which is currently “active” at lb.example.com) is functional. If LB2 is not functional, LB1 registers itself as lb.example.com. LB2 performs the same startup sequence, but with lb1 and lb2 switched where necessary.

Now, at regular intervals (let’s say 60 seconds), LB1 checks the health of LB2 and vice versa. If something happens to one of them, the other will, if necessary, register itself as lb.example.com.
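
To make that a little more concrete, here’s a rough sketch of what LB1’s startup script might look like, assuming a completely made-up DNS provider API that you can poke with curl (api.dns-example.com, the record=/ip= parameters, and the /health URL are all placeholders, not a real service):

#!/bin/sh
# who am I, and what name do clients actually use
MY_NAME="lb1.example.com"
SERVICE_NAME="lb.example.com"
# the EC2 metadata service will hand an instance its own public address
MY_IP=`curl -s http://169.254.169.254/latest/meta-data/public-ipv4`

# 1) always register my own name so my partner (and I) can be found
curl -s "http://api.dns-example.com/set?record=$MY_NAME&ip=$MY_IP"

# 2) if nobody holds lb.example.com, claim it; if somebody does,
#    health-check them and only take over if they don't answer
CURRENT=`dig +short $SERVICE_NAME`
if [ -z "$CURRENT" ]; then
    curl -s "http://api.dns-example.com/set?record=$SERVICE_NAME&ip=$MY_IP"
elif ! curl -s -m 5 "http://$CURRENT/health" >/dev/null; then
    curl -s "http://api.dns-example.com/set?record=$SERVICE_NAME&ip=$MY_IP"
fi

# 3) the same check-and-maybe-claim step then repeats every 60 seconds
#    (from cron, or a loop) against whichever box holds lb.example.com

The nice part is that the de-registration problem largely takes care of itself: a dead load balancer simply stops answering its health check, so its partner overwrites lb.example.com rather than waiting for the corpse to clean up after itself.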

Well, I think that basically covers the portion of how this would work outside the EC2 cloud; next I’ll deal with what happens inside the EC2 cloud. (That piece isn’t written yet… so it’ll take a bit longer than the last two.)

Fighting the good fight

Out comes my Knoppix 5 DVD, and into the machine I feed it. The local ntfs partition is mounted automagically. I mount my network samba share. Copy between the two… all my “stuff” which needs backing up before this machine can be memory-wiped in a way that would make the creators of the show Alias green with envy! Ok, so, not really… but still…

After that I’m gonna use the system restore to put the thing back together with some semblance of speed. And only then do I get to face the music: hours and hours and hours of installing updates and software, and updates to software installed after updates to the system, and so on and so forth… and an antivirus install, and an antispyware install, and an anti-windows install… or so I wish, anyways.

And of course “speed” is “faster than as slow as it could be,” because the bulk of the time is going to be spent on updates and copying my data back… the actual process of restoring the pc will take a minuscule amount of time compared to all of that…

And people wonder why I’ve switched to Mac OS X for my primary desktop!!

I really loathe windows sometimes

So, click on the wrong thing by accident and *POOF*, your windows XP machine goes totally wonky. Major haywire. SNAFU to say the least. Even with Spybot’s TeaTimer and antivirus SW installed and updated: hundreds of registry access requests, and hundreds of stupid little exe processes launching. Even with denying everything as quickly as possible via TeaTimer, something got through… and right now TeaTimer is fighting an automated war with some process which is trying to change my OS shell (presumably to install more spyware)… thousands of automatic deny notifications… I can’t get to the process doing this in any normal way because “The system administrator has disabled access to the task manager”

?

The hell I have. So I do the obligatory update of adaware and spybot, and reboot into safe mode (that’ll fix the little buggers, right?). Well, after manually setting permissions on locked portions of the registry, both apps pronounce my system clean.

wrong

Reboot, and the entire process starts all over again. I finally get the thing “clean” (quotes because it was STABLE, not clean… like saying someone who’s got hepatitis is clean because there aren’t any symptoms today). So this morning I need to test a UI in IE. And WHAM

World War 3 

I think to myself “here we go again” and I stop. I’m not going to clean this thing… the sad truth of the matter is that windows xp is such a vulnerable piece of garbage that it will literally be both easier, and take less time (!), to copy all 100GB of my stuff needing to be saved off onto a samba share, and wipe and reload… And that’s counting reinstalling all my apps and drivers, etc! Truly a sad thing to have to say about your OS (easier to just trash the thing than fix it?! GO MICROSOFT!)

WOW! I’m 17 again at a LAN gaming party performing the night’s obligatory “system nuke” (there was always one, wasn’t there?!)

So, who wants to argue TCO today?