The RDBMS Misconception That Less is More

It’s commonly held that normalization is a good thing. And it is. But like all good, or more to the point TRUE, things, there are circumstances in which the opposite holds true.

The “proper” way to lay out a database schema is as ever-changing as the tides. Rather like the US justice system, we find that things which once held true no longer do. Or that things which were once absolute do, actually, have extenuating circumstances under which they aren’t — exactly — absolute.

The proper way to lay out an RDBMS is to look at a very simple ratio: Space VS Speed. The less duplication of data in your database, the more efficient (in terms of disk space used) it is. In exchange for that disk space savings you incur the cost of additional disk seeks.

For example, if you’re keeping track of your users’ information (e.g. who’s registered and who hasn’t), you might use a table like this:

Users: |  userId | firstName | lastName | eMail | cryptPasswd |

But in all likelihood you’re going to have a lot of users with a common first and last name! Normalization to the rescue (or so it seems — at first):

Users: | userId | firstNameId | lastNameId | eMail | cryptPasswd |
FirstNames: | firstNameId | firstName |
LastNames: | lastNameId | lastName |
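
In SQL the normalized layout might look roughly like this (a sketch only; the column names come from the layouts above, but the types and sizes are assumptions, in MySQL-flavored DDL):

create table FirstNames (
    firstNameId int not null auto_increment primary key,
    firstName varchar(64) not null
);

create table LastNames (
    lastNameId int not null auto_increment primary key,
    lastName varchar(64) not null
);

create table Users (
    userId int not null auto_increment primary key,
    firstNameId int not null,    -- relates to FirstNames.firstNameId
    lastNameId int not null,     -- relates to LastNames.lastNameId
    eMail varchar(255) not null,
    cryptPasswd varchar(64) not null
);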

Now, instead of storing the string “John” a thousand times for the thousand users with the first name of John, you store the string once, and you have an integer field which relates (the R in RDBMS) to a normalized list of names.

But… the cost is that now any time you want to pull a name from the table, it requires three lookups:

select firstNameId, lastNameId from Users where userId = 1
select firstName from FirstNames where firstNameId = x
select lastName from LastNames where lastNameId = y

Whereas the same would have been done with a single query before normalization:

select firstName, lastName from Users where userId=1
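
(In practice you’d collapse those three normalized lookups into one query with joins, roughly like the sketch below, but the database is still touching all three tables to answer it:)

select firstName, lastName
from Users
join FirstNames on FirstNames.firstNameId = Users.firstNameId
join LastNames on LastNames.lastNameId = Users.lastNameId
where userId = 1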

It gets worse when you’re computing values based on information stored in your tables. For example, say you want to know the number of times a user has visited a certain page, so that you can show that count on the page they are viewing (or perhaps do some checking on that value each time they visit to prevent, for example, site mirroring). You might already be storing what people are doing on the site in a table called UserActionLog for debugging, tracking, or statistical purposes, and you use the data in that table to run reports on a, say, weekly basis.
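
(The exact layout of the log table doesn’t matter much here; something roughly like the following would do, and the only columns the queries below rely on are userId and pageId:)

UserActionLog: | logId | userId | pageId | action | actionTime |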

You COULD use something like this to gather the information about the user each time they visit a page:

select count(pageId) from UserActionLog where userId=x and pageId=y

But you will probably find that duplicating this data is a much more CPU-effective, though disk-inefficient, way of solving the problem. Storing something like this in a new table would yield a much faster result for something which will be accessed continuously:

PageVisitsByUser: | pageId | userId | totalVisits | lastVisit |
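
Keeping that table current costs one cheap write per page view. A minimal sketch, assuming MySQL and a unique key on (pageId, userId):

insert into PageVisitsByUser (pageId, userId, totalVisits, lastVisit)
values (y, x, 1, now())
on duplicate key update totalVisits = totalVisits + 1, lastVisit = now()

And the per-visit read becomes a single indexed row lookup instead of a count over the whole log:

select totalVisits from PageVisitsByUser where pageId = y and userId = x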

Now, is this always going to hold true? Well, no. The places you’ll find it doesn’t matter are the places in which you have WAY more resources than your needs require. For example, you only have 100 users, and pages which require database access rarely get hit. Applications like this don’t need optimization because the advancing state of computing hardware *IS* the optimization that they need.

However, as you process more and more volume you’ll find time and time again that a 1/1000th of a second per hit advantage is an 11.5 DAY (1,000,000 second) savings over 1 billion hits… even with only a million hits a day, that’s a 16 minute daily savings. You can see how the savings stack up when you start adding in powers of 10.
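
Spelled out:

0.001 seconds/hit × 1,000,000,000 hits = 1,000,000 seconds ≈ 11.5 days
0.001 seconds/hit × 1,000,000 hits/day = 1,000 seconds/day ≈ 16.7 minutes/day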

That’s the real challenge of the Web2.0 movement: finding the sweet spot between how much data you store and how you need to use it. What can we do with what we’ve got that people want? I’d argue that as warfare in the 20th century was defined by gunpowder, Web2.0 is a battle defined by its data schema.
