So I finally got my grubby little hands on a copy of Database In Depth by C. J. Date. I've just finished Chapter 3 and so far, I've been blown away. My irritations with the shortcomings of databases have been wrong. They're generally shortcomings with SQL and people's (mis)understanding of relational theory.
One bit which fascinates me is Date's arguments against NULLs. It fascinates me because it really hits home regarding some problems I've had but it also irritates me because I've read these arguments before and not paid much attention to them. It's time for me to change that. Now, if I reject his reasoning, I want it to be because I've thought things through, not because I wasn't paying attention.
Part Date's rejection of NULLs stems from how terms are defined: an attribute (column) in a tuple (row) in a relation (table) must contain a name and a type. Since NULLs, by definition, are a not a type, it follows that an attribute must not have a NULL value.
Of course, without an explanation of the theory, many are going to argue that Date is just using wordplay to define NULLs out of existence. To deal with this, Date comes up with the following example:
| SNO | City | | PNO | City |
| S1 | London | | P1 | NULL |
"SNO" is "Supplier Number" and "PNO" is "Part Number". If these tables are called "supplier" and "part", then fetching all part numbers is trivial:
This should be logically equivalent to the following:
WHERE City = City;
Of course, this query returns no rows because you cannot logically compare "unknown" values. This is fundamentally the simplest example one can give which demonstrates how NULL values can give erroneous results. However, the query is so ludicrous that many would be inclined to dismiss it out of hand. After all, we'd never write anything so useless, right?
Moving along, we need to write a query where we need to know where either the supplier and part cities are different, the part city is not Paris, or both. Here's one way to write that query:
SELECT S.SNO, P.PNO
FROM supplier S, part P
WHERE S.City <> P.City -- first condition
OR P.City <> 'Paris'; -- second condition
Because we cannot compare NULL values and the only "part" has an unknown city, we get no rows. But is this correct? Either that unknown city is Paris or it isn't. If it is Paris, the first condition should evaluate as true and thus we'll get a result. If it is not Paris, the second condition should evaluate as true and we should get a result. Thus, logically we should get a result even though the query returns no rows.
The mere existence of a NULL in the database means we may get erroneous results in our queries. However, the more complicated the queries, the more difficult it will be to tell when we'll get erroneous results and, I suspect, the more likely it is when we will get erroneous results if the NULL columns are used in the WHERE clause.
These points really resonate with me. While I've certainly found three-value logic attractive (I don't like perl -le 'print "yes" if (undef) < 4' printing "yes"), it does appear to be problematic in the above example. I'm beginning to think that three-value logic may have a place in procedural or OO code, but perhaps not in declarative code. More than once I've worked on a large systems where NULLs were causing problems in queries that were often very difficult to track down.
What I'm wondering is what the counter-examples are. Such a fundamental issue shouldn't be controversial unless there are clear counter-examples which demonstrate why we do need NULLs. The "that's the way we've always done it" argument doesn't hold much sway with me because I view that akin to "hitting my thumb with a hammer is better than hitting it with an axe" -- just don't hit your damned thumb, will ya?
Note: avoiding the "NULL" problem can be dealt with via the technique presented in How to Handle Missing Information without Using Nulls (pdf) by Hugh Darwen.