In a recent Perlmonks thread, I managed to tick a lot of programmers off by pointing out some problems with SQL ignoring set theory. I showed how queries that superficially appear to be logically equivalent return different results. However, they all return the correct results if the "DISTINCT" keyword is used.
Unfortunately I used an unnormalized schema as an example in order to keep things simple. Many programmers focused on this. When I write things "off the cuff", I have an annoying tendency to think that providing a simplistic example will make the problem clearer. This is often true, but people frequently focus on an obvious problem and ignore the larger ones. I sometimes do this myself. On the plus side, if I never posted anything controversial, that would suggest that either I'm following the herd or I'm not thinking enough, so on with the controversy!
The problem is that SQL returns bags of data instead of sets. Rarely do we ever want bags. Some argue that we really are getting sets instead of bags because "under the hood" the query engine is fetching all of the data and merely masking out the data we don't want. The problem with this logic is that most people merely want to use SQL, not worry about the fiddly bits inside. Do you really want to write more garbage collection code in C? I don't want to have to stop at every query and wonder about whether or not I need to throw a "DISTINCT" in there. So the counter-example I provided was a fictional story about an accountant who should be fired.
So a salesperson walks up to an accountant and says "Bob, for our customers with excellent credit ratings, what cities do they live in? I need to plan flights through those cities."
"What do you mean by 'excellent credit rating', Alice?"
"For the sake of argument, let's include everyone with a credit rating greater than 700."
"OK, let's see
... there's London. Oh, and there's London. And Paris. And London. London again, Athens, London, Paris, Moscow, London and Paris."
"That's a hell of a flight."
Needless to say, you'd be pretty irritated if your accountant did that. However, that's the equivalent of the following SQL:
SELECT city.name
FROM city, customer
WHERE city.id = customer.city_id
AND customer.credit_rating > 700;
To make that do what you probably want, you need a "SELECT DISTINCT" in there. That's what SQL should do by default, with an optional "NODISTINCT" or something similar in there.
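You can watch the accountant's recitation happen with an in-memory SQLite database. The table and column names follow the example above; the customers and cities are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        city_id INTEGER REFERENCES city(id),
        credit_rating INTEGER
    );
    INSERT INTO city VALUES (1, 'London'), (2, 'Paris'), (3, 'Athens');
    -- Three of the five high-rated customers live in London.
    INSERT INTO customer (city_id, credit_rating) VALUES
        (1, 750), (1, 810), (2, 720), (1, 790), (3, 705);
""")

# The bag result: one row per qualifying customer, so cities repeat.
bag = conn.execute("""
    SELECT city.name
    FROM city, customer
    WHERE city.id = customer.city_id
      AND customer.credit_rating > 700
""").fetchall()
print(bag)  # London shows up three times

# The set result: each city appears once.
distinct = conn.execute("""
    SELECT DISTINCT city.name
    FROM city, customer
    WHERE city.id = customer.city_id
      AND customer.credit_rating > 700
""").fetchall()
print(distinct)
```

Both queries mention the same cities; the only difference is how many times Alice has to fly through London.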
Ben Tilly offered an interesting rebuttal. He mentioned how he was once a junior programmer bitten by Microsoft Access's default behavior of returning distinct results. He was summing receipts, but Access, having DISTINCT on by default, was discarding duplicate values, not duplicate results. That's silly. It's like saying all guys named "Bob" are the same guy. The Microsoft programmers had the right idea and the wrong implementation. Further, since so many Access databases I've worked on have been poorly designed, I suspect this was a common problem.
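Ben's bug is easy to reproduce in any SQL engine by applying DISTINCT to the values being summed. Here's a sketch with a hypothetical receipts table in SQLite, where two different receipts happen to share an amount:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE receipt (id INTEGER PRIMARY KEY, amount REAL);
    -- Two distinct receipts that happen to have the same amount.
    INSERT INTO receipt (amount) VALUES (10.00), (10.00), (25.00);
""")

# Correct: every receipt row contributes to the total.
correct = conn.execute("SELECT SUM(amount) FROM receipt").fetchone()[0]

# The Access-style bug: DISTINCT applied to values, so one of the
# $10 receipts silently vanishes from the books.
broken = conn.execute("SELECT SUM(DISTINCT amount) FROM receipt").fetchone()[0]

print(correct)  # 45.0
print(broken)   # 35.0
```

The two $10 receipts are distinct tuples — distinct rows in the table — even though their amounts are equal. Collapsing them is exactly the "all guys named Bob are the same guy" mistake.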
For this to work properly, DISTINCT should discard duplicate elements from the same tuple. If you truly have duplicates from different tuples, you have a denormalized database and you need to either normalize it or adjust your query to take this into account. (For example, a customer table might list "CITY" instead of "CITY_ID").
The argument I hear against this is an all too common one. For this to work, every piece of data would have to be tagged with both its value and its tuple of origin (and also its data type, but the reasons behind that are beyond the scope of this post). This means that queries would be more memory intensive and thus slower.
Does this sound familiar? We usually call this "premature optimization". Database vendors have focused for so long on making things fast that making things correct has become an afterthought. The original relational theory proposed by Codd and evangelized by Date did not have this problem, but SQL won the query language wars and we seem to be stuck with it.
Can anyone provide a counter-argument? Specifically, can you argue that the default behavior should be to return true duplicates? Do we really want bags more often than sets? Obviously, the following query might return apparent duplicates if we had a properly implemented "DISTINCT" on by default:
SELECT total FROM order;
However, that apparent duplication would only be there because you really can have multiple orders totalling to $100.00 (yes, that's probably an unnormalized example. Imagine that I referenced an ORDER_ITEM table and summed the line items.)
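To make that concrete, here's a sketch with a hypothetical order table in SQLite (note that ORDER is a reserved word, so the table name has to be quoted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    -- "order" is a reserved word, hence the quoting.
    CREATE TABLE "order" (id INTEGER PRIMARY KEY, total REAL);
    INSERT INTO "order" (total) VALUES (100.00), (100.00), (42.50);
''')

# Projecting away the key makes two distinct orders look identical...
totals = conn.execute('SELECT total FROM "order"').fetchall()
print(totals)

# ...but the key shows they come from different tuples, so a
# tuple-aware DISTINCT would rightly keep both $100.00 rows.
rows = conn.execute('SELECT id, total FROM "order"').fetchall()
print(rows)
```

The two $100.00 rows are not duplicates in the set-theoretic sense; they only look that way because the projection discarded the part of each tuple that distinguishes them.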
Frankly, I can't imagine why anyone would want bags of results other than the obvious fact that most databases frequently fail to optimize queries with "DISTINCT" in them. But that's telling me I'm supposed to worry about the implementation details again. I just want to focus on getting the results I want, not on how I'm getting them. SQL is a declarative language, and the greatest strength of declarative languages is that they are result-oriented.