Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

Ovid (2709)

Ovid
  (email not shown publicly)
http://publius-ovidius.livejournal.com/
AOL IM: ovidperl (Add Buddy, Send Message)

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Monday July 03, 2006
09:23 AM

Why People Hate Normalization

[ #30154 ]

Imagine that I have tasks and users tables. Each task has a start status, and eventually has either a cancel or finish status. Each status must be tagged with the user.

I can't put the cancel or finish status in the tasks table because since they will be NULL to start with, I can't put a foreign key constraint from, say, the task.cancel_user_id column to the users.id column since foreign key constraints typically don't allow null values. Thus, to properly normalize them, I need at least two other tables, task_cancel and task_finish, each with at minimum a task_id and a user_id (and probably a date). This assumes that tasks entered are automatically started. If not, that needs a task_start table. So a conceptually simple problem now requires a minimum of five tables, tasks, users, task_start, task_cancel and task_finish.

That gets annoying and one might be tempted to create a fake user called "none" and insert that user id in the tasks table, but that's just special casing a particular ID and overloading its meaning. That's an even worse problem as most SQL queries which fetch data from the users table will need to account for the special case.

So getting back to the five tables, what if we want to fetch all of our data at once? If a task is neither cancelled or finished, we can't do an inner join against those tables, we have to do an outer join. So once again the rules of normalization take a relatively simple idea and start to complicate it to the point where programmers want to throw in the town and skip normalization.

Mind you, these sorts of "real world" problems crop up all the time. Did I miss something here? Shouldn't simple ideas be simple? It might seem like ORMs can hide some of these problems, but they add some of their own problems (link taken from a journal entry of TeeJay's).

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.
  • I think you've overnormalized. You don't need task_start, task_finish, task_cancel junction tables. You've just traded trying to enforce UserID NOT NULL IF Status NOT NULL for enforcing that a task appears at most once in the UNION of the three task_{start,finish,cancel} tables, which will be even more awkward to code in your data dictionary (1=COUNT(*) FROM UNION...)

    You need just one task_status table that has NOT NULL FKs to TaskID and UserID, plus a State enumerated type and an autoinsert date timesta

    --
    Bill
    # I had a sig when sigs were cool
    use Sig;
    • Ah, thanks. You're correct that my normalization wasn't correct. I can now see the value of what you propose and that does make the task simpler and more maintainable. As a caveat, though, some of what you suggest implies using a real database, not a toy like MySQL (which is what we're using).

      This gives me another good rule of thumb to keep in mind. Naturally, when I see things like field_1, field_2, field_3 and so on, it's obvious that I need another table as what I'm looking at is essentially an arr

      • You have to use a real programming language too, not a toy like Perl.

        (Oh, I meant Perl version 2. What? There's a newer version? I bet it's a toy just like Perl 2.)

      • As a caveat, though, some of what you suggest implies using a real database, not a toy like MySQL (which is what we're using).

        Can you upgrade your MySQL to the latest & greatest? Yes, MySQL3 with default engine MyISAM is a toy. MySQL5.[01] with InnoDB engine (or BDB, or the coming Falcon engine) is no longer a toy -- subquery, views, transactions. (They've even got a cluster engine for partitioning.) [ Engines [mysql.com]] (I think this is the point of the Perl 2 joke.)

        Barbie is correct that normalizing is hard

        --
        Bill
        # I had a sig when sigs were cool
        use Sig;
        • Bill nails it. I stopped having any sympathy for people who refuse to upgrade to new versions of free software at approximately 7 am on Sunday morning.

          • Bill nails it. I stopped having any sympathy for people who refuse to upgrade to new versions of free software at approximately 7 am on Sunday morning.

            What happened Sunday? I don't see a journal entry for Sunday. Is there a new entry for stupid customer tricks?

            Memo to self: update home servers to Ubuntu to 6.06LTS and Debian to 3.1 now that they're really stable ...

            --
            Bill
            # I had a sig when sigs were cool
            use Sig;
            • I read yet another Nicholas Clark "This isn't fun anymore, and no one seems to want to pay someone to fix Perl 5" message on p5p. That's when I decided that companies which rely on Perl 5 but don't send bug reports or test snapshots and have all of their code in the Darkpan can fix their own bugs, or at least pay for the continued development of Perl. That is, they can do that if they care about their code.

              • "companies which rely on Perl 5 but don't send bug reports or test snapshots and have all of their code in the Darkpan can fix their own bugs, or at least pay for the continued development of Perl. That is, they can do that if they care about their code."

                That hits home.

                Support: I tried to get commercial support for Perl, and couldn't. Yes, I could pay for it, but I couldn't get anything I could recognize as support. I talked to ActiveState at LinuxWorld 2005, interested in buying support fo

                --
                Bill
                # I had a sig when sigs were cool
                use Sig;
    • Bill pointed you in the same direction I was going to. Statuses accumulate over time, and you're typically only interested in the most recent one, unless you're rolling back time for audit purposes. But if you're stepping into an area where auditing might be a concern, you have to start paying attention to the distinction between "actual" time and "record" time (a distinction that can tie your brain in little knots when you try to think through queries). Martin Fowler has a decent intro to the issues [martinfowler.com].
  • I can't put the cancel or finish status in the tasks table because since they will be NULL to start with, I can't put a foreign key constraint from, say, the task.cancel_user_id column to the users.id column since foreign key constraints typically don't allow null values.

    While current relational theory is pretty contentious when it comes to NULL, the fact is that most databases that I'm aware of, including PostgreSQL, MySQL, and SQLite, allow foreign key columns to be NULL.

    —Theory

    • I was doing this in MySQL and having NULL values in the FK column caused MySQL to die a horrible death. And then just to prove my point, I created a small test case in MySQL and it didn't die. There must have been something else going on, but now I can't reproduce it. Damn.

      • I was thinking the same as n1vux all along while reading the post – there just isn’t any need for so many separate join tables, one is perfectly sufficient given a start/cancel/finish enum column. And I’d still do it with such a join table even if NULL FKs work: for one thing, such an enum column for would be naturally extendable without having to touch the schema. For another, you don’t ever need to UPDATE any rows – apart from better scalability, having immutable rows tends t