As I poke errantly at my book, I at times sense that I am skirting around a Big Issue, or at least as big as they get in computing, without dragging in things like "is there such a thing as free will?". One of the Big Issues I found myself being drawn toward was the dichotomy of semantic versus presentational markup -- namely: is it a strictly binary dichotomy? Is each side just the negation of the other? Should we insist that systems be all-and-fully one or the other? And so on.
After having done some amount of reading of philosophers' and linguists' treatments of Big Issues in their own respective fields, I have come to think that taking a Big Issue and saying "I will now write a brilliant analysis of how to solve this problem, consisting of about 700 pages of dense type" is absolutely the worst thing one could possibly ever do. If the problem were tractable to one person sitting down and figuring it all out, it wouldn't likely be a Big Issue, now would it? Instead, warm up to the problem a little at a time, briefly and obliquely, and maybe what you say about it "can become better in other people's minds than they were in yours" (to quote Eno).
Below, hopefully illustrating this approach (or at least what I'll in retrospect rationalize as an attempt at that approach), is a brief extract from the draft of my book. This bit (and this bit alone) involves a Big Issue, semantic versus presentational markup. It's rather atypical of the rest of the book, which is very very much about how to do things, shot thru with code blocks.
Why Data Extraction is Hard
You may have noticed that some data extraction tasks sound simple to carry out, but turn out to be surprisingly difficult. For example, extracting "every image on this page" doesn't sound like a particularly easier or harder task than extracting "every headline-link on this page". But as we saw in the examples above, matching every image on the page is simple, whereas matching every headline-link on the page is often quite hard.

This problem is caused by the fact that HTML provides one good construct for representing images, the <img...> tag, which means an image and looks like an image. But HTML provides no construct that means "a bunch of headline text that's a link to a story", and so if designers want something that looks like such a piece of text, they're likely to just use whatever presentational elements (bold, font, etc.) get that look. The codes they use will look like a headline, but they will not mean "this is a headline".

This distinction is usually referred to as the difference between "presentational" markup, where you use codes that merely trigger the appropriate formatting, versus "semantic" markup, where you use codes that express that a block of text belongs to a class called "headline" (and which incidentally may have styles of formatting associated with it). There are some very big philosophical and practical problems with this distinction, which basically come from the fact that meaning (semantics) is always a more complex concept than appearance (presentation).
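To make the contrast concrete, here's a minimal sketch -- in Python's standard html.parser rather than the book's Perl tools, and run against a made-up page fragment -- of how matching every image takes one reliable test, while matching headline-links means guessing at presentational codes:

```python
from html.parser import HTMLParser

# Hypothetical page fragment: HTML has a real construct for images,
# but the "headlines" here are just presentational <font>/<b> soup inside links.
PAGE = """
<img src="/logo.gif" alt="logo">
<a href="/story123.html"><font size="4"><b>Soap Factory Explodes</b></font></a>
<a href="/about.html">About this site</a>
<img src="/photo.jpg" alt="photo">
"""

class ImageAndHeadlineFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []      # easy: every <img> *means* "image"
        self.headlines = []   # hard: we can only guess from formatting
        self._in_link = False
        self._saw_formatting = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # One test suffices: <img> both means and looks like an image.
            self.images.append(attrs.get("src"))
        elif tag == "a":
            self._in_link = True
            self._saw_formatting = False
            self._href = attrs.get("href")
            self._text = []
        elif self._in_link and tag in ("b", "font"):
            # Pure guesswork: bold/font codes inside a link probably
            # mean the designer was faking a headline.
            self._saw_formatting = True

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            if self._saw_formatting:
                self.headlines.append((self._href, "".join(self._text).strip()))
            self._in_link = False

p = ImageAndHeadlineFinder()
p.feed(PAGE)
print(p.images)     # every image on the page, reliably
print(p.headlines)  # only links that *look* like headlines -- a heuristic
```

The image half of this program will work on nearly any page; the headline half embodies a guess about one particular designer's habits, and will break on the next site over.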
For example, if web designers pay attention to the World-Wide Web Consortium's guidelines, they cook up a stylesheet (presumably in CSS) that represents all the important meaningful distinctions that they anticipate using in a prospective set of documents. That stylesheet will express that each kind of distinction should be formatted in a particular way -- that a "headline-link" will have these particular font features, these particular kinds of spacing, and so on. Then whenever you want to write a document with a headline link, you use that style; and whenever you want to process documents in some way that requires distinguishing their headline links from other things, you can refer to these styles. That makes for markup that's more semantic than simply using a bunch of presentational tags like <b> and <font...>, but it's not the end of the problem.
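When the designers do work this way, extraction can key on the class name instead of guessing at formatting. A minimal sketch (again Python rather than the book's Perl; the class name "headline-link" is hypothetical):

```python
from html.parser import HTMLParser

# Hypothetical fragment where the designers used a stylesheet class
# for headline links instead of raw presentational tags.
PAGE = """
<a class="headline-link" href="/story123.html">Soap Factory Explodes</a>
<a class="nav" href="/about.html">About</a>
<a class="headline-link" href="/story124.html">Mayor Re-elected</a>
"""

class ClassFinder(HTMLParser):
    """Collect hrefs of <a> tags bearing a given class name."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The class attribute can hold several space-separated names.
        if tag == "a" and self.wanted in attrs.get("class", "").split():
            self.hrefs.append(attrs.get("href"))

p = ClassFinder("headline-link")
p.feed(PAGE)
print(p.hrefs)  # ['/story123.html', '/story124.html']
```

Note that this program tests one attribute value, not a tangle of formatting codes -- but it only works because the designers made the distinction for you.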
First off, what happens if the stylesheet has to change over time? If the designers decide that the original stylesheet fails to distinguish the various kinds of sub-headlines that need different formatting, they can produce an amended stylesheet and start using it. But if your task is to extract data from a set of documents, some produced with the original stylesheet, some with the revised stylesheet, you might have to basically treat these as disparate kinds of documents, each kind requiring different extraction programming.
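One way to cope -- a sketch under the assumption that the revision merely split one class into several, with all the class names here being hypothetical -- is to maintain, per stylesheet vintage, the set of class names that count as a headline-link:

```python
import re

# Hypothetical class names: the original stylesheet had one headline
# style; the revised one split it into main- and sub-headline styles.
HEADLINE_CLASSES = {
    "v1": {"headline-link"},
    "v2": {"main-headline-link", "sub-headline-link"},
}

def headline_hrefs(html, vintage):
    """Pull hrefs of links whose class counts as a headline-link
    under the given stylesheet vintage."""
    wanted = HEADLINE_CLASSES[vintage]
    hrefs = []
    # Crude match; assumes class comes before href, as in the samples below.
    for m in re.finditer(r'<a\s+class="([^"]+)"\s+href="([^"]+)"', html):
        classes = set(m.group(1).split())
        if classes & wanted:
            hrefs.append(m.group(2))
    return hrefs

old_doc = '<a class="headline-link" href="/s1.html">One</a>'
new_doc = ('<a class="main-headline-link" href="/s2.html">Two</a>'
           '<a class="sub-headline-link" href="/s3.html">Three</a>')
print(headline_hrefs(old_doc, "v1"))  # ['/s1.html']
print(headline_hrefs(new_doc, "v2"))  # ['/s2.html', '/s3.html']
```

Even in this toy form, you're carrying around a table of per-vintage knowledge -- and somebody has to notice each revision and update the table, which is exactly the "different extraction programming per kind of document" problem in miniature.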
Second off, the designers' decisions are basically driven by what kind of distinct formatting is needed. While this is a basically sane approach, it has the disadvantage that semantic distinctions that don't affect formatting won't be made. While everyone in a newsroom would agree on the semantic difference between a newspaper article that's simply reporting events ("Five Injured, Lathered in Soap Factory Accident") and one that's simply opinion ("Andy Rooney: Who Likes Liquid Soap Anyway?"), there's likely to be no necessary formatting difference when you're linking to each from the front page of the web site. So opinion pieces will get the same headline-link style as event reporting. That's a problem if you're trying to write a program that collects just articles that report on events while ignoring articles that are opinion columns, editorials, or letters to the editor. It's tempting to shrug this off as an "AI problem", but that's often just a blanket term for anything we don't know how to do yet.
After a long time dealing with data extraction tasks, I regretfully say that I see no grand universal solution to this problem; instead, every data extraction task will require a program unto itself, with little in common with any other data-extraction program. The program that pulls headlines from BBC News's main page will likely have nothing to do with a program that pulls headlines from ABC News's main page. (And occasionally there may even be surprising differences between independently written programs that extract from the same source.)
Many kinds of programming are learned first as just a bunch of particulars, and then over time, the programmer learns to see over-arching patterns in whatever subfield he's trying to learn. This is how the programmer comes to feel that he understands the problem, and then feels confident in dealing with particular instances of that problem. But the task of pulling data out of HTML doesn't work like that; since every extraction task is an ad hoc program unto itself, the closest we come to computational enlightenment is just a familiarity with the tools (like HTML::TokeParser, or other HTML parsing modules) and fluency with the process of writing programs that use them.