Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

use Perl Log In

Log In

[ Create a new account ]

ziggy (25)

ziggy
  (email not shown publicly)
AOL IM: ziggyatpanix (Add Buddy, Send Message)

Journal of ziggy (25)

Thursday December 09, 2004
11:53 AM

Java vs. Perl (with a side of tag soup)

[ #22233 ]
I'm looking into tagsoup at the moment, writing a module that uses its HTML normalization algorithm on top of HTML::Parser. The original code is written in Java, not exactly clear, and full of state transitions that are somewhat obscure. For example, the list of open tags are maintained as a stack (fair enough), and when it's time to process a close tag, the first thing to know is whether we should process it or ignore it. Here's the source, in Java:

//...

Element sp;
for (sp = theStack; sp != null; sp = sp.next()) {
    if (sp.name().equals(name)) break;
}
if (sp == null) return;    // unknown etag, do nothing
//....

Here is the same operation, expressed more naturally in Perl:

## This end tag closes a tag that isn't open.  Ignore it.
return unless grep m/$name/, @$stack;

Granted, these are philosophical and sylistic differences. My quibbles could be with the language, the generally accepted idioms for Java programming, or with the author of this code. (Actually, I don't have any issues with the author; just being complete and highlighting the possibilities. ;-)

Regardless of my personal differences, this example highlights the benefit of writing clear, concise code. Using the C-style for loop obscures the intent by micromanaging the problem and focusing on the mechanics. The single statement Perl equivalent nearly hides the mechanics while emphasizing the intent (return unless you find something).

You may look at this and say So what? It's just a different style preference. In the small, you're somewhat correct. However, I've been looking at this code for some time now, and these little differences accumulate to complexify a program from something that should be easy, and turning it into something hard to express and hard to understand.

Not the best way to write something that people should understand, and only incidentally for a computer to execute...

 

PS: Here's the tagsoup algorithm in a nutshell:

  • Use a very liberal scanner to read the input character by character; use a schema driven lexer to find tokens in broken markup
  • Include a schema for HTML that describes what a tag can contain, and a nominal parent for every tag
  • Convert the tokens from the lexer into SAX events, adding events as necessary to convert the input into well-formed output; ignore unnecessary tokens when they appear (non-HTML elements, extra close tags)

The real power is in the last bit: the algorithm to add extra events to produce well-formed output. Here's how that works:

  1. Maintain a stack of open tags
  2. For each open tag, look up the stack to find a tag that can contain this tag
  3. If this tag cannot be placed under any tags in the stack of open tags, look for a tag in the stack that contain this tag's nominal parent. Remember this tag, and repeat the process with this tag's parent. Repeat as necessary.
  4. Knowing which open tag will serve as the parent to this sequence of tags to open, unwind the stack (if necessary) to close all open tags until the parent tag is at the top of the tag stack.
  5. Open each tag in the list of tags to open (from step 3). After each tag is opened, reopen any of the preserved tags (from step 4) that this tag can contain.
  6. Close tags as they are found; if close tags are missing, issue the necessary events to produce a well-formed document. Ignore unnecessary close tags.
  7. (Optionally) Ignore all non-HTML tag events.

The algorithm is somewhat simple, can always succeed (or just ignore a tag in woefully broken HTML), and work in a streaming fashion. Unfortunately, this streaming algorithm has no lookahead, so it will deform your input file to create something vaguely approximating your input in a well formed fashion, even if it does not render correctly. For example, a sequence of table, tr, form, td will result a sequence of table, tr, /tr, /table, form, table, tr, td -- not what you started with, but that was broken anyway.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.
  • In one of his talks (Enterprise Perl?) James Duncan discussed readable code and gives the excellent advice that every loop should be a method. I find myself doing this more with Java than Perl, probably because I hit the mental ceiling for method length with Java's verbosity. So I'd typically translate your example to a method like:

    skipUnopenedTags(theStack);

    One side-effect of Java's not having unless is that I tend to write both 'isSomething' and 'isNotSomething' for readability, especially because an

    • I think that might be the case sometimes. I also think that there are people that just don't care. I've asked someone at work why they wrote this:
      my$x=$test>5?'foo':'bar';

      vs

      my $x = $test > 5 ? 'foo' : 'bar';

      And they literally didn't see the difference. It's quite depressing.

      -Dom

  • In a similar application, I keep a hash of open tags along with a stack. This is a simple ++/-- thing, but it makes the given snippet even more readable:
    return unless $tags{$tag};
    (literal paste)
  • Can you please explain, maybe show an example, of what you mean with step 3? It's just a mystery to me now.
    If this tag cannot be placed under any tags in the stack of open tags, look for a tag in the stack that contain this tag's nominal parent. Remember this tag, and repeat the process with this tag's parent.

    Do you mean to say that if you find a <LI> tag without an enclosing <UL> or <OL> , you insert one of those?

    • Right.

      If <LI> cannot fit within the current open tag (such as <A>), walk up the stack of open tags until you can find a spot where you can open it. If, for example, you find an open <OL>, close all open tags until the <OL> is at the top of the stack. (Presumably, that means you forgot a </LI> somewhere, since they cannot nest.)

      If there is no spot in the stack where you can deform it to open a <LI> tag, try to open the sequence <OL> <LI>, and find the cl

  • The single statement Perl equivalent nearly hides the mechanics while emphasizing the intent (return unless you find something).

    In Perl 5.10 we get the smartmatch, so the mechanics will be completely hidden:

    return unless $name ~~ @$stack;

    In any case, did you ever finish porting the thing to Perl? If not, can I still have the code you have so far?