All the Perl that's Practical to Extract and Report

  • Have you looked at the source for TWiki? It's regexp-based as well. I realize that regexp-based parsing and expansion is a primitive mechanism and painfully slow at times, but it's also pretty darn powerful. I'm interested in your thoughts on how to make a better parser/renderer for something like TWiki a reality. --John Cavanaugh
    • by Matts (1087) on 2003.02.20 3:35 (#17280) Journal
      The problem with TWiki's parser (and all the other ones) is that they all look something like this:
        # Placeholder patterns; each is assumed to capture the marked-up text in $1.
        $text =~ s{someformatting1}{<somehtml1>$1</somehtml1>}g;
        $text =~ s{someformatting2}{<somehtml2>$1</somehtml2>}g;
        $text =~ s{someformatting3}{<somehtml3>$1</somehtml3>}g;
        $text =~ s{someformatting4}{<somehtml4>$1</somehtml4>}g;
        $text =~ s{someformatting5}{<somehtml5>$1</somehtml5>}g;
      That's great if you want HTML, but what if you want to parse the TWiki text to feed a semantic search engine (where titles or bold text might carry more relevance)? You have to parse once to HTML and then parse the HTML, and that, I think, is a broken model.

      Parsers should be written as frontend + backend: the frontend tokenises the Wiki text and the (default) backend turns those tokens into HTML. Another backend might do something completely different.
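
      A rough sketch of that frontend/backend split might look like the following. It is not the Text::WikiFormat::SAX API: the function names (tokenise, to_html, to_index), the token shape, and the weights are all made up for illustration, and the only markup it assumes is a "---+ Title" heading line and *bold* text.

```perl
use strict;
use warnings;

# Frontend: turn wiki text into a flat list of [type, text] tokens.
# Hypothetical subset of wiki markup:
#   "---+ Title" at the start of a line => heading
#   "*words*"                           => bold
sub tokenise {
    my ($text) = @_;
    my @tokens;
    for my $line (split /\n/, $text) {
        if ($line =~ /^---\+\s*(.*)$/) {
            push @tokens, [ heading => $1 ];
            next;
        }
        # Walk the line left to right: bold run, plain run, or a stray "*".
        while ($line =~ /\G(?:\*([^*]+)\*|([^*]+)|(\*))/gc) {
            if    (defined $1) { push @tokens, [ bold => $1 ] }
            elsif (defined $2) { push @tokens, [ text => $2 ] }
            else               { push @tokens, [ text => $3 ] }
        }
        push @tokens, [ newline => '' ];
    }
    return @tokens;
}

my %html_escape = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');

sub escape_html {
    my ($s) = @_;
    $s =~ s/([&<>])/$html_escape{$1}/g;
    return $s;
}

# Default backend: tokens to HTML.
sub to_html {
    my @tokens = @_;
    my $out = '';
    for my $tok (@tokens) {
        my ($type, $text) = @$tok;
        $text = escape_html($text);
        if    ($type eq 'heading') { $out .= "<h1>$text</h1>\n" }
        elsif ($type eq 'bold')    { $out .= "<b>$text</b>" }
        elsif ($type eq 'newline') { $out .= "\n" }
        else                       { $out .= $text }
    }
    return $out;
}

# Alternative backend: word scores for a search index, where
# heading and bold words count for more (weights are arbitrary).
my %weight = (heading => 5, bold => 2, text => 1);

sub to_index {
    my @tokens = @_;
    my %score;
    for my $tok (@tokens) {
        my ($type, $text) = @$tok;
        my $w = $weight{$type} or next;    # newline tokens carry no words
        $score{lc $_} += $w for $text =~ /(\w+)/g;
    }
    return %score;
}

# Same tokens, two different consumers.
my @tokens = tokenise("---+ Parsers\nUse a *real* parser");
print to_html(@tokens);
my %score = to_index(@tokens);
print "$_ => $score{$_}\n" for sort keys %score;
```

      The point of the sketch is that to_index never sees HTML: both backends read the same token list, so adding a third output format means writing one more function over the tokens instead of another pass over rendered markup.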

      The model I'm talking about is the one used by Text::WikiFormat::SAX, but people are very afraid of proper parsers (witness how long it took people to adopt HTML::Parser in place of regexp-based parsers), because they mean you have to think of your source as data rather than text. Plus Text::WikiFormat::SAX is broken, mostly because nobody uses it, so I have no incentive to fix it ;-)