Matts (1087)

I work for MessageLabs in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Wednesday February 19, 2003
06:07 PM

Rant against wiki parser writers...

[ #10674 ]

This rant is inspired by me just looking at the source code for Text::Tiki, but I certainly don't intend to single that parser out...

Why is it that wikitext [1] parser writers foist upon us broken, crappy regexp-based parsers that break at the slightest deviation from the spec, don't treat the document as a structure, and truly believe that they are just parsing "text" and only need to produce "text"? What if I don't want to produce HTML but actually want to *do* something with that data?

Repeat after me: "s///g is not a parser!"

Some day I'm going to find time to finish off Text::WikiFormat::SAX and show people what it's all about.

Rant over.

[1] And this includes WikiText, TikiText, UseModText and all the various flavours that I've had the misfortune to look at the source code of.

  • Must admit to being an idiot in that it took me quite a while to figure out what thingy to click to reply to this.

    We talked about this on IRC and I agreed with you. I went and looked at HTML::Parser, which I think we figured out was a good model for what we want to be able to do with WikiText. It's all in XS, so I ran away. Bad move? Me no spik C.


  • by jcavanaugh (1007) on 2003.02.19 20:42 (#17260)
    Have you looked at the source for TWiki?? It's regexp-based as well. I realize that regexp-based parsing/expanding is a primitive mechanism and painfully slow at times. However, it's also pretty darn powerful. I'm interested in your thoughts on how to make a better parser/renderer for something like TWiki a reality. --John Cavanaugh
    • Powerful? Sure, in that it can do a lot. No, in that it doesn't help the software understand the structure of the data at all. Regex is a very limited language.
    • The problem with TWiki's parser (and all the other ones) is that they all look something like this:

        $text =~ s/someformatting1/<somehtml1>$1<\/somehtml1>/;
        $text =~ s/someformatting2/<somehtml2>$1<\/somehtml2>/;
        $text =~ s/someformatting3/<somehtml3>$1<\/somehtml3>/;
        $text =~ s/someformatting4/<somehtml4>$1<\/somehtml4>/;
        $text =~ s/someformatting5/<somehtml5>$1<\/somehtml5>/;

      Which is great if you want HTML, but what if y
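
A minimal sketch of the alternative this thread is arguing for, under loud assumptions: the markup is a hypothetical two-rule inline syntax (`*text*` for strong, `_text_` for emphasis, no nesting or escaping), not the syntax of TWiki or any real wiki, and the names `parse_inline`, `to_html` and `strong_phrases` are invented for illustration. It is not how Text::WikiFormat::SAX works. A regex is still used, but only as a tokenizer; the structure is reported through SAX-style callbacks, so HTML becomes just one possible consumer.

```perl
use strict;
use warnings;

# Hypothetical inline markup, assumed only for this sketch:
#   *text* => strong, _text_ => emphasis. No nesting, no escaping.
my %ELEMENT_FOR = ( '*' => 'strong', '_' => 'em' );

# Walk the text once, left to right, and report what was found
# through callbacks instead of rewriting the string in place.
sub parse_inline {
    my ($text, $handler) = @_;
    while ($text =~ /\G(?:([*_])(.+?)\1|([^*_]+)|([*_]))/gcs) {
        # Copy the captures before calling any handler code.
        my ($marker, $inner, $plain, $stray) = ($1, $2, $3, $4);
        if (defined $marker) {
            my $name = $ELEMENT_FOR{$marker};
            $handler->{start_element}->($name);
            $handler->{characters}->($inner);
            $handler->{end_element}->($name);
        }
        else {
            # Plain text, or a lone marker with no closing partner.
            $handler->{characters}->(defined $plain ? $plain : $stray);
        }
    }
    return;
}

# Consumer 1: render HTML (entity escaping omitted for brevity).
sub to_html {
    my ($text) = @_;
    my $out = '';
    parse_inline($text, {
        start_element => sub { $out .= "<$_[0]>" },
        end_element   => sub { $out .= "</$_[0]>" },
        characters    => sub { $out .= $_[0] },
    });
    return $out;
}

# Consumer 2: no HTML at all, just collect the strong phrases.
sub strong_phrases {
    my ($text) = @_;
    my (@found, $in_strong);
    parse_inline($text, {
        start_element => sub { $in_strong = 1 if $_[0] eq 'strong' },
        end_element   => sub { $in_strong = 0 if $_[0] eq 'strong' },
        characters    => sub { push @found, $_[0] if $in_strong },
    });
    return @found;
}
```

Both consumers share the same parse; only the handler changes. A chain of substitutions cannot offer the second one, because by the time it finishes the only thing left is an HTML string.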

  • Three Reasons (Score:3, Insightful)

    by chromatic (983) on 2003.02.19 21:05 (#17261)
    • Writing a proper parser is hard
    • We're not as smart as you are
    • A terrible regex implementation that gets the job done pretty well today is a heckofalot better than a beautiful, perfect event-based parser that isn't here yet

    I'm only about halfway kidding.

    • I agree with all your points but the second one ;-)

      Since Text::WikiFormat::SAX doesn't work properly I'm obviously not as smart as you think I am ;-)

      But in all seriousness this is something I hope to put right, in a similar way to trying to put right the whole XML parser nonsense. That way I can put my code where my rant is, or something like that.
  • I might almost forgive them if they implemented the same regex, but each person implementing a different regex-based "mini-language" is the worst.

    Maybe it's not too late to get gnat to include recipes on using Parse::Yapp and Parse::RecDescent in the next cookbook.