
All the Perl that's Practical to Extract and Report


masak (6289)

http://masak.org/carl

Been programming Perl since 2001. Found Perl 6 somewhere around 2004, and fell in love. Now developing November (a Perl 6 wiki), Druid (a Perl 6 board game), pls (a Perl 6 project installer), GGE (a regex engine), and Yapsi (a Perl 6 implementation). Heavy user of and irregular committer to Rakudo.

Journal of masak (6289)

Sunday, November 16, 2008
05:51 PM

November 16, 2008 -- the right man for the job

[ #37892 ]

624 years ago today, Jadwiga, a 10-year-old girl, was crowned King of Poland after two years of negotiations between her mother and the ruling lords.

Not that there's anything wrong with that. She appears to have been a just and respected monarch. Wikipedia:

As a monarch, young Jadwiga probably had little actual power. Nevertheless, she was actively engaged in her kingdom's political, diplomatic and cultural life and acted as the guarantor of Władysław's promises to reclaim Poland's lost territories. In 1387, Jadwiga led two successful military expeditions to reclaim the province of Halych in Red Ruthenia, which had been retained by Hungary in a dynastic dispute at her accession.

She died at the age of 25 from birth complications. Nowadays, she is venerated by the Roman Catholic Church as Saint Hedwig, and by others as the patron saint of queens, and of United Europe.

Been hacking on the MediaWiki parser today. Specifically, the code that finds == headings == and makes <h2>headings</h2> out of them. I've now implemented the easy test case, where the heading stands on its own line, not intermixed with ordinary paragraph text. Three tests, in which it is intermixed, remain to be satisfied.
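For illustration, the own-line case can be sketched like this in Python (the actual implementation is a Perl 6 grammar driven by PGE; the regex and function names here are mine, and real MediaWiki is laxer about unbalanced runs of '='):

```python
import re

# A "== heading ==" line becomes <h2>heading</h2>; the heading level is
# the number of '=' signs on each side (capped at 6, like HTML's h1-h6).
HEADING = re.compile(r'^(={1,6})\s*(.*?)\s*\1\s*$')

def parse_line(line):
    m = HEADING.match(line)
    if m:
        level = len(m.group(1))
        return '<h{0}>{1}</h{0}>'.format(level, m.group(2))
    return line   # not a heading: pass through unchanged

print(parse_line('== Heading =='))    # <h2>Heading</h2>
print(parse_line('=== Deeper ==='))   # <h3>Deeper</h3>
```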

Also spoke to Shlomi Fish (rindolf) today, who apparently got a grant for doing a MediaWiki parser, but got stuck. I asked him why he found the task hard, and he gave as an example the text a''b'''c''d'''e (or something equivalent), i.e. improperly nested style tokens.

I know about that problem. I have tests for it already.
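One way to start on such input, sketched here in Python rather than the Perl 6 the project uses, is purely lexical: split the text into apostrophe runs and plain text first, and only afterwards decide what each run means (the deciding is the hard part, and the one MediaWiki's own rules make hairy):

```python
import re

# Runs of two or more apostrophes are the candidate style tokens
# ('' = italic, ''' = bold in MediaWiki markup); everything else is text.
TOKEN = re.compile(r"('{2,})")

def tokenize(text):
    # re.split with a capturing group keeps the separators in the result
    return [t for t in TOKEN.split(text) if t]

print(tokenize("a''b'''c''d'''e"))
# ['a', "''", 'b', "'''", 'c', "''", 'd', "'''", 'e']
```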

In fact, a few years ago, I implemented an extremely reliable parser for a large subset of the MediaWiki syntax, though that time in Java. It had a very peculiar design goal: I never wanted it to fail with an error message, or otherwise produce no output. Additionally, it sent the resulting HTML on to a set of XML transformers, so the resulting output had to be impeccable XHTML.

Think about it. The user can type any old broken, mis-nested, intentionally sadistic markup into the text box, and it still always comes out as freshly pressed valid XHTML. That's DWIM on steroids, some sort of "the user is right even when she's wrong" mentality. That module is still being used by dozens of people every day at my former employer. Of all the software I've written in my life, that one is perhaps the one I'm still the most proud of.
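The core trick behind such a never-fail renderer can be sketched as follows (an assumed reconstruction in Python, not the original Java code): track open elements on a stack, close intervening elements when the user closes something out of order, drop closes that match nothing, and close whatever is still open at the end, so the output is always well-formed:

```python
from html import escape

def render(events):
    """events: ('open', tag), ('close', tag) or ('text', s) tuples,
    possibly mis-nested the way user markup produces them."""
    out, stack = [], []
    for kind, value in events:
        if kind == 'text':
            out.append(escape(value))
        elif kind == 'open':
            out.append('<%s>' % value)
            stack.append(value)
        elif kind == 'close' and value in stack:
            # close intervening tags first so nesting stays valid
            while stack:
                tag = stack.pop()
                out.append('</%s>' % tag)
                if tag == value:
                    break
        # a 'close' for a tag that was never opened is silently dropped
    while stack:                        # close anything left open
        out.append('</%s>' % stack.pop())
    return ''.join(out)

# mis-nested input: <b> opened, <i> opened, <b> closed before <i>
print(render([('open', 'b'), ('text', 'x'), ('open', 'i'),
              ('text', 'y'), ('close', 'b'), ('text', 'z')]))
# <b>x<i>y</i></b>z
```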

I'm not trying to brag, just showing that I have some sense of what I'm up against. The objective for this module is somewhat different: right now, I aim for bug-for-bug compatibility. If MediaWiki parses something in an incredibly stupid way, I want to do it too. I know it would be much easier, and probably more sane, to 'tidy up' the grammar while implementing it. But I don't want that; then it wouldn't be MediaWiki markup. One should be able to copy a text from a MediaWiki instance and paste it into a November instance.

Come to think of it, I might have to make some small concessions if MediaWiki generates invalid XHTML in some case. In that case, valid XHTML takes priority. But hopefully, I'll still be able to emulate the way the page looks.

I look forward to the thorny bits of the markup parser. I think PGE and I will have a great time vanquishing those windmills. ☺

I already have quite a few tests; but some still remain to be written. A few tests will surely be added when I find more corner cases. But all in all, I'm making good progress. Too bad I'm not getting a grant. ☺

First up is satisfying those mixed-heading-and-paragraph tests. That code will have to be sufficiently general, or at least generalized later, because lists, definition lists and possibly other things will behave the same way, i.e. line-orientedly. Then comes that issue with correctly handling mis-nested bold/italic. (And mis-nested bold/italic/links.) That will most likely require its very own blog post.
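The line-oriented dispatch that paragraph anticipates might look roughly like this (a Python sketch with invented names, not the November code): classify each line by its first character, then group consecutive lines of the same kind into one block:

```python
def classify(line):
    # first-character dispatch, roughly following MediaWiki conventions
    if line.startswith('='):
        return 'heading'
    if line.startswith('*') or line.startswith('#'):
        return 'list'
    if line.startswith(';') or line.startswith(':'):
        return 'deflist'
    if line.strip() == '':
        return 'blank'
    return 'paragraph'

def blocks(text):
    result = []
    for line in text.splitlines():
        kind = classify(line)
        if result and result[-1][0] == kind:
            result[-1][1].append(line)     # extend the current block
        else:
            result.append((kind, [line]))  # start a new block
    return result

print(blocks('== h ==\n* a\n* b\ntext'))
```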

P.S. I'm not usually this cocky in my blog posts, but I wrote this immediately after watching a video podcast with Randal Schwartz. In it, he said that people don't know what you're good at until you tell them. I think he's right.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Never failing with an error message is what the currently most heavily used parsers do: the parsers that feed the HTML rendering engines in our browsers. Their goal is exactly that: always render something, never bail out with a parsing error.

    • Huh - don't know why I didn't think of the browsers when I wrote that. The browser authors must have problems a hundred times thornier than I did.

      There's no question that it would be simpler from a programmer's perspective just to refuse to render if some set of grammar rules are not strictly followed. In most systems this is actually a requirement from a safety perspective, but in visual rendering like HTML or wiki markups, the temptation to forgive and forget is strong... especially when several browsers

      • No one ever thinks of the browsers. :-) The most successful computing platform ever, by a yawning margin, and paradoxically enough the most casually overlooked one by just as yawning a margin.

        As for guessing vs catching fire, the problem in case of the web is that the user who gets to see the error is the one least capable of fixing it. So I don’t see how browsers could avoid lax parsing, even in an ideal world where almost all markup was valid (as opposed to the real one, where something like 99.99%