
All the Perl that's Practical to Extract and Report



Shlomi Fish
  shlomif@iglu.org.il
http://www.shlomifish.org/
AOL IM: ShlomiFish
Yahoo! ID: shlomif2
Jabber: ShlomiFish@jabber.org

I'm a hacker of Perl, C, Shell, and occasionally other languages. Perl is my favourite language by far. I'm a member of the Israeli Perl Mongers, and contribute to and advocate open-source technologies.

Journal of Shlomi Fish (918)

Sunday September 30, 2007
10:25 AM

The MediaWiki Parser Grant

[ #34570 ]

As you may know from reading the Perl Foundation blog, I was awarded a grant to work on a MediaWiki parser. I actually knew about the grant long before it was announced, because a long time passed between the committee's decision to award it and the announcement on the foundation's blog.

I've started on some working code. However, I've made only slow progress, for several reasons:

  1. Recently, I've been relatively lethargic. Being out of a job, and without motivation to do anything, I don't seem to have the will to get things done. Most of the time I just rest, play games, and read email and RSS feeds, but don't really code.

    The prospect of getting the money in return is not enough of a motivation to work on the parser.

  2. It's an annoying task. So far, the code I've written handles only a small subset of the syntax, but it is already very complicated, monolithic, and "ugly". The MediaWiki syntax is highly irregular, and I find it hard to handle all the edge cases while emitting a well-formed stream of tokens.

  3. It's complicated. As I said, the syntax is highly irregular, which makes this a hard task. So I may feel intimidated by it, and as a result even less motivated.
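To give a flavour of the irregularity and of what emitting a token stream involves, here is a minimal, hypothetical tokenizer sketch for just bold and italics — this is not the actual grant code, only an illustration. Even this tiny subset must cope with the fact that runs of apostrophes are ambiguous: five apostrophes toggle bold and italics at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hypothetical tokenizer for ''italics'' and '''bold''' only -- a sketch,
# not the grant code. Longest apostrophe run wins; a lone apostrophe is text.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ( length $text ) {
        if ( $text =~ s/\A'{5}// ) {
            # ''''' opens (or closes) bold and italics simultaneously
            push @tokens, ['BOLD_ITALIC_TOGGLE'];
        }
        elsif ( $text =~ s/\A'{3}// ) {
            push @tokens, ['BOLD_TOGGLE'];
        }
        elsif ( $text =~ s/\A'{2}// ) {
            push @tokens, ['ITALIC_TOGGLE'];
        }
        elsif ( $text =~ s/\A([^']+|')// ) {
            # Plain text, or a stray single apostrophe
            push @tokens, [ 'TEXT', $1 ];
        }
    }
    return @tokens;
}

foreach my $tok ( tokenize("'''bold''' and ''italics''") ) {
    print join( ' ', @$tok ), "\n";
}
```

This emits seven tokens for the sample input — and note that it already has to decide arbitrarily what a run of four apostrophes means, which is exactly the kind of edge case that makes the real syntax painful.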

So to sum up: I've neglected working on it. There's still a substantial amount of code I've written, with many extensive tests, but it covers only a very small subset of the syntax. If someone wishes to help with this work, I'll gladly give them commit access to the repository. But I don't feel very motivated to work on it myself.

I've been thinking of doing something to compensate for that. I'd like to help squash Archive::Zip bugs, but I still need repository access. I'd also like to resume work on Test-Run, though I may need to re-implement it more directly on top of TAP::Parser and TAP::Harness. I've also been neglecting File::Find::Object, and could resume work on it; that was an alternative grant proposal I submitted along with the MediaWiki parser one.
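For the curious, building directly on top of TAP::Parser would look roughly like this — a sketch of its public API, not Test-Run code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use TAP::Parser;    # ships with core Perl (via Test::Harness)

# A rough sketch of consuming a TAP stream through TAP::Parser's public
# API -- the kind of layer a harness like Test-Run could be rebuilt on.
my $tap = <<'END_TAP';
1..2
ok 1 - module loads
not ok 2 - parser handles templates
END_TAP

my $parser = TAP::Parser->new( { tap => $tap } );

while ( my $result = $parser->next ) {
    next unless $result->is_test;    # skip the plan, comments, etc.
    printf "%s test %d\n",
        ( $result->is_ok ? 'PASS' : 'FAIL' ),
        $result->number;
}

my @failed = $parser->failed;
printf "planned %d, failed %d\n", $parser->tests_planned, scalar @failed;
```

A custom harness would hang its own reporting and aggregation off this loop instead of printing directly.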

I can also help resolve random bugs from rt.cpan.org or, unrelated to Perl, dedicate more time to being a Linux kernel janitor (which I'm also trying to do because I hope it will help me find a job).

In any case, I hope you're not too disappointed by my lack of willingness to work on the MediaWiki parser. I guess you can't always succeed at what you're trying to do.

  • Thanks for your work so far!

    I'm a massive fan of Wikipedia and other Wikimedia projects like Wikiquote, but I am very concerned that so much valuable data is being created in such a terrible format. I have been looking for other parsers, but so far have only found this Python MediaWiki parser [google.com], which in any case doesn't separate the parsing and output phases.

    Given MediaWiki's horrible syntax [wikipedia.org], where 'normal' wikitext, HTML and CSS can be liberally mixed together, I'm not surprised you had significant problems writing a parser.

    • Hi!

      Thanks for your comprehensive reply.

      > Thanks for your work so far!

      You're welcome, but I don't think I deserve a lot of thanks.

      > I'm a massive fan of Wikipedia and other Wikimedia projects like Wikiquote, but I am very concerned that so much valuable data is being created in such a terrible format. I have been looking for other parsers, but so far have only found this Python MediaWiki parser [google.com] which in any case doesn't separate the parsing and output phases.

      Well, there's wiki2xml [shlomifish.org] for MediaWiki (possibly based on the MW code), which converts MW code to XML. Maybe it's what you're looking for.

      > Given MediaWiki's horrible syntax [wikipedia.org] where 'normal' wikitext, HTML and CSS can be liberally mixed together, I'm not surprised you had significant problems writing a parser.

      > Still, I would like a MediaWiki parser, so I'm wondering if it's possible for the Perl Foundation to set up a fund specifically for this project. I would gladly donate to it. Perhaps other people might be more motivated by funding.

      That may be a good idea.

      > As an aside, I'm starting to wonder if Wikimedia content can really be licensed under the GNU FDL [gnu.org], as the license states that a transparent copy of the content must be provided. "Transparent" is defined as "represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors". Clearly it is possible to edit MediaWiki markup with a text editor, but there is no specification for MediaWiki markup. There are instead just several help documents [wikimedia.org] and a basic attempt [mediawiki.org] to create a spec. The FDL also adds: "A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent". I wouldn't go quite as far as saying that MediaWiki markup is deliberately obfuscated, as clearly that wouldn't be compatible with the project's aims, but it certainly doesn't lend itself to quick and easy modification.

      Well, presumably, a human can take a document written in MediaWiki syntax and convert it to something stricter manually, so it may be OK.

      > In the longer term, if Wikipedia is serious about being around in 100 years [wikipedia.org], I think they really need to produce a proper MediaWiki markup specification, then develop a standalone (read: non-hacky PHP) parser against it, and use this to convert MediaWiki markup to a better-designed wiki markup language like wikicreole [wikicreole.org].
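As a taste of what such a MediaWiki-to-Creole conversion could look like for the very simplest constructs, here's a toy sketch — not an existing tool, and it deliberately ignores the nasty cases (nesting, five-apostrophe runs, HTML mixed in), which is exactly where a real parser is needed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy MediaWiki -> WikiCreole converter for two constructs only.
# A hypothetical sketch: a real converter needs a full parser.
sub mw_to_creole {
    my ($text) = @_;
    $text =~ s{'''(.*?)'''}{**$1**}g;    # MediaWiki bold    -> Creole **bold**
    $text =~ s{''(.*?)''}{//$1//}g;      # MediaWiki italics -> Creole //italics//
    return $text;
}

print mw_to_creole("'''bold''' and ''italics''"), "\n";
# prints: **bold** and //italics//
```

Note that the bold substitution must run before the italics one, since `''` is a prefix of `'''` — a small hint of the ordering problems the full syntax creates.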

  • If worst comes to worst, you could always use PPI as a starting point.

    But that's probably a strategy of last resort, as the PPI codebase is fairly hideously complicated.
      > If worst comes to worst, you could always use PPI as a starting point.

      Isn't PPI a parser for Perl 5 code? How will this help me parse MediaWiki syntax?