  • Thanks for your work so far!

    I'm a massive fan of Wikipedia and other Wikimedia projects like Wikiquote, but I am very concerned that so much valuable data is being created in such a terrible format. I have been looking for other parsers, but so far have only found this Python MediaWiki parser [google.com], which in any case doesn't separate the parsing and output phases.

    Given MediaWiki's horrible syntax [wikipedia.org] where 'normal' wikitext, HTML and CSS can be liberally mixed together, I'm not surprised you had significant problems writing a parser.
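
    For what it's worth, here is a toy Perl sketch of the parse/render split I'm after (a sketch only, not a real MediaWiki parser; the node types and the handled subset are made up): one pass builds a neutral data structure, and a completely separate pass turns it into output.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Phase 1: parse a tiny wikitext subset into a neutral list of nodes.
        sub parse_wikitext {
            my ($text) = @_;
            my @nodes;
            for my $line (split /\n/, $text) {
                if ($line =~ /^==\s*(.+?)\s*==\s*$/) {
                    push @nodes, { type => 'heading', text => $1 };
                }
                elsif ($line =~ /\S/) {
                    push @nodes, { type => 'para', text => $line };
                }
            }
            return \@nodes;
        }

        # Phase 2: render the nodes; swapping this sub out gives other output formats.
        sub render_html {
            my ($nodes) = @_;
            return join '', map {
                $_->{type} eq 'heading' ? "<h2>$_->{text}</h2>\n" : "<p>$_->{text}</p>\n"
            } @$nodes;
        }

        print render_html(parse_wikitext("== Hello ==\nSome text."));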

    • Hi!

      Thanks for your comprehensive reply.

      Thanks for your work so far!

      You're welcome, but I don't think I deserve a lot of thanks.

      I'm a massive fan of Wikipedia and other Wikimedia projects like Wikiquote, but I am very concerned that so much valuable data is being created in such a terrible format. I have been looking for other parsers, but so far have only found this Python MediaWiki parser [google.com], which in any case doesn't separate the parsing and output phases.

      Well, there's wiki2xml [shlomifish.org] for MediaWiki (possibly based on the MW code), which converts MW code to XML. Maybe it's what you're looking for.
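
      I haven't checked what wiki2xml's output actually looks like, but once you have XML of any shape, pulling data out of it from Perl is easy enough. A rough sketch with XML::LibXML (the element names below are purely illustrative, not wiki2xml's real schema):

          use strict;
          use warnings;
          use XML::LibXML;

          # Illustrative input only; these are not wiki2xml's actual element names.
          my $xml = '<article><heading level="2">History</heading><paragraph>Some text.</paragraph></article>';

          my $doc = XML::LibXML->new->parse_string($xml);
          for my $node ($doc->findnodes('//heading')) {
              printf "h%s: %s\n", $node->getAttribute('level'), $node->textContent;
          }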

      Given MediaWiki's horrible syntax [wikipedia.org] where 'normal' wikitext, HTML and CSS can be liberally mixed together, I'm not surprised you had significant problems writing a parser.
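
      Agreed. Just to make the mixing concrete, something like the following (made up, but perfectly acceptable to MediaWiki) can all appear in a single article:

          == A heading ==
          Some '''wikitext bold''', some <b>raw HTML bold</b>, and a
          <div style="color: red; text-align: center;">block styled with inline CSS</div>
          all mixed together on the same page.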

      Still, I would like a MediaWiki parser, so I'm wondering if it's possible for the Perl Foundation to set up a fund specifically for this project. I would gladly donate to it. Perhaps other people might be more motivated by funding.

      That may be a good idea.

      As an aside, I'm starting to wonder if Wikimedia content can really be licensed under the GNU FDL [gnu.org], as the license states that a transparent copy of the content must be provided. "Transparent" is defined as "represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors". Clearly it is possible to edit MediaWiki markup with a text editor, but there is no specification for MediaWiki markup. There are instead just several help documents [wikimedia.org] and a basic attempt [mediawiki.org] to create a spec. The FDL also adds "A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent". I wouldn't go quite as far as saying that MediaWiki markup is deliberately obfuscated, as clearly that wouldn't be compatible with the project's aims, but it certainly doesn't lend itself to quick and easy modification.

      Well, presumably, a human can take a document written in MediaWiki syntax and manually convert it to something stricter, so it may be OK.

      In the longer term, if Wikipedia is serious about being around in 100 years [wikipedia.org], I think they really need to produce a proper MediaWiki markup specification, then develop a standalone (read: not hacky PHP) parser against it and use this to convert MediaWiki markup to a better-designed wiki markup language like WikiCreole [wikicreole.org].
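
      A real converter would need exactly the kind of proper parser being discussed here, but just to show how close some of the inline constructs are, here is a deliberately naive Perl sketch (regexes only, the function name is made up, and it ignores nesting, templates, tables and everything else that makes MediaWiki hard):

          use strict;
          use warnings;

          # Naive toy only: maps two MediaWiki inline constructs onto their Creole forms.
          sub mw_to_creole {
              my ($text) = @_;
              $text =~ s{'''(.+?)'''}{**$1**}g;   # MediaWiki bold   -> Creole **bold**
              $text =~ s{''(.+?)''}{//$1//}g;     # MediaWiki italic -> Creole //italic//
              return $text;
          }

          print mw_to_creole("This is '''bold''' and ''italic'' text.\n");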

  • If worst comes to worst, you could always use PPI as a starting point.

    But that's probably a strategy of last resort, as the PPI codebase is hideously complicated.
    • If worst comes to worst, you could always use PPI as a starting point.

      Isn't PPI a parser for Perl 5 code? How will this help me parse MediaWiki syntax?
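
      For reference, this is roughly all PPI gives you out of the box (a minimal sketch, assuming PPI is installed): it parses Perl 5 source into a document tree you can walk. Nothing in it knows anything about wikitext, so at best its design could serve as a model for a new parser.

          use strict;
          use warnings;
          use PPI;

          my $code = 'sub greet { print "hello\n" } sub part { print "bye\n" }';

          # Parse the Perl source into a PPI document tree.
          my $doc = PPI::Document->new(\$code);

          # Walk the tree and list the named subroutines it found.
          my $subs = $doc->find('PPI::Statement::Sub') || [];
          print $_->name, "\n" for @$subs;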