use Perl;
All the Perl that's Practical to Extract and Report

Journal of Alias (5735)

Thursday December 15, 2005
12:17 AM

Day 1473: PPI now line-noise compatible and going into pugs!

[ #28001 ]

PPI was always intended to parse into a document object anything that looked like Perl.

But in the theoretical universe of all possible Perl programs there are a great many things that don't look like Perl at all, such as s<foo>'bar';

If this is a substitution regex, you can't use unbalanced delimiters for the second half, and both halves have to be complete before any further tokens (like that trailing single quote) can appear...
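That delimiter rule can be sketched roughly like this (a hypothetical illustration in Python rather than PPI's actual Perl, ignoring nested delimiters and escape sequences):

```python
# Hypothetical sketch of the substitution-delimiter rule described above.
# If the pattern section uses a bracketing delimiter like s<...>, the
# replacement section must open with its own delimiter; with a
# non-bracketing delimiter like s/.../.../, the same character is reused.
PAIRS = {'<': '>', '(': ')', '[': ']', '{': '}'}

def split_substitution(src):
    """Split a substitution like s<foo>'bar' or s/foo/bar/ into
    (pattern, replacement). Raises ValueError on malformed input."""
    assert src.startswith('s')
    opener = src[1]
    closer = PAIRS.get(opener, opener)
    end = src.index(closer, 2)              # end of the pattern section
    pattern = src[2:end]
    if opener in PAIRS:
        # Bracketing delimiters: the replacement opens a new delimiter pair.
        ropen = src[end + 1]
        rclose = PAIRS.get(ropen, ropen)
        rend = src.index(rclose, end + 2)
        replacement = src[end + 2:rend]
    else:
        # Non-bracketing: the closer doubles as the next opener.
        rend = src.index(closer, end + 1)
        replacement = src[end + 1:rend]
    return pattern, replacement

print(split_substitution("s<foo>'bar'"))   # ('foo', 'bar')
print(split_substitution("s/foo/bar/"))    # ('foo', 'bar')
```

A real tokenizer also has to cope with nesting, escaping, and modifiers, which is exactly where the line-noise inputs bite.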

So I was never very confident about being able to parse absolutely everything, just those that looked like Perl.

As it turns out, and as the other language communities have been telling us for years, Perl can look an awful lot like line-noise.

One of the main things that came out of my hosting Audrey (formerly Autrijus) Tang at my place for 3 days was that she wants to use PPI to parse Perl 5 into a "well formed" tree, and then convert that PDOM tree to PIL, and from there to allow execution of the Perl 5 code directly on the Perl 6 infrastructure, without having to bundle the Perl 5 interpreter.

This is a good deal harsher than anything I'd planned for PPI, so we added the 20_tokenizer_regression.t and 21_exhaustive.t test scripts. Now I can run PPI against hundreds of thousands of completely random line-noise strings.
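The idea behind that kind of exhaustive testing can be sketched like this (a hypothetical Python illustration; `parse` here is a stand-in callable, not PPI's real API):

```python
# Hypothetical sketch of random-input testing in the spirit of
# 21_exhaustive.t: throw fixed-length random character soup at a parser
# and count how often it crashes or fails to round-trip.
import random
import string

CHARS = string.printable  # printable ASCII, including whitespace

def random_document(length, rng):
    return ''.join(rng.choice(CHARS) for _ in range(length))

def fuzz(parse, iterations=1000, length=120, seed=42):
    rng = random.Random(seed)   # seeded, so failures are reproducible
    failures = 0
    for _ in range(iterations):
        src = random_document(length, rng)
        try:
            tree = parse(src)
            # Round-trip check: serializing the tree must give back the input.
            assert str(tree) == src
        except Exception:
            failures += 1
    return failures

# With an identity "parser" every document trivially round-trips:
print(fuzz(lambda s: s, iterations=100))  # 0
```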

Because Perl can already look like line noise, it turned out to be much easier than I ever expected to make PPI handle actual line noise! Only about 30 hours of coding to purge the dozen or so problems, mostly relating to broken regexes and bad ( DIE DIE DIE!!! ) here-docs.

Here's a quick look at the output of 21_exhaustive.t.

ok 1303 - String parses ok "<f\$t=Vf[?X/y\t\\`\"\@t?/9+w_1,:}\ts~~\\{\n 8W({`1:=-+'&wx_|%W+x,wW\@c\#\\]%~_<!}^0q!s?<9`9WX\$'|bX.q:m}!*;;q1c8_V-\"1'_f c/?\n]<=![\tq'}<WV9\\X\"y\"w x\nx!%0[>}f\#[%~ Vr:1{/10,)\\|:%:=yX^g%-'\tqr`&g,X)t*\$1-]\#\tmg"
ok 1304 - %PARENT is clean after destruction

Even though it might parse some of the line-noise wrong, it now at least can turn it into a tree and round-trip it ok. The crash rate has fallen to 1 per 200,000 120-character documents, and now the corner cases are so obscure I don't think you could hit them unless you were actually trying.

It also follows the parsing rule Audrey explained to me: even if the input is illegal, it should make it past the tokenizer OK, and the parser should only die at the lexing (deriving meaning) stage.

So as of 1.107 PPI is ready for full-blown use, both in Perl editors and in pugs as some form of front-end to PIL. Which brings up the interesting concept of parsing into PIL and back out to Perl 5 again, to "normalise" (in code terms, not document terms) and optimise your Perl 5 code. eep!

Which just goes to show (again) that the uses you don't expect for your code often turn out to be the most significant ones, and why it's important to have your APIs clean and neutral before you start releasing.

1473 days ago when I started what would become PPI I certainly never imagined real-time in-editor parsing and running Perl 5 code inside alternative virtual machines. I just wanted a way to morph already-valid Perl 5 inside a code-generation pipeline.

In other news, while chasing down the parser bugs I thought I might as well also try to track down my personal bugbear in the code, the one holding back CPAN::Metrics and thus sexy new CPAN-analysis things. Namely, PPI leaks somewhere... and it adds up to quite a bit when you are parsing 100,000 documents at a time. Until now I had suspected leaked circulars not being cleaned up in the %PPI::Element::PARENT hash that maintains child-parent weakrefs, but only when a document parse crashed.
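The parent-tracking scheme described here can be sketched like this (a hypothetical Python analogue; `weakref` stands in for Scalar::Util's weaken, and the class is illustrative, not PPI's actual structure):

```python
# Hypothetical sketch of the %PPI::Element::PARENT idea: one table mapping
# each child's id to a WEAK reference to its parent, so parent links never
# create circular strong references that would defeat refcounting.
import weakref

PARENT = {}   # id(child) -> weakref to the parent node

class Element:
    def __init__(self, children=()):
        self.children = list(children)
        for child in self.children:
            PARENT[id(child)] = weakref.ref(self)

    def parent(self):
        ref = PARENT.get(id(self))
        return ref() if ref is not None else None

    def __del__(self):
        # Clean up the table, so %PARENT stays "clean after destruction".
        for child in self.children:
            PARENT.pop(id(child), None)

leaf = Element()
root = Element([leaf])
assert leaf.parent() is root
del root                        # parent destroyed...
assert leaf.parent() is None    # ...and the weak link dies with it
print(len(PARENT))              # 0: the table is clean
```

If a parse crashes partway through, entries can be left behind in the table, which is exactly the kind of leak the "%PARENT is clean after destruction" test is checking for.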

The test for these problems is the "ok 1304 - %PARENT is clean after destruction" bit in the test output.

But after having the PPI parser crash several thousand times now during the exhaustive testing process, I haven't seen a single leaked circular reference.

So my suspicion has shifted to an as-yet-undiscovered leak in one of the XS-based support libraries I use, Scalar::Util, List::Util or List::MoreUtils, or possibly a leak in a closure (of which I have 1 total).

Since they already found leaks in List::Util::first and the List::MoreUtils functions that copied it, I find that explanation a lot more likely.

It seems to amount to a leak of 1.1k per document, although I don't have any hard numbers yet on whether this varies according to the size of the documents.
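One way to put a hard number on a per-document figure like that is to measure allocation growth across many iterations and divide. Here's a hypothetical Python sketch using tracemalloc; the "leaky" callable is a stand-in for a parse, not anything from PPI:

```python
# Hypothetical sketch: estimate bytes leaked per call by snapshotting
# traced allocations before and after N iterations of the suspect work.
import tracemalloc

def bytes_leaked_per_call(fn, iterations=1000):
    tracemalloc.start()
    fn()                                   # warm up any first-call setup
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        fn()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / iterations

leaky = []                                 # retained state simulates a leak
per_leaky = bytes_leaked_per_call(lambda: leaky.append('x' * 100))
per_clean = bytes_leaked_per_call(lambda: None)
print(per_leaky > per_clean)               # the leaky one grows per call
```

Running the same measurement at several document sizes would answer the question of whether the leak scales with document size or is a flat per-document cost.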

Are there any hardcore leak-ninjas out there? I'm out of my depth now and could use some help finding this bastard. If anyone wants to try, open t/21_exhaustive.t in the PPI tarball, change $ITERATIONS = 1000; to 1000000, then run the test and watch what happens...

I'm getting so frustrated, I'm now tempted to throw up another vertical-metre-of-beer award if it hangs around for any further releases...

Update: It would appear it isn't PPI leaking after all! If I run the processing but disable the actual tests, memory isn't leaking.

So rather than PPI, I blame the testing infrastructure. Something is leaking somewhere in Test::More, Test::Builder or Test::Harness...

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
    • Nope, he says it's too slow to do syntax highlighting in real-time with.

      And he'd be right. I've said that from the start, although there are some potential techniques for windowing that might help.

      PPI would be used in background mode. Parsing and analyzing in a background process and feeding event and feedback objects back into the editor.

      The speed problem won't go away until PPI gets 4 times faster than it is now. But the "can accept any old shit" problem is solved now.

      Which means we are go for background
  • I don't believe there's a serious leak anywhere in your 21_exhaustive.t (with or without Test::More), at least that's what valgrind tells me.

    I suspect the primary cause of the memory growth you're seeing is a feature (not a leak) of Test::Builder. If you peek at the implementation of ok() in Test::Builder, you'll see it stores information about each test executed so that it can do its reporting at the end. Storing this information seems to consume around 500 bytes per test.
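    That behaviour, a harness remembering every result so it can report at the end, can be sketched like this (a hypothetical Python analogue, not Test::Builder's actual internals):

```python
# Hypothetical sketch: a test harness that records every result grows
# memory linearly with the number of tests run. That growth is a feature
# (it enables the end-of-run summary), not a leak.
import sys

class Builder:
    def __init__(self):
        self.results = []        # one entry per test, kept until the end

    def ok(self, passed, name=''):
        self.results.append({'ok': bool(passed), 'name': name})
        return passed

    def summary(self):
        good = sum(1 for r in self.results if r['ok'])
        return f"{good}/{len(self.results)} tests passed"

b = Builder()
for i in range(100_000):
    b.ok(True, f'test {i}')
print(b.summary())                          # 100000/100000 tests passed
print(sys.getsizeof(b.results) > 100_000)   # the stored results add up
```

    At a few hundred bytes per test, a million-iteration run of 21_exhaustive.t would plausibly account for the growth on its own.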

    I stumbled across