TorgoX (1933)
  sburkeNO@SPAMcpan.org
http://search.cpan.org/~sburke/

"He is as beautiful as the retractility of the claws of birds of prey [...] and above all, as the chance meeting on a dissecting table of a sewing machine and an umbrella!" -- Lautréamont

Journal of TorgoX (1933)

Friday August 16, 2002
08:50 PM

HTML-Tree, and checksumming

[ #7137 ]
Dear Log,

A little whoopsie of mine in HTML::TreeBuilder basically broke version 3.12, and yet didn't cause any of the HTML-Tree tests to fail. Michael Koehne is a superstar because he spotted this and told me.

So I rushed out a new version today (3.13), with more and smarter tests that should stop things like this from happening again.

Most (but not all) of the new tests each take two bits of HTML and make sure that they parse to isomorphic parse trees. Given a wrapper function same, the tests are mostly like ok(same( '<ul><li>x<li>y</ul>after' => '<ul><li>x</li><li>y</li></ul>after' ));.
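A minimal sketch of what such a wrapper might look like. This is not the actual same from the HTML-Tree test suite; it is a hypothetical version that assumes a current HTML-Tree install from CPAN and leans on the new_from_content constructor and the same_as method of HTML::Element to do the structural comparison.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module (HTML-Tree distribution), not core Perl

# Hypothetical wrapper: returns 1 when two HTML snippets parse to the same
# tree of elements and text, 0 otherwise.
sub same {
    my ($html_a, $html_b) = @_;
    my $tree_a = HTML::TreeBuilder->new_from_content($html_a);
    my $tree_b = HTML::TreeBuilder->new_from_content($html_b);

    # same_as compares tag names, explicit attributes, and content recursively.
    my $is_same = $tree_a->same_as($tree_b) ? 1 : 0;

    $_->delete for $tree_a, $tree_b;   # free the parse trees explicitly
    return $is_same;
}

# Omitted end tags versus explicit end tags, as in the example above.
my $result = same(
    '<ul><li>x<li>y</ul>after',
    '<ul><li>x</li><li>y</li></ul>after',
);
print $result ? "ok\n" : "not ok\n";
```

Comparing the trees rather than the source text is the point: the two snippets differ as strings, so only a structural check can treat them as equivalent.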

One thing that Michael Koehne suggested is ensuring continuity across versions by having tests that take a bit of HTML, parse it, dump the parse tree as text, and run a checksum on that text. The test then consists of making sure that the checksum stays the same across different HTML-Tree versions. He suggested MD5 for the checksum algorithm, but I'm hesitant about using it, since that would mean making HTML-Tree depend on the MD5 module. Maybe I'll just make the tests skip on sites that don't have the MD5 module installed. Anyone have other suggestions?
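The skip-if-missing idea could be sketched roughly as follows. The helper name dump_checksum and the stand-in dump text are assumptions made for illustration, and the sketch uses Digest::MD5 and Test::More rather than whatever the HTML-Tree tests actually ended up using.

```perl
use strict;
use warnings;
use Test::More tests => 1;

# Hypothetical helper: MD5 hex digest of a dumped parse tree, or undef when
# Digest::MD5 is not installed, so the caller can skip instead of failing.
sub dump_checksum {
    my ($dump_text) = @_;
    return undef unless eval { require Digest::MD5; 1 };
    return Digest::MD5::md5_hex($dump_text);
}

# Stand-in for the text dump of a parse tree (an assumption for illustration);
# a real test would dump the tree built from a fixed piece of HTML.
my $dump = "abc";

# Digest recorded from a known-good run; this one is the MD5 of "abc".
my $expected = '900150983cd24fb0d6963f7d28e17f72';

SKIP: {
    my $got = dump_checksum($dump);
    skip "Digest::MD5 not installed", 1 unless defined $got;
    is($got, $expected, "parse-tree dump checksum is unchanged");
}
```

Loading the module with require inside eval keeps the dependency optional: the distribution still installs and tests cleanly where the module is absent, at the cost of the check silently not running there.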

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • MD5 checks are going to be identical iff the two inputs are identical (for all practical purposes).

    If you don't want to put the MD5 of the canonical version in the test case, why not put the stringified Data::Dumper value in the test? No CPAN dependency that way. :-)

    • Well, you'd have to either eval it back in and do some deep comparison, or use the 5.8ism of $Data::Dumper::Sortkeys.

      I'd say use the MD5. If they don't previously have it, it will at least mean their CPAN.pm will start using it.
      --
        ---ict / Spoon
  • [MD5] is a core module in 5.8.0, just in case you hadn't noticed.

    • Just for the sake of pedantry ;) and in case someone doesn't know better, the MD5 module is deprecated, and Digest::MD5 is in 5.8.0.
  • "He suggested MD5 for the checksum algorithm; but I'm hesitant about using it, since that would mean making HTML-Tree have a dependency on the MD5 module."

    Why not use the unpack() checksum: $sum = unpack "%32C*", $string;


    • Because it doesn't catch transposition:

      DB<1> sub csum { unpack "%32C*", $_[0] }

      DB<2> x csum "+abc-"
      0 382
      DB<3> x csum "-abc+"
      0 382
      DB<4>
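
The alternatives raised in the comments above can be put side by side in one short script. This is only a sketch: stable_dump and the two sample hashes are hypothetical stand-ins for a real parse tree, and it assumes Perl 5.8 or later, where Digest::MD5 is core and Data::Dumper supports $Data::Dumper::Sortkeys.

```perl
use strict;
use warnings;
use Data::Dumper;
use Digest::MD5 qw(md5_hex);

# Additive checksum from the thread: a 32-bit sum of the byte values.
sub csum {
    my ($sum) = unpack "%32C*", $_[0];
    return $sum;
}

# Hypothetical helper: stringify a structure with sorted hash keys, so that
# equal data always produces equal text regardless of hash ordering.
sub stable_dump {
    my ($data) = @_;
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Indent   = 1;
    return Dumper($data);
}

my ($x, $y) = ("+abc-", "-abc+");

# Same bytes in a different order give the same sum, so the swap goes unnoticed.
printf "csum equal:  %s\n", csum($x) == csum($y) ? "yes" : "no";

# An MD5 digest depends on byte order, so the swap is detected.
printf "md5 equal:   %s\n", md5_hex($x) eq md5_hex($y) ? "yes" : "no";

# Same data written with keys in a different order; sorted keys make the
# dumps comparable as plain strings, with no checksum module involved.
my %first  = (tag => 'li', content => ['x']);
my %second = (content => ['x'], tag => 'li');
printf "dumps equal: %s\n",
    stable_dump(\%first) eq stable_dump(\%second) ? "yes" : "no";
```

Storing the dump text itself in the test makes a failure easier to diagnose than a changed digest does, while the digest keeps the test file small. Either way, the byte-sum checksum is the weakest of the three because it ignores ordering.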