All the Perl that's Practical to Extract and Report

use Perl

TorgoX (1933)


"Il est beau comme la retractilité des serres des oiseaux rapaces [...] et surtout, comme la rencontre fortuite sur une table de dissection d'une machine à coudre et d'un parapluie !" -- Lautréamont

Journal of TorgoX (1933)

Thursday September 26, 2002
05:43 AM

RTF in extremîs

[ #8025 ]
I spent all day working on Pod::Simple::RTF. I think the basic Pod::Simple framework is now quite mature; I only made two or three minute changes to it in the whole course of writing Pod::Simple::RTF.

Most of today's fiddling with the RTF thing was adding heuristics that almost no-one will notice, but which make things pretty -- things like "codeblocks of under 15 lines shouldn't be split across pages", and "A reasonably short heading followed by a paragraph shouldn't be split across pages". These aren't exactly trivial things with a tokeparser interface, but I pull 80/20 cheats here and there -- so under some very rare cases, a "keep this together with the next paragraph" code won't get generated. But no big deal; it's just hardcopies of docs.
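A minimal sketch of how keep-together heuristics like these might be expressed, not the actual Pod::Simple::RTF code. It assumes each block is emitted as a single RTF paragraph, so that the RTF control words `\keep` (keep the paragraph intact) and `\keepn` (keep it with the next paragraph) apply. The function name `keep_codes` and the 60-character heading limit are hypothetical; only the 15-line figure comes from the post.

```perl
use strict;
use warnings;

# Hypothetical thresholds. 15 is the figure from the post; 60 characters
# is only a guess at what "reasonably short" means for a heading.
my $MAX_KEEP_LINES    = 15;
my $MAX_HEADING_CHARS = 60;

# Return the RTF paragraph-formatting control word (or '') for one block.
# $kind is 'verbatim' or 'heading'; anything else gets no keep code.
sub keep_codes {
    my ($kind, $text) = @_;

    if ($kind eq 'verbatim') {
        my @lines = split /\n/, $text;
        # \keep asks the RTF reader not to split this paragraph across pages.
        return scalar(@lines) < $MAX_KEEP_LINES ? '\keep' : '';
    }
    if ($kind eq 'heading') {
        # \keepn asks the reader to keep this paragraph with the next one.
        return length($text) < $MAX_HEADING_CHARS ? '\keepn' : '';
    }
    return '';
}

print keep_codes('verbatim', "my \$x = 1;\nprint \$x;\n"), "\n";
print keep_codes('heading',  "DESCRIPTION"), "\n";
```

Deciding per block like this is cheap, but a heading can only safely get `\keepn` if something actually follows it, which is where a token-at-a-time interface makes the rare misses mentioned above hard to avoid.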

Benefits of a tokeparser framework: it's fast! It parses AND formats perlvar in under a second. And that's even as it has to go construct and destroy a few thousand objects along the way. (A point where I diverge from HTML::TokeParser's approach is that I have tokens be actual objects, with accessors, not just bare arrayrefs.)
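To illustrate what tokens-as-objects looks like in practice, here is a short sketch using Pod::Simple::PullParser as it ships with current Perl distributions (the interface at the time of this post may have differed). The sample Pod string and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Pod::Simple::PullParser;

# A tiny, made-up Pod document to pull tokens from.
my $pod = "=head1 NAME\n\nFoo - bar\n\n=cut\n";

my $parser = Pod::Simple::PullParser->new;
$parser->set_source(\$pod);    # a scalar ref is read as the document source

my (@starts, @texts);
while (my $token = $parser->get_token) {
    # Each token is an object, so we ask it questions via accessors
    # instead of poking at positions in an arrayref.
    if ($token->is_start) {
        push @starts, $token->tagname;
    }
    elsif ($token->is_text) {
        push @texts, $token->text;
    }
}

print "start tags: @starts\n";
print "text: @texts\n";
```

The accessor calls (`is_start`, `tagname`, `is_text`, `text`) are what replace the `$token->[0]`-style indexing of a bare-arrayref design.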

I also spent forever figuring out how to express Unicode characters in RTF -- horror itself, and shoddily supported even in MSWord 2000, but it's better than just doing s/[^\x00-\xff]/X/g;. I've got to at least try, since there's not much of an alternative.

I rather wish RTF allowed comments. I.e., things that an RTF processor (or rather, the RTF processor, since using anything but MSWord to interpret RTF can be pretty dicey) would discard, but which could be used for the equivalent of "<!-- and now we start the thing that might not work -->" or whatever. It'd just be handy for debugging.

Anyhow, the RTF thing is almost done (all that's left is the bit that automatically marks things that look like code as not spellcheckable), at which point I won't have to think about RTF in any real detail for quite a while.

By the way:

        s/([^\x00-\xFF])/'\\uc1\\u'.( (ord($1) < 32768) ? ord($1):(ord($1)-65536) ).'?'/eg;

That's what turns Unicode characters like "\x{4E4B}\x{9053}" into their RTF representation, «\uc1\u20043?\uc1\u-28589?». Whee!
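For anyone wanting to try it, here is the same substitution wrapped in a small runnable script. The function name `rtf_escape_wide` is hypothetical; the regex is the one from the post. The arithmetic is there because RTF's `\u` takes a signed 16-bit decimal number, so code points from 32768 up are written as negative values, and `\uc1` declares that one fallback character (here `?`) follows for readers that don't understand `\u`.

```perl
use strict;
use warnings;

# Escape every character above 0xFF as an RTF \u sequence.
# Hypothetical wrapper name; the substitution itself is from the post.
sub rtf_escape_wide {
    my ($text) = @_;
    $text =~ s/([^\x00-\xFF])/'\\uc1\\u'.( (ord($1) < 32768) ? ord($1) : (ord($1)-65536) ).'?'/eg;
    return $text;
}

my $rtf = rtf_escape_wide("\x{4E4B}\x{9053}");
print "$rtf\n";    # \uc1\u20043?\uc1\u-28589?
```

Here 0x4E4B is 20043, which is below 32768 and passes through unchanged, while 0x9053 is 36947, which becomes 36947 - 65536 = -28589. Note that this one-liner only covers code points up to U+FFFF; anything beyond that would need different handling, which is outside what the post describes.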

In other news, I'm coming to the slow and grudging realization that my book isn't terribly bad.

    • Well, once this RTF thing is done, then I think I should finish the Pod::Simple documentation, then see about a Pod::Simple::Man. I don't really know *roff, but I'll see how far I get by just retrofitting the current Pod::Man.
  • Your book is bloody good. It's one of those books that changes the way people work. After editing it, I was a lot more comfortable with using the web as a data source (and sometimes as a data sink). I couldn't have put together OSCON without it.


  • (A point where I diverge from HTML::TokeParser's approach is that I have tokens be actual objects, with accessors, not just bare arrayrefs.)

    Never having been a fan of HTML::TokeParser's arrayrefs, I wrote HTML::TokeParser::Simple. It provides the accessor methods I wanted and makes the code much easier to read. Want to know if a token is a starting or ending form tag? With HTML::TokeParser, you do this:

        if( ('S' eq $token->[0] or 'E' eq $token->[0]) and 'form' eq $toke