All the Perl that's Practical to Extract and Report

use Perl

ddick (5726)

I'm based out of Melbourne, Australia. I attend the excellent melbourne.pm.org meetings whenever I get the chance, which is not often enough.

Journal of ddick (5726)

Sunday November 02, 2008
03:55 AM

Multi-byte Unicode and PDF

[ #37784 ]

After an excellent talk on the Encode module by Stephen Edmonds and a close encounter with the µ symbol, I've been playing with the various Unicode symbols a lot.

A missing component seems to be putting multi-byte Unicode into a PDF document. Try this little beggar 狗 on for size.

PDF::API2, htmldoc and html2ps all seem to have problems when the encoding uses more than one byte to represent one character.
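
To make "more than one byte per character" concrete, here is a minimal sketch using only the core Encode module. The helper names `byte_length` and `hex_bytes` are made up for this illustration, and the two characters are written as escapes (U+00B5 for µ, U+72D7 for 狗) so the script does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes a character string
# occupies once encoded in the named encoding.
sub byte_length {
    my ($encoding, $chars) = @_;
    return length(encode($encoding, $chars));
}

# Hypothetical helper: the encoded bytes as space-separated hex pairs.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode($encoding, $chars);
}

for my $char ("\x{B5}", "\x{72D7}") {
    printf "U+%04X: %d byte(s) in UTF-8 (%s)\n",
        ord($char), byte_length('UTF-8', $char), hex_bytes('UTF-8', $char);
}
# U+00B5: 2 byte(s) in UTF-8 (C2 B5)
# U+72D7: 3 byte(s) in UTF-8 (E7 8B 97)
```

The µ symbol fits in a single byte in ISO-8859-1 but needs two in UTF-8, while 狗 needs three bytes in UTF-8 and has no ISO-8859-1 representation at all, which is presumably where single-byte assumptions in a tool start to break.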

Anyone know a tool that can do the job?

  • I spent a few days just this week pinning down a bug in PDF::API2 that prints some characters from Unicode strings (in this case, just the Polish character set) on top of each other.

    I nailed it. I just have to feed it back to its maintainers.

    I have not tried it with your character. I don't even have a clue which character set it belongs to (something CJK, but that's it), and thus which font I ought to use.

    • I spent a few days just this week pinning down a bug in PDF::API2 that prints some characters from Unicode strings (in this case, just the Polish character set) on top of each other.

      I nailed it. I just have to feed it back to its maintainers.

      Just stumbled over the same problem, using Croatian characters. Do you have a patch/solution/workaround? If so, can you send it to me?

      • When was the last time you updated PDF::API2? It is fixed in the latest release on CPAN (0.72).
        • Indeed. I first tried Debian's current package, which has $PDF::API2::VERSION set to 2.015, just like the current CPAN version, so I thought it was the current one. But Debian's libpdf-api2-perl is only based on 0.69.
  • The maintainer of PDF::Reuse [cpan.org] accepted my patch to add this functionality earlier this year.

    It's my understanding that if you stick to the built-in PDF fonts you're stuck with characters in the Latin-1 range (roughly speaking). You have to use embedded fonts to get at Unicode characters outside that range.

    • That's correct. Appendix D of the PDF Reference explicitly lists the minimum glyphs that must be supported in the 14 standard fonts.

      That said, I would not be surprised if non-Latin-1 Unicode characters worked fine in one of the basic fonts on a recent mainstream OS. To get Unicode in strings, you may need to use the hex notation (angle brackets).
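
As a rough illustration of the hex notation mentioned in the last comment, the sketch below builds an angle-bracket hex string from a Perl character string using the core Encode module. It assumes the UTF-16BE form with a leading U+FEFF byte order mark, which is how PDF text strings (document info, outlines and the like) are marked as Unicode. The function name `pdf_hex_string` is invented for this example. Whether such a string actually renders inside a page's content stream depends on the font and its encoding, so treat this as a sketch of the notation rather than a fix for the original problem.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the text as a PDF hex string,
# encoded as UTF-16BE and prefixed with the U+FEFF byte order mark.
sub pdf_hex_string {
    my ($chars) = @_;
    my $bytes = encode('UTF-16BE', "\x{FEFF}" . $chars);
    return '<' . uc(unpack('H*', $bytes)) . '>';
}

print pdf_hex_string("\x{72D7}"), "\n";   # prints <FEFF72D7>
```

Even with the string encoded this way, the font still has to contain a glyph for the character, which is why the embedded-font advice above applies to anything outside the standard 14 fonts' repertoire.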