NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

  • I had a long reply using Text::Unidecode here, but use.perl.org *really* doesn't want to format things the way I want it to (half the time it seems to double-encode my unicode, and never do multiline code or pre tags!), so I'll try using words instead of pictures to explain what I'm trying to talk about. First, the easy question: How are those alternate readings sorted? It doesn't seem to be by first ms with that reading, nor by number of readings -- is it just hash order? Second, the hard question -- wh
    • half the time it seems to double-encode my unicode

      Put posts and comments through “encode 'us-ascii', $your_post, Encode::HTMLCREF”. That will make them come out as intended.
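
      A minimal sketch of that call as a standalone script. It assumes the documented fallback constant Encode::FB_HTMLCREF is what is meant (it also leaves the source string untouched), and the sample text is made up for illustration:

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);

      # Hypothetical sample post: three Armenian letters written as code points.
      my $your_post = "Armenian: \x{540}\x{561}\x{575}";

      # Anything that does not fit in US-ASCII becomes a decimal character
      # reference (&#NNNN;); plain ASCII passes through unchanged.
      my $ascii_post = encode('us-ascii', $your_post, Encode::FB_HTMLCREF());

      print "$ascii_post\n";    # Armenian: &#1344;&#1377;&#1397;
      ```

      The result is pure ASCII, so it should survive a Latin-1 form submission unchanged.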

      and never do multiline code or pre tags!

      That’s on purpose; Slashcode has its own special <ecode> tag for that purpose (whose distinguishing features are: 1. you can write raw angle brackets and ampersands inside, and Slash will turn them into entities for you; 2. it uses <pre>, so very long lines will wrap

      • Slashcode has its own special <ecode> tag for that purpose (whose distinguishing features are: 1. you can write raw angle brackets and ampersands inside, and Slash will turn them into entities for you;

        This is the part that doesn't play nicely with UTF-8, actually, although the <ecode> tag is almost always what I want - the Armenian characters get converted into entities upon comment submit, and those entities themselves have their ampersands turned into entities upon ecode conversion.
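
        The two steps can be reproduced in a few lines. This is only a sketch: escape_like_ecode is a hypothetical stand-in for whatever <ecode> does internally, not Slash's actual code, and the entity conversion on submit is simulated with Encode:

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        # Hypothetical stand-in for the escaping that <ecode> applies.
        sub escape_like_ecode {
            my ($text) = @_;
            $text =~ s/&/&amp;/g;
            $text =~ s/</&lt;/g;
            $text =~ s/>/&gt;/g;
            return $text;
        }

        my $typed = "\x{540}";    # one Armenian letter, as typed into the form

        # Step 1: on submit, the character is turned into an entity.
        my $submitted = encode('us-ascii', $typed, Encode::FB_HTMLCREF());

        # Step 2: the ecode conversion escapes the entity's own ampersand.
        my $stored = escape_like_ecode($submitted);

        print "$submitted\n";    # &#1344;
        print "$stored\n";       # &amp;#1344;
        ```

        The second line is what ends up displayed literally instead of the Armenian letter.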

        • The conversion to entities is your browser’s doing, actually. It sees that the form should be submitted in ISO-Latin1, so it turns all the non-Latin1 characters into entities. Slashcode can’t actually know that you didn’t mean to send them that way. There is therefore no way to get around this.

          All you can do is use plain <code> tags with <br> tags for linebreaks, sequences of &nbsp; for tabs, and manual escaping for ampersands and less-thans. It’s a pain to do manually, but a tolerable amount of work with a good editor.
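
          If no editor macro is at hand, the same escaping can be scripted. A rough sketch, assuming a tab should become four &nbsp; entities (that width is a guess) and using a hypothetical helper name, to_plain_code:

          ```perl
          use strict;
          use warnings;

          # Hypothetical helper: prepare a snippet for plain <code> tags.
          sub to_plain_code {
              my ($source, $tab_width) = @_;
              $tab_width = 4 unless defined $tab_width;    # assumed default

              my $html = $source;
              $html =~ s/&/&amp;/g;    # ampersands first, so later entities are not re-escaped
              $html =~ s/</&lt;/g;     # then less-thans

              my $indent = '&nbsp;' x $tab_width;
              $html =~ s/\t/$indent/g;    # tabs become runs of &nbsp;
              $html =~ s/\n/<br>\n/g;     # linebreaks become <br>

              return "<code>$html</code>";
          }

          print to_plain_code("if (\$x < \$y) {\n\tprint \"x && y\";\n}"), "\n";
          ```

          Leading spaces are left alone here; only tabs are expanded, so space-indented code would need one more substitution.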