All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • That's a very clear explanation of what I've thought for some time, but was unable to phrase.

    A quick way to upgrade yourself from level 2 to 3 is reading http://juerd.nl/site.plp/perluniadvice [juerd.nl] ;-).
    • Thank *you* for the good tutorial.

      So, I've been thinking there need to be some standards for CPAN modules to declare whether they accept and return strings or bytes (if they need to handle both).

      For instance, HTML::Parser has an instance method called utf8_mode [cpan.org].

      Another example (the one that triggered me to write this entry) is Catalyst's uri_for() method [cpan.org]. In some release the developers changed the implementation to accept only strings (UTF-8 flagged or not) in its %query_values hash.

      Based on the complaints and patches made by
      • Strings or bytes is not the right distinction, because both kinds are strings. I usually call them "text string" and "binary string", or "character string" and "byte string". Sometimes I call the former "Unicode string" to emphasize that all text strings are Unicode strings.
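A minimal sketch of the two kinds of string, assuming the core Encode module. The variable names are illustrative only. It shows the same text once as a character string and once as a UTF-8 byte string, and that the encoded form carries no UTF8 flag even though it represents characters.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A text (character) string: four characters, the last one is U+00E9.
my $text = "caf\x{e9}";

# A byte (binary) string: the same text encoded as UTF-8, five bytes.
my $utf8_bytes = encode('UTF-8', $text);

# Decoding the bytes gives back a text string equal to the original.
my $round_trip = decode('UTF-8', $utf8_bytes);

printf "text string:  %d characters\n", length($text);
printf "UTF-8 string: %d bytes, UTF8 flag %s\n",
    length($utf8_bytes), (utf8::is_utf8($utf8_bytes) ? "on" : "off");
printf "round trip equals original: %s\n",
    ($round_trip eq $text ? "yes" : "no");
```

The lengths differ because length() counts characters in the first case and bytes in the second. Nothing in the values themselves says which kind they are, so the program has to keep track.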

        A trap is the UTF-8 string, which is a byte string representing characters, and has "the flag" off (which to perluninewbies is confusing because this flag is called UTF8). Compare this with the result of pack "N*", LIST, which is a byte
        • Hm, just to clarify, I prefer to use characters vs. bytes like you say. If I sometimes use "strings" somewhere, it's just a slip of keystrokes, or I meant Unicode strings instead.

          Also, I'm a bit afraid that you misunderstood what I meant by mentioning bytes.pm. I didn't mean we should call "use bytes" in this situation to force string operations to be byte-wise. Not at all.

          I meant declaring "use bytes" *might be* a good way for programmers to tell the module authors "Hey I want this module to do what
          • "use bytes;" is lexical: it cannot influence what a module does. I don't know who to thank for this, but I'm happy that at least my code won't be broken at a distance by the numerous uninformed and misinformed people who throw a "use bytes" at their code to replace one kind of (for them) vague behavior with another kind of vague behavior. :)
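The lexical scoping can be sketched as follows. My::Module and char_count are hypothetical names standing in for any CPAN module; the point is only that a "use bytes" in the caller's block does not reach code compiled elsewhere.

```perl
use strict;
use warnings;

{
    package My::Module;    # hypothetical module, compiled without "use bytes"

    # length() counts characters here, whatever the caller has declared.
    sub char_count {
        my ($string) = @_;
        return length $string;
    }
}

package main;

my $text = "\x{263A}";     # one character: WHITE SMILING FACE

my ($len_in_caller, $len_in_module);
{
    use bytes;             # in effect only until the end of this block

    # Here length() counts the bytes of Perl's internal encoding.
    $len_in_caller = length $text;

    # The module's code was compiled outside this block, so it is unaffected.
    $len_in_module = My::Module::char_count($text);
}

print "caller sees $len_in_caller, module sees $len_in_module\n";
```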

            Experience has shown so far that the only workable way of supporting both byte strings and text strings in your function is to provide two separate functions, or a mechanism to indicate what kind of string you're passing. My BLOB thing would be a standardized way of saying "this is a byte string, not a text string" that is very probably drop-in compatible with existing code.

            With the BLOB you're effectively saying "I DON'T WANT UNICODE HERE", but you're still dependent on the module author to comply. Fortunately, scanning documentation for the word "BLOB" is easily done :)
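A rough sketch of the "two separate functions" approach, using the core Digest::MD5 and Encode modules. The names digest_bytes and digest_text are made up for illustration, and this does not use or imitate BLOB.pm; it only shows one way a module could make the caller state which kind of string is being passed.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Hypothetical entry point for byte strings: digest the bytes as they are.
sub digest_bytes {
    my ($bytes) = @_;

    # Work on a copy; utf8::downgrade with a true second argument
    # returns false instead of dying when a character is above 0xFF.
    my $copy = $bytes;
    utf8::downgrade($copy, 1)
        or croak "digest_bytes: argument is not a byte string";

    return md5_hex($copy);
}

# Hypothetical entry point for text strings: encode to UTF-8 first.
sub digest_text {
    my ($text) = @_;
    return md5_hex(encode('UTF-8', $text));
}

# Same result whether the caller encodes the text or the module does.
print digest_text("\x{263A}"), "\n";
print digest_bytes("\xE2\x98\xBA"), "\n";
```

Passing a text string with characters above 0xFF to digest_bytes croaks instead of silently guessing an encoding, which is the failure mode the separate functions are meant to make visible.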
            • Agreed on both: we should use two different functions to accept characters or bytes, and also BLOB.pm would be useful to DWIM. :)