All the Perl that's Practical to Extract and Report

masak (6289)

http://masak.org/carl

Been programming Perl since 2001. Found Perl 6 somewhere around 2004, and fell in love. Now developing November (a Perl 6 wiki), Druid (a Perl 6 board game), pls (a Perl 6 project installer), GGE (a regex engine), and Yapsi (a Perl 6 implementation). Heavy user of and irregular committer to Rakudo.

Journal of masak (6289)

Monday July 06, 2009
09:06 AM

Str and Buf -- I think I get it now

[ #39236 ]

So Str and Buf aren't merely there in Perl 6 to separate out the two related concepts "sequence of characters" and "sequence of bytes", respectively. They're there to institute a Whorfian discipline where it won't even be possible to think the wrong thoughts about strings and sequences of bytes.

This is the natural consequence of working Joel Spolsky's "There Ain't No Such Thing As Plain Text" into the language design. By all means, treat real strings as Str, but make sure byte sequences can't cross that moat without being decoded somehow. Perl 5 totally blurs the distinction, as moritz++ explains. It is not alone among programming languages in that respect. In fact, I'd be interested to hear about some language that makes the same Str/Buf distinction as Perl 6.

It took me a while to reach this point, but now it seems perfectly obvious. I remember, only a few weeks ago, being shaken by TimToady's claim that Strs do not know their byte sequence in the general case. But I see it now. A Perl 6 string is not a sequence of bytes. It's a sequence of characters, at least by default. Likewise, a Buf is not a sequence of characters, not even metaphorically. It's a sequence of integer values. And the difference isn't some picky play with words, but the encoding/decoding step itself.
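As a rough analogy from outside Perl 6, Python 3 draws a comparable line between its str and bytes types. The sketch below is Python, not Perl 6, and every variable name in it is made up for illustration; it only shows what "the encoding/decoding step is the difference" can look like in practice.

```python
# Python 3 analogy (not Perl 6): str holds text, bytes holds integer values.
text = "h\u00f6h\u00f6"          # four characters: h, ö, h, ö

# Crossing from text to bytes requires an explicit encoding step.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

# The same text yields different byte sequences under different encodings,
# so the text alone does not determine "its" bytes.
encodings_differ = as_utf8 != as_latin1

# Indexing a bytes object gives an integer, not a character.
first_byte = as_utf8[0]

# Mixing the two types without decoding is refused.
try:
    text + as_utf8
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding with the matching encoding recovers the original text.
round_trip = as_utf8.decode("utf-8")
```

Here the only way across the moat in either direction is encode or decode, each naming an encoding.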

This is the act of building knowledge into the class hierarchy of the language itself, so that people's thoughts will be channeled in the right direction. "Arrgh, why can't I get the number of bytes on this Str object? Oh look, the manual says I have to convert to Buf to do that. Oh look, I have to supply an $encoding parameter to do that. I don't see why, but fair enough — if that's the price for using all the other cool stuff in Perl 6."
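One way to see why a byte count needs an encoding parameter: the same string occupies a different number of bytes under each encoding. This is again a Python sketch rather than Perl 6, and byte_length is a hypothetical helper, not part of any library.

```python
def byte_length(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


word = "sm\u00f6rg\u00e5sbord"   # "smörgåsbord", 11 characters

# One string, several possible answers to "how many bytes?".
sizes = {
    enc: byte_length(word, enc)
    for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}
```

Without the encoding argument the question has no single answer, which is presumably why the conversion asks for one.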

The battle for encoding-aware programs is won, not necessarily by making programmers aware of encodings, but by making the language provide primitives that Do The Right Thing.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Just a quibble, but I believe it is more correct to say a string in Perl 6 is made up of graphemes rather than characters. The two Perl 5 strings "\x{F6}" (LATIN SMALL LETTER O WITH DIAERESIS) and "\x{6F}\x{308}" (LATIN SMALL LETTER O and COMBINING DIAERESIS) should be the same string in Perl 6 (unless the codes pragma is turned on).

    • Yes, you are right. The Perl 6 spec is full of references to "characters", but in a few places it mentions that this term defaults to being the same as "graphemes".

      I think my growing familiarity with Unicode is not yet at the stage where I immediately reach for the term "graphemes". :) Maybe some day.
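The two spellings from the quibble above can be compared in a language whose strings are sequences of code points rather than graphemes. The Python sketch below uses the standard unicodedata module; it illustrates the code-point view only and makes no claim about how a Perl 6 implementation behaves.

```python
import unicodedata

precomposed = "\u00f6"     # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"     # LATIN SMALL LETTER O + COMBINING DIAERESIS

# Python compares code points, so the two spellings are different strings.
same_codepoints = precomposed == decomposed

# After normalising both to NFC, they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# Code-point lengths differ even though both render as a single "ö".
lengths = (len(precomposed), len(decomposed))
```

A grapheme-based string type would treat both as one character, which is the behaviour the commenter describes for Perl 6.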

  • This "you can't think the wrong thoughts" target reminds me of the novel Babel-17 by Samuel Delany, which uses the same sort of concept, but in a human language that causes anyone who thinks in it to automatically think the right thoughts in a very powerful way.
    • I think the idea is sufficiently old. Umberto Eco details various attempts made over the years in his book The Search for a Perfect Language [amazon.com]. Newspeak in 1984 [amazon.com] has words selectively pruned from it so that dangerous thought becomes difficult or impossible.

      • I think it is strongly related to the Sapir-Whorf Hypothesis - http://en.wikipedia.org/wiki/Linguistic_relativity
  • So if I want to find a “magic number” byte sequence in a binary file, how do I do that in Perl 6?

    • According to S32/IO [perlcabal.org], the return type of slurp (which reads a whole file at once) is Str|Buf. A Buf is returned when a parameter :bin is passed to slurp. After that, you can treat the Buf you get as an array (because Buf does Positional) and do whatever advanced indexing operations you need to find your byte sequence.

      I wish I could show this with real, working code, but Buf isn't implemented just yet in Rakudo.

      • What kind of pattern matching facilities does Buf support?

        • The spec is a bit silent on that point, so I asked on #perl6 [perlgeek.de]. The conclusion seems to be "convert it to a string if you want to pattern match".

          Then again, if smartmatching with list semantics is what you're after, that should work. Something like $buf ~~ (*, 104, 101, 108, 108, 111, *) to find "hello" in an ASCII-encoded Buf.
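Since Buf was not yet available in Rakudo, here is the same search sketched in Python instead, where bytes objects support substring search directly. The data is invented for the example, and find_bytes is a hypothetical helper; a real program would read the file in binary mode rather than build the bytes by hand.

```python
def find_bytes(data: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of `needle`, or -1 if absent."""
    return data.find(needle)


# Made-up file contents; a real program would use open(path, "rb").read().
data = bytes([0x00, 0x01]) + "hello".encode("ascii") + bytes([0xFF])

# "hello" in ASCII, the same integer values as in the smartmatch above.
needle = bytes([104, 101, 108, 108, 111])
position = find_bytes(data, needle)

# A magic-number check is a prefix test on raw bytes (here the PNG signature).
has_png_magic = data.startswith(b"\x89PNG\r\n\x1a\n")
```

No decoding is needed at any point: the needle and the haystack are both sequences of integers.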

  • In Java, a java.lang.String and a byte[] have nothing in common, and there are no (non-deprecated) ways of converting between them without specifying an encoding. It's one of the very few features I like about Java…
    • There are (at least) two things wrong with Java's encoding support:
      1. No way to avoid UnsupportedEncodingException, even for UTF-8, which is guaranteed to be present.
      2. The concept of a "system default encoding" is flawed and leads to bugs in portability. You should be forced to always specify an encoding.
        1. true
        2. I thought all methods that converted without specifying an encoding were deprecated… anyway, yes, implicit encodings are a very bad idea

        and while we're at it,

        3. internal encoding is utf-16, and it's visible at the language level, so that the "length" method gives you completely useless information

        • Very true about the 16 bit character. Thankfully, it's less of a problem for me right now.
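The UTF-16 point can be made concrete: a length measured in UTF-16 code units disagrees with the code-point count for any character outside the Basic Multilingual Plane. The sketch below is Python, with utf16_code_units as a hypothetical helper that mimics what a UTF-16-based length reports.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, the quantity a UTF-16-based length reports."""
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

code_points = len(clef)               # Python counts code points
code_units = utf16_code_units(clef)   # a surrogate pair takes two units
```

For text that stays within the Basic Multilingual Plane the two counts agree, which may be why the problem goes unnoticed for a long time.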