
All the Perl that's Practical to Extract and Report

use Perl

jdavidb (1361)

http://voiceofjohn.blogspot.com/

J. David Blackstone has a Bachelor of Science in Computer Science and Engineering and nine years of experience at a wireless telecommunications company, where he learned Perl and never looked back. J. David has an advantage in that he works really hard, he has a passion for writing good software, and he knows many of the world's best Perl programmers.

Journal of jdavidb (1361)

Friday May 07, 2004
09:19 AM

Why can't I -B $scalar ?

[ #18667 ]

I want to be able to apply the -B test to the contents of an arbitrary scalar to see if it's binary or not. I've got files that are occasionally spewing junk at me; the first N,000,000 records may be just fine, but toward the end they turn into gibberish. I want to print out erroneous records, unless they are binary garbage, in which case I just want to print a statement that says so.

Thinking of looking into IO::Scalar or something...

  • If I haven't been misinformed, -B uses the current STDIO buffer of the filehandle for its value, so you could seek near the end and possibly get a different result for -B as the file became "more binary".
    --
    Randal L. Schwartz
    Stonehenge
    • For the record, I decided to just match each line against \0 as I read it, and that seems to work fine for now. Not quite as advanced a heuristic as -B, but good enough.

      --
      J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
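The NUL check described in the comment above might look roughly like the following. This is only a sketch: the helper names `has_nul` and `describe_bad_record` are made up for illustration and are not from the original post, and it assumes each record has already been read into a scalar.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the record contains a NUL byte.
sub has_nul {
    my ($record) = @_;
    return $record =~ /\0/ ? 1 : 0;
}

# Hypothetical reporting helper: print bad text records in full, but
# only announce binary ones instead of dumping the garbage.
sub describe_bad_record {
    my ($record) = @_;
    return has_nul($record)
        ? "bad record: binary garbage"
        : "bad record: $record";
}
```

A caller would run `describe_bad_record($line)` on each record that fails validation. The check is cruder than -B: garbage that happens to contain no NUL byte still gets printed.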
  • The heuristics seem to be pretty simple... if you ignore fancy bits like Unicode, locales, EBCDIC, MS-DOG line endings, and accept also the vertical tab as whitespace, the test for -B is pretty much

    3 * tr/\0-\x07\x0e-\x1a\x1c-\x1f\x7f-\xff/\0-\x07\x0e-\x1a\x1c-\x1f\x7f-\xff/ > length

    That is, printable ASCII, whitespace (plus backspace), and ESC are okay;
    everything else is not, and if more than 1/3 of the characters are not okay, call it binary.
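Wrapped in a subroutine, the simplified test above could be applied to any scalar. This is a sketch of that simplified heuristic only, not of what perl does internally; the name `looks_binary` and the choice to treat an empty string as text are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the simplified -B-style heuristic to a scalar.
sub looks_binary {
    my ($data) = @_;

    # Assumption: treat empty data as text rather than binary.
    return 0 unless length $data;

    # With an empty replacement list and no /d, tr/// leaves the string
    # unchanged and simply returns the number of matching characters.
    my $odd = ($data =~ tr/\0-\x07\x0e-\x1a\x1c-\x1f\x7f-\xff//);

    # Binary if more than one third of the characters are "odd".
    return 3 * $odd > length($data) ? 1 : 0;
}
```

For example, `looks_binary("hello world\n")` is false and `looks_binary("\0\1\2\3abc")` is true. A single NUL in otherwise clean text stays below the one-third threshold, so this sketch and the plain `\0` match can disagree on the same record.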
    • That seems like the kind of thing that would be nice to expose in a function, in much the same way uc, lc, and glob started their lives.

      --
      J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
      • > That seems like the kind of thing that would be nice to expose in a function, in much the same way uc, lc, and glob started their lives.

        I dunno... I think the heuristic is so weak (false positives for arguably "text" data, for example), and by definition a binary test (ta-dah!) rather than a multivalued one, that I find little value in exposing that logic. I think adding my snippet to the FAQ, for example, should be quite enough for those who need it.