All the Perl that's Practical to Extract and Report

Journal of barbie (2653)

Friday October 25, 2002
09:25 AM

Fictitious standards?

[ #8593 ]
I was dubious about the Caveats in the POD for Text::CSV, since they give no reference to any standard or to where the author drew his conclusions.

From MyFileFormats.com I found this CSV definition. Nowhere is it prejudiced against non-US users of the format, so why does Text::CSV insist on:

Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde).

Nowhere does the specification I found (and it wasn't easy to find!) make an assumption about what can be inside a field. As long as it's contained in quotes, it's valid. As it should be.

The reason I'm taking issue with this is that we have a currency field in our CSV. As we are in the UK, we quite rightly use a £ symbol. Text::CSV rejects it as invalid, even if the field is contained in quotes as the specification states. According to the Text::CSV documentation, this also means that no European language characters, other currency symbols (e.g. the Euro or the Yen) or special symbols (e.g. ® or ©) are ever allowed to appear in a CSV file. I wonder if the producers of spreadsheet applications that can save to CSV realise they write out illegal characters?
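To make the restriction concrete, here is a minimal sketch of the character rule quoted above from the Text::CSV POD. It is not the module's own code: field_is_allowed is a hypothetical helper, and the pound sign is assumed to arrive as the single Latin-1 byte 0xA3.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV): true if every byte of the
# field is tab (0x09) or lies in the range 0x20 (space) to 0x7E (tilde).
sub field_is_allowed {
    my ($field) = @_;
    return $field =~ /\A[\x09\x20-\x7E]*\z/ ? 1 : 0;
}

my $plain    = "Widget, large";
my $sterling = "\xA3" . "12.50";   # assumed Latin-1 pound sign, then the amount

for my $field ($plain, $sterling) {
    printf "%s: %s\n", $field, field_is_allowed($field) ? 'accepted' : 'rejected';
}
```

Under that rule the sterling field fails whether or not it is quoted, because the check looks only at the bytes.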

Then again Text::CSV is over 5 years old and still at version 0.01! Seeing as the author hasn't written anything else, I wonder if they've disappeared?

Is this another module I'm going to have to look at and attempt to patch? I seem to be finding a lot of inaccurate or restricted modules of late!

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • First, there's not really a "standard" for CSV. It really means whatever someone wants to throw at you. I had a project last year where multiple business partners would send me "CSV" data, and no two were the same. Some quoted every field. Some only quoted fields that needed it. Some escaped double quotes by doubling them. Some used backslashes. It was a mess.

    Second, don't use Text::CSV. Use Text::CSV_XS [cpan.org]. It's got far more parameters for your tuning enjoyment.

    --
    xoa

    • I'm pretty sure Text::CSV_XS is the successor to Text::CSV. It's always a good idea to search CPAN [cpan.org] and look for more recent modules.

      For even more enjoyment, see if you can make use of DBD::CSV.

      --
      J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
      • DBD::CSV was my first choice, however the file we are being sent contains additional record types, which include 1 or more comment records (a ' as the first character) and 1 header record (a # as the first character).

        Plus it was easier to parse the file directly rather than store it locally, parse it, then delete it.
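The record types described in this comment can be told apart by their first character before any CSV parsing happens. The following is only a sketch under that assumption: record_type is a hypothetical helper and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical classifier: a leading ' marks a comment record,
# a leading # marks the header record, anything else is data.
sub record_type {
    my ($line) = @_;
    return 'comment' if $line =~ /\A'/;
    return 'header'  if $line =~ /\A#/;
    return 'data';
}

# Made-up sample lines, not taken from the real file.
my @lines = (
    "'supplied by a business partner",
    '#id,description,price',
    '1,"Widget","12.50"',
);

for my $line (@lines) {
    printf "%-8s %s\n", record_type($line), $line;
}
```

Only the lines classified as data would then be handed to the CSV parser, which is easy to do while reading the file directly.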

    • But it still has the weird notion of not allowing us to use our alphabet unless we enter binary mode, which disables any check on characters.
      Useful, but I do have a hard time explaining why you have to use binary mode to write non-binary data!

      I would love for it to have an eight-bit mode, where control characters are forbidden, i.e. 0x00-0x1f, 0x7f-0x9f and 0xff (if I got my ranges right). Of course this would annoy M$ users, who have some printable characters embedded in the high control range (0x80-0x9f).
      • This was the issue I had with it. Why should I have to switch to binary just to use the extended character set? The fix I made, apart from cleaning up the bizarre nesting and the blank lines that confused the layout of blocks, was to add the following chunk to the _bite() function, just before the last "} else {" line:

        } elsif ($in_quotes) {
            # an extended character in quotes: copy it to the current field...
            $$piece_ref .= substr($$line_ref, 0, 1);
            # ...and remove it from the front of the remaining line
            substr($$line_ref, 0, 1) = '';

        Well, it does the job.

    • Text::CSV_XS seemed like overkill for what I wanted. I have my own patch to Text::CSV now, which handles extended characters, provided they are contained within quotes.

      Your example still follows the standard as I understand it. Fields can have quotes around them, or the quotes can be omitted if the field doesn't contain the quote character or the field separator. The standard way of escaping double quotes is to double them. Much like SQL in that respect.
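The quoting rules described here can be sketched in a few lines of plain Perl. This is an illustration only, not the Text::CSV implementation: split_csv_line is a hypothetical helper that assumes a comma separator, optional double quotes, a doubled quote as the escape, and no newlines inside fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split one CSV line into its fields.
sub split_csv_line {
    my ($line) = @_;
    my @fields;
    pos($line) = 0;
    while (1) {
        if ($line =~ /\G"((?:[^"]|"")*)"(?=,|\z)/gc) {
            # Quoted field: any character is allowed, "" stands for one quote.
            (my $value = $1) =~ s/""/"/g;
            push @fields, $value;
        }
        elsif ($line =~ /\G([^",]*)(?=,|\z)/gc) {
            # Unquoted field: may not contain a quote or the separator.
            push @fields, $1;
        }
        else {
            die "Malformed CSV near offset " . pos($line) . "\n";
        }
        last unless $line =~ /\G,/gc;   # consume the separator or stop
    }
    return @fields;
}

my @fields = split_csv_line('1,"Say ""hi"", please",plain');
print join(' | ', @fields), "\n";
```

Because the quoted branch accepts any character other than an unescaped quote, a quoted field containing a £ passes through untouched.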

      • Text::CSV is quite a bare module, which will be updated *very* soon now.

        The new Text::CSV will include a pure-Perl version of Text::CSV_XS and will itself be just a wrapper. If Text::CSV_XS is installed, it will use it; otherwise, it will use the bundled Text::CSV_PP (or Text::CSV_PurePerl as the snap currently states).

        Text::CSV_XS is much faster than the pure-Perl version(s).
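The "use the XS module if installed, otherwise fall back" behaviour can be sketched with a guarded require. This shows the general technique only and is not the wrapper's actual code; pick_csv_backend is a hypothetical helper, and either module may be absent on a given system.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the first candidate module
# that loads, or undef if none of them is installed.
sub pick_csv_backend {
    my @candidates = @_;
    for my $module (@candidates) {
        (my $file = "$module.pm") =~ s{::}{/}g;   # Text::CSV_XS -> Text/CSV_XS.pm
        return $module if eval { require $file; 1 };
    }
    return;
}

my $backend = pick_csv_backend('Text::CSV_XS', 'Text::CSV_PP');
print defined $backend ? "Using $backend\n" : "No CSV backend installed\n";
```

Listing the XS module first means the faster implementation wins whenever it is available.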

        See also http://www.perlmonks.org/?node=617577 [perlmonks.org]
        --
        Enjoy, have FUN! H.Merijn