runrig (3385)


Just another perl hacker somewhere near Disneyland

I have this homenode of little consequence on Perl Monks that you probably have no interest in whatsoever.

I also have some modules on CPAN, some of which are marginally more useful than others.

Journal of runrig (3385)

Wednesday August 22, 2007
04:27 PM

XML parser errors

[ #34181 ]
A colleague was developing a PowerCenter transfer with an XML file as the source. She kept getting an error that referenced a line number in the file, but there didn't appear to be anything wrong with the indicated line. I ran the file through libxml (via perl), and it reported the error on a different line. Then the error was obvious: the encoding claimed to be UTF-8, but the file contained characters such as ë and Ä that were not encoded as UTF-8. Changing the declared encoding to ISO-8859-1 seemed to fix it. I'm not sure yet whether the supplier of the file will fix it or we'll have to fix their tag-soup gunk ourselves (there is as yet no perl involved in the process, so I'm not sure if Grant's Rule applies). I went to Google to look for other info on PowerCenter, XML, and line numbers, and not far from the top of the list were my own posts here on use.perl. Now with this post, I may show up even higher :-)
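One way to locate this kind of problem without relying on a parser's line numbers is to strictly decode the file one line at a time. The following is a minimal sketch, assuming the core Encode module; the helper name `first_bad_utf8_line` and the in-memory sample are made up for illustration and are not part of the original process.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return the 1-based number of the first line whose bytes are not valid
# UTF-8, or undef if every line decodes cleanly.
sub first_bad_utf8_line {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        # Strict decode: dies on malformed input instead of substituting U+FFFD.
        my $ok = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC); 1 };
        return $. unless $ok;
    }
    return;
}

# Hypothetical sample: line 2 holds a raw Latin-1 e-diaeresis (byte 0xEB).
my $sample = qq{<?xml version="1.0" encoding="UTF-8"?>\n<name>No\xEBl</name>\n};
open(my $fh, '<:raw', \$sample) or die "Cannot open in-memory sample: $!";
my $bad = first_bad_utf8_line($fh);
close($fh);
print defined $bad
    ? "First invalid UTF-8 on line $bad\n"
    : "All lines are valid UTF-8\n";
```

For a real file, the in-memory handle would be replaced by one opened on the file with the `:raw` layer, so that the bytes are checked exactly as they are stored on disk.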
  • After informing the source that the file was not valid "UTF-8" encoded XML, they sent another file. Same problem. This time I told them exactly what's wrong..."These characters need to be encoded like this...". I'm sure they're just pushing the "export as XML" button on whatever tool they're using (something called TIBCO) and sending what comes out. We'll see what happens next.
    • Hmm, I am neither an XML nor an encoding wizard, so I wonder why this works without error (the XML file has a UTF-8 encoding declaration, and characters above ASCII 127 are not UTF-8-encoded):

      use XML::LibXML;

      my $file = "file.xml";
      # The :encoding layer decodes the bytes as Latin-1 as they are read
      open(my $fh, "<:encoding(iso-8859-1)", $file) or die "Error: $!";

      my $p = XML::LibXML->new();
      • Because the file is Latin-1-encoded, and if you open it like that, then Perl will decode from Latin-1 to characters as it reads the file, so libxml2 will never actually see Latin-1.

    • Look at using iconv (or its Perl equivalent, piconv) or perhaps recode.

      But to be quite honest, if it's not well formed, send it back and tell them to sort it out. If it's not well formed, it's not XML.

      • Oh yeah, I know it would be easy to fix on this side if I had to, but I did manage to get correctly encoded files from them. Their UTF-8 encoding is broken, but ISO-8859-1 and US-ASCII work.
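For anyone who does end up fixing such files on the receiving side, the decode-then-encode step discussed in the comments above can be sketched with the core Encode module. This is only an illustration of the idea behind tools like iconv or piconv, not what any of the posters actually ran; the helper name `latin1_to_utf8` and the sample string are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a byte string from ISO-8859-1 to UTF-8.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

# Hypothetical sample: e-diaeresis (0xEB) and A-diaeresis (0xC4) as Latin-1 bytes.
my $latin1 = "No\xEBl \xC4";
my $utf8   = latin1_to_utf8($latin1);
printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
```

Each accented character grows from one byte to two, which is why the same data is valid under an ISO-8859-1 declaration but malformed under a UTF-8 one. After converting a whole file this way, the encoding named in the XML declaration would also have to match the new bytes.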