
runrig (3385)

runrig
  dougwNO@SPAMcpan.org

Just another perl hacker somewhere near Disneyland

I have this homenode [perlmonks.org] of little consequence on Perl Monks [perlmonks.org] that you probably have no interest in whatsoever.

I also have some modules [cpan.org] on CPAN [cpan.org] some of which are marginally [cpan.org] more [cpan.org] useful [cpan.org] than others.

Journal of runrig (3385)

Wednesday July 25, 2007
11:53 AM

More XML Sorting

[ #33883 ]
A few days ago I thought I had it all nailed, sorting the elements, but then I noticed this pattern in the XML (seems like a badly designed schema, oh well):

<a NAME="A">
<b NAME="B">
   <c NAME="FOO"/>
   <d NAME="FOO"/>
   <c NAME="BAR"/>
   <d NAME="BAR"/>
</b>
</a>

In one doc, the FOOs came first, and in the other, the BARs came first. XML::Filter::Sort handles neither sorting non-contiguous elements nor sorting by element name. I briefly looked into patching it to handle that case, but decided against bloating a nice, simple module API.

I remembered that XSLT could sort, so I started looking into that, and tried the first thing that comes up when you search CPAN for XSLT: XML::XSLT. No matter what I tried, though, nothing worked. Then as a last resort I read the fine docs and saw that "sort" was not yet implemented (update: also, it seemed to be applying templates from the bottom up -- see the XSLT below).

Then I tried XML::Filter::XSLT, which is based on XML::LibXSLT (and libxslt) and which I had much higher hopes for. I had trouble at first getting it to sort by element name while preserving the attributes (I could never quite find a full example of that), but finally came up with this:

use XML::Filter::XSLT;

# The stylesheet is passed inline as a string via the Source parameter
my $sorter = XML::Filter::XSLT->new(Source => {String => <<'EOT'});
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

<xsl:output method="xml" version="1.0"
encoding="iso-8859-1" indent="yes" omit-xml-declaration="no"/>

<xsl:template match="/">
  <xsl:apply-templates select="@*|node()"/>
</xsl:template>

<xsl:template match="/a/b">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:apply-templates>
      <xsl:sort select="name()"/>
      <xsl:sort select="@NAME"/>
    </xsl:apply-templates>
  </xsl:copy>
</xsl:template>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
EOT
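
To make the two sort keys concrete, here is a minimal sketch in plain Perl of the ordering that the `/a/b` template asks for: element name first, then the NAME attribute. The `@children` list is a hand-written stand-in for the children of `<b>` in the sample document, not something read from a parser, and `cmp` is only an approximation of XSLT's text sort, whose collation can depend on the processor.

```perl
use strict;
use warnings;

# Children of <b> from the sample document, in document order.
# Each entry is [element name, NAME attribute]. This list is
# illustrative data typed in by hand, not parser output.
my @children = (
    [ c => 'FOO' ],
    [ d => 'FOO' ],
    [ c => 'BAR' ],
    [ d => 'BAR' ],
);

# Primary key: element name, like <xsl:sort select="name()"/>.
# Secondary key: NAME attribute, like <xsl:sort select="@NAME"/>.
my @sorted = sort {
    $a->[0] cmp $b->[0]
        or
    $a->[1] cmp $b->[1]
} @children;

print qq{<$_->[0] NAME="$_->[1]"/>\n} for @sorted;
```

With both keys applied, the two documents should end up with the same child order no matter whether the FOOs or the BARs came first in the input.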

One interesting effect was that the encoded characters in the attributes (&#xA;, &#xD;, and &#x9;) were now coming out as unencoded characters, where previously they were just coming out as spaces. I'm still not sure what the best way would be to preserve the encoding, if I happened to care about preserving them...
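
One possible approach, offered only as a sketch: an XML parser resolves character references in attribute values to the real characters, and literal tabs and line breaks in attribute values are normalized to spaces when the document is parsed again. So if the characters need to survive a round trip, whatever writes the attribute has to emit them as character references. The helper below, `escape_attr`, is a hypothetical name for illustration and is not part of XML::Filter::XSLT or any other module mentioned here; where it would hook into a SAX pipeline is left open.

```perl
use strict;
use warnings;

# Whitespace characters that only survive re-parsing of an
# attribute value when written as character references.
my %char_ref = (
    "\t" => '&#x9;',
    "\n" => '&#xA;',
    "\r" => '&#xD;',
);

# Hypothetical helper: escape a string for use inside a
# double-quoted XML attribute value.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;      # must come first, before references are added
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/([\t\n\r])/$char_ref{$1}/g;
    return $value;
}

print escape_attr("line one\nline two\tend"), "\n";
```

Escaping `&` before inserting the references keeps the new `&#x...;` sequences from being escaped a second time.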

  • XML parsers do that entity parsing. Some people will say that it doesn't matter whether the XML contains &amp; or &, because your text isn't actually an object in the physical universe, but rather an abstract representation of a Platonic Unicode document. We dirty Perl programmers get to put up with that crap, too [perl.org].