toma (3098)

http://tomacorp.com/

Journal of toma (3098)

Wednesday December 25, 2002
10:09 PM

Order of tags in XML files

[ #9629 ]
One of the things I look for in an application that stores its data in XML is whether the application fails when the order of the tags in the XML is changed. Some applications that use XML don't even seem to look at the names inside the tags. They just throw out everything between < and > and depend on the order in the file to determine the meaning of the data. I sure don't want this to happen to any of my applications!

I've been using XML as a file format to represent CAD data. I translate from a CAD vendor's proprietary format into XML, do something to the data, then write it back out in the proprietary format. The proprietary format is order-dependent, so I want to make sure that if I happen to reorder the XML tags along the way, I don't create invalid data when I write it back out.

I can imagine a few ways to handle this, and as usual there is a speed/memory/program-complexity tradeoff.

I am using the most excellent XML::Twig module, working from the online tutorial and the O'Reilly book, Perl & XML. One thing the docs are a little thin on is examples of using the XPath capabilities of XML::Twig. Here is an example that I made:

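use strict;
use warnings;
use XML::Twig;

my $fn = 'traces_small.xml';

my ($tag, $att, $value) = ('NET', 'name', '/P5C');

# Patterns of increasing specificity: a path, a bare tag name,
# a tag that has an attribute, and a tag with a specific attribute value.
my @example;
push @example, 'TRACES/STFIRST';
push @example, sprintf('%s', $tag);
push @example, sprintf('%s[@%s]', $tag, $att);
push @example, sprintf('%s[@%s="%s"]', $tag, $att, $value);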

# First pass: use each pattern as a twig root, so only matching
# subtrees are built, then print what was kept.
foreach my $pattern (@example)
{
  print "Matching XPath expression $pattern\n";
  print "TwigRoots\n";
  my $xml = XML::Twig->new(
    TwigRoots => {$pattern => 1},
    error_context => 1,
  );

  $xml->set_pretty_print('indented');
  $xml->parsefile($fn);
  $xml->print;
  print "--------------------\n";
}

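# Second pass: fire a handler on each matching start tag and
# print the tag text as it appears in the file.
foreach my $pattern (@example)
{
  print "Matching XPath expression $pattern\n";
  print "start_tag_handlers, original_string\n";
  my $xml = XML::Twig->new(
    start_tag_handlers => { $pattern =>
      sub
      {
        # $_[0] is the twig; original_string is the unprocessed text of the tag
        print $_[0]->original_string, "\n";
      }
    },
    error_context => 1,
  );
  $xml->parsefile($fn);
  print "--------------------\n";
}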

Now I want to make a module to create my CAD data in a certain order, independent of the order of my input data.

Approach 1
Put tags on the data that say what order they should be in.
I don't like this approach because I want my XML format to work for different CAD tools that have different requirements for the order of their data. One of the main purposes of the XML format is to have it be CAD tool independent. So order properties are right out.

Approach 2
One pass per section
In this approach, I would parse the XML file as many times as needed, each time printing only the next section of the CAD data. Example: if it were HTML, I might have one pass for the header and one pass for the body. This is easy to code and takes a minimal amount of memory, but it is CPU intensive, because the whole file gets parsed once per section.
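
Here is a rough sketch of what the multi-pass idea might look like with XML::Twig. The document, the section names (header, nets, traces) and the variable names are all made up for illustration, and it parses a string instead of a file to stay self-contained. Each pass builds only the subtree for one section and throws it away afterwards.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input, with sections deliberately out of output order.
my $xml = '<design><traces>t1</traces><header>v1</header><nets>/P5C</nets></design>';

# Hypothetical output order required by one particular CAD tool.
my @section_order = qw(header nets traces);

my @output;
for my $section (@section_order) {
    # One full parse per section; only the matching subtree is built.
    my $twig = XML::Twig->new(
        twig_roots => {
            $section => sub {
                my ($t, $elt) = @_;
                push @output, $elt->sprint;   # serialize this section
                $t->purge;                    # free what has been parsed so far
            },
        },
    );
    $twig->parse($xml);
}

print "$_\n" for @output;
```

With a real file, parse would become parsefile, and the push would be a print to the output file.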

Approach 3
One pass, store data in an array
Here I would store the data into elements of an array. At the end of parsing the array would be printed out to the file, and the order of the elements in the array would take care of the ordering of the output file. In the HTML example, $array[0] would hold the header and $array[1] would hold the body. This approach is memory intensive, since I have to store all the CAD data in an array.
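
A minimal sketch of the array idea, using only core Perl. The section names, the record contents and the reorder_sections helper are assumptions for illustration; in the real thing the records would be handed over by the XML::Twig handlers as the file is parsed.

```perl
use strict;
use warnings;

# Hypothetical output order for one CAD tool; another tool would supply its own list.
my @section_order = qw(header nets traces);

# Map each section name to its slot in the output array.
my %slot_for;
@slot_for{@section_order} = (0 .. $#section_order);

# Take [section, text] records in any order and return the text
# grouped by section, in @section_order order.
sub reorder_sections {
    my @records = @_;
    my @slots = map { [] } @section_order;
    for my $record (@records) {
        my ($section, $text) = @$record;
        die "unknown section '$section'\n" unless exists $slot_for{$section};
        push @{ $slots[ $slot_for{$section} ] }, $text;
    }
    return map { @$_ } @slots;
}

# Stand-in for data arriving from the parser, deliberately out of order.
my @output = reorder_sections(
    [ traces => 'TRACE t1' ],
    [ header => 'HEADER v1' ],
    [ nets   => 'NET /P5C' ],
    [ traces => 'TRACE t2' ],
);

print "$_\n" for @output;
```

Keeping the order in a plain list, outside the data, means the XML itself carries no ordering information, so a different CAD tool only needs a different @section_order.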

Approach 4
One pass, store the data in an array of files
This approach is like the array, except each element of the array is an open file handle, and the data gets printed to the different files. At the end of the processing the files are appended to each other. This approach is file-handle intensive, and is probably not a good idea when there are, say, 100 or so file handles open at a time. It tends to run into the per-process limit on the number of open files.
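
A sketch of the file-per-section variant, assuming the same made-up section names as a stand-in for real CAD sections. It uses File::Temp from the core distribution for the scratch files, and the spool_record helper is hypothetical. The scratch files are concatenated into a string here to keep the example short; a real version would print to the output file instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical output order for one CAD tool.
my @section_order = qw(header nets traces);

# One scratch file per section, all open at once. This is what makes
# the approach file-handle intensive.
my %handle_for;
for my $section (@section_order) {
    my $fh = tempfile();   # scalar context: returns only the read/write handle
    $handle_for{$section} = $fh;
}

# Append one record to the scratch file for its section.
sub spool_record {
    my ($section, $text) = @_;
    my $fh = $handle_for{$section} or die "unknown section '$section'\n";
    print {$fh} "$text\n";
}

# Stand-in for data arriving from the parser, deliberately out of order.
spool_record(@$_) for (
    [ traces => 'TRACE t1' ],
    [ header => 'HEADER v1' ],
    [ nets   => 'NET /P5C' ],
    [ traces => 'TRACE t2' ],
);

# Rewind each scratch file and append its contents in output order.
my $combined = '';
for my $section (@section_order) {
    my $fh = $handle_for{$section};
    seek($fh, 0, 0) or die "seek failed: $!";
    while (my $line = <$fh>) {
        $combined .= $line;
    }
    close $fh;
}

print $combined;
```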

Since the trend these days is to throw RAM at problems, I'll try approach 3 first, and perhaps have an option in my code to use approach 4 or possibly 2.
