
Journal of toma (3098)

Tuesday January 21, 2003
02:57 AM

Templates for programs and modules

I have released two new Perl template generators. These programs create the basic structure of a new module or program, which saves typing and makes it easier to get started.

The program for modules is similar to the results of h2xs -aXn, but it does not create Makefile.PL or the rest of the nice framework created by h2xs.

I use my program in conjunction with h2xs, replacing the file module_name.pm with the output from my program.

I welcome comments on the code and ideas for enhancements.

I wrote these while procrastinating on developing a new module. I want to:

  • Improve my SQL by practicing on SQLite.
  • Translate some of the XML structures that I have been working on into an SQL schema.
  • Compare the approaches of using XPath queries on an XML dataset with a relational database approach.

My shiny new templates will make this at least 1% easier :-).

I plan on doing schema development with the SQLite Perl module, then porting the work to a big-iron server. I like to minimize my work on the big machine, except for developing queries. The nice part about using DBI is that I can develop the schema on SQLite and then port it easily. I'm looking forward to seeing how well this approach works!
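
The portability idea can be sketched roughly as follows; the database file, table, and column names are all made up for illustration, and the only line that should need to change on the big machine is the DBI connect string:

```perl
use strict;
use warnings;
use DBI;

# Develop the schema locally against SQLite; on the big-iron server,
# only the connect string (and any vendor-specific SQL) should change.
my $dbh = DBI->connect('dbi:SQLite:dbname=schema_dev.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE net (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
});

my $sth = $dbh->prepare('INSERT INTO net (name) VALUES (?)');
$sth->execute($_) for qw(/P5C /GND /VCC);

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM net');
print "nets: $count\n";

$dbh->disconnect;
```

Queries that need vendor-specific SQL can then be developed on the big machine while everything else stays put.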

It should work perfectly the first time! - toma

Thursday January 16, 2003
03:34 AM

XML::Twig and XML::Filter::Dispatcher tie in speed

A performance comparison between the SAX-based XML::Filter::Dispatcher and XML::Twig, with test data, is available.

Previous measurements were thrown out due to pilot error, and the new results reveal that the speed of the two modules is nearly identical in my application.

I learned a lot in the process, and I will be writing more about this topic in the future.

Monday January 13, 2003
12:12 AM

Comparing XML::Twig and XML::Filter::Dispatcher

Comparing Twig and Dispatcher
I rewrote my XML::Twig program to use XML::Filter::Dispatcher in order to compare the approaches. I compared the simplicity of the code necessary to do the job, and the speed of execution.

The result was that XML::Twig ran 17 times faster, which surprised me.

The Dispatcher code was cleaner than the Twig code, because I was able to remove the code I had written to get my Twig return values to come out in the correct order. The order of the data from Dispatcher worked the way that I had originally hoped Twig would work.

The speed is a big deal for me, because the Twig code is actually already slower than I would like it to be. The Dispatcher code is probably not fast enough for my application. I'm tempted to write the code again and use a format other than XML to see how fast it runs.

It would be nice to have a program that automatically measures the complexity of a Perl program, so that I could compare the complexity of the implementations numerically.

If anyone wants to see the two approaches and the test data, let me know and I'll post it on tomacorp (We're not a corporation).

New Module Testing
I installed and tried PerlBean, which looks useful for automating the generation of Perl objects. Before I use it in a real project, I need to understand whether there is a way to use it so that the classes can be redesigned without losing work. With the straightforward approach, it looks like you would have to edit the class by hand after the initial run of the module, and if you wanted to run it again you would have to cut and paste the custom methods back in.

Perhaps there is a way around this. PerlBean would make a good core for a perl IDE, I think.

I sent a bug report to the author of PerlBean. It looks like the tutorial didn't get an update after an API change.

Saturday January 04, 2003
01:41 PM

A new XML book and more XML modules

XML and Perl
Received and read part of "XML and Perl" (New Riders). This isn't a book review; I'll leave that to people who understand the subject better than I do. These are just my notes!

The book is useful but does not provide much new info for me. It spells a few things out clearly that are otherwise hard to figure out, with line-by-line code walkthrough.

Here are a few of the gaps:

  1. It says that XSLT can be used to transform XML into CSV and other non-tagged formats, but the example code just shows the usual XML to HTML translation.
  2. The SAX code has a bunch of case statements in the handlers for which kind of tag is being processed. I would prefer to see this coded another way with SAX. I try to avoid big case statements.
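
One way around the big case statements (my sketch, not the book's code) is a dispatch table: a hash of code refs keyed by element name inside the SAX start_element handler. The element names here are invented:

```perl
use strict;
use warnings;

package MyHandler;
use base qw(XML::SAX::Base);

# Dispatch on element name through a hash of code refs,
# instead of a long if/elsif or case chain.
my %on_start = (
    NET   => sub { my ($attrs) = @_; print "net: $attrs->{name}\n" },
    TRACE => sub { print "trace\n" },
);

sub start_element {
    my ($self, $el) = @_;
    my $handler = $on_start{ $el->{Name} } or return;   # ignore unknown tags
    # Flatten the SAX attribute structure to plain name => value pairs
    my %attrs = map { $_->{LocalName} => $_->{Value} }
                values %{ $el->{Attributes} || {} };
    $handler->(\%attrs);
}

1;
```

Adding a new tag then means adding one hash entry rather than another branch.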

I liked the section on XML Schemas, since I didn't know anything about them. I can't say that the book helped with today's particular problem of interest.

Trying SAX Modules
Installed Pod::SAX with cpan. This has many dependencies, including XML::SAX::Writer. It installed okay, but the Pod documentation for the functions is missing. This style of code doesn't look like the kind of code that I like to write.

Installed XML::Generator::DBI with cpan. It failed tests looking for DBD/Pg.pm; some manual configuration needs to be done. I didn't pursue this further because I have no database on this machine. The code is interesting to read, though, both as an example of using XML::Handler::YAWriter and as a nifty, flexible DBI query.

Other activity
I installed psh and fooled around with it. It looks like fun, but possibly dangerous, since I don't know what I'm doing. My shell needs to be very reliable, e.g. for rm commands.

I installed File::List, and wrote an example program using File::Flat with it. I posted this snippet as an answer to a question at perlmonks.

Wednesday January 01, 2003
05:10 AM

XML: SAX and Twig, also TWiki

XML: SAX and Twig
I have been reading about XML::SAX, and it is starting to make sense. I am concerned about memory usage as compared to XML::Twig. I need to process an XML file that is at least 10MB and possibly as large as 100MB. I would like to limit the RAM usage to less than 256MB, although 512MB might be okay. I have a hard limit of 2GB of RAM, since I am using 32-bit perl. It looks like XML::Twig can be set up to work with SAX, so this might help to solve the memory problem.
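
The usual XML::Twig trick for bounding memory is to handle one record at a time and then purge what has already been parsed, so RAM stays roughly proportional to the largest record rather than the whole file. A minimal sketch, where the NET tag and the file name are assumptions about the data:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        NET => sub {
            my ($t, $net) = @_;
            print $net->att('name'), "\n";  # process one record...
            $t->purge;                      # ...then free everything parsed so far
        },
    },
);
$twig->parsefile('traces_small.xml');
```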

These file sizes are spooky. I think of XML as one-big-happy-file-that-describes-a-thing. Perhaps "the-thing" is too complicated for a single file. If so, I will need a new approach. I may need to learn about namespaces or some other way to partition a large XML dataset.

I thought up a way to eliminate the redundancy in the XML reader/writer for my flat/lumpy files. I can have a data structure that specifies the flat file in XML. Redundant portions of the XML reader and writer can be generated from this file.

It would be nice if someone had already written this. There are many tradeoffs in the design of such a thing, and I don't want to get bogged down in it. I will look at some of the SAX drivers for non-XML data sources.

I think removing reader and writer redundancy will be worthwhile, since I have at least a dozen and perhaps thirty of these file formats to translate to and from XML. As my buddy Steve says, "Make things that are the same the same and things that are different different."
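
The single-description idea might look something like this sketch, where one spec hash holds the token and field order and both the reader and the writer are driven by it (the PT record and its fields are invented):

```perl
use strict;
use warnings;

# One data structure describes each record type; reader and writer are
# both generated from it, so the field order lives in exactly one place.
my %spec = (
    PT => [qw(x y layer)],
);

sub line_to_record {
    my ($line) = @_;
    my ($token, @values) = split ' ', $line;
    my $fields = $spec{$token} or return;
    my %rec = (_tag => $token);
    @rec{@$fields} = @values;     # hash slice pairs fields with values
    return \%rec;
}

sub record_to_line {
    my ($rec) = @_;
    my $fields = $spec{ $rec->{_tag} };
    return join ' ', $rec->{_tag}, map { $rec->{$_} } @$fields;
}

my $rec = line_to_record('PT 10 20 top');
print record_to_line($rec), "\n";   # round-trips to "PT 10 20 top"
```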

TWiki
One of the things I like about PerlMonks is that I get new ideas that have nothing to do with what I am working on. Today, for example, I downloaded, built, and ran TWiki. Suddenly I get it and I hope that I will be using TWiki for something that will be useful and yet disruptive. At work there is a large dataset of free-text startup content, which is duct-taped to the side of an exquisitely normalized database. This text is the output from an extensive ongoing collaboration. It looks like a great opportunity for a wiki.

The main challenge will be scalability. I plan on evaluating this within the next few months.

New Modules
I am still trying to get TWiki working for creating new users. I didn't have any email set up on the machine where I was running TWiki, and that seemed to be a problem. I got the email working, but I still have the same problem. I rebuilt perl 5.8.0 in the process, and updated a bunch of modules as recommended by the results of running the r command in the cpan shell.

Thursday December 26, 2002
04:32 PM

Flat file to XML round trip via XML::Writer and XML::Twig

I wrote a translator (described yesterday) that converts a particular flat file format to XML, and another from the XML format back into the flat file.

The flat file has a line-at-a-time format with the first token on a line determining the type of the data on the line. The lines are in a hierarchical data structure, with various first-position tokens specifying the hierarchy.

I made an object that contained an XML::Writer object and a hash of anonymous subs, where the key to the hash is the first-position token. The code in the anonymous subs parsed the line of the flat file, and then sent this data to XML::Writer to create the XML-formatted text. I used four types of calls to XML::Writer: emptyTag, startTag, endTag, and within_element.

The emptyTag calls were easiest. No hierarchy, just a single tag with parameters.

The startTag calls open up a section of hierarchy. This is also easy.

The endTag calls were slightly trickier. My code could detect where a piece of hierarchy was supposed to end. To remember what kind of closing tag is needed, the within_element call detects if a particular tag has been opened. This approach wouldn't work for multiple levels of hierarchy, but this format doesn't have that. Other tools with different formats may have this requirement, so this may need to be revisited someday.
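
Putting the pieces together, here is a small self-contained sketch of the token-dispatch idea using all four calls; the tokens (NET, PT, END) and their fields are invented, not the real format:

```perl
use strict;
use warnings;
use XML::Writer;

# Writes to STDOUT by default; DATA_MODE adds newlines and indentation.
my $w = XML::Writer->new(DATA_MODE => 1, DATA_INDENT => 2);

# Hash of anonymous subs keyed by the first-position token of each line.
my %emit = (
    NET => sub { $w->startTag('NET', name => $_[0]) },
    PT  => sub { $w->emptyTag('PT', x => $_[0], y => $_[1]) },
    END => sub { $w->endTag('NET') if $w->within_element('NET') },
);

while (my $line = <DATA>) {
    chomp $line;
    my ($token, @fields) = split ' ', $line;
    $emit{$token}->(@fields) if exists $emit{$token};
}
$w->end;

__DATA__
NET /P5C
PT 10 20
PT 30 40
END
```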

Any good translator should make a lossless round-trip with the data, unlike babelfish. I used XML::Twig to process the data and recreate the flat file. I used a hash of TwigHandlers, which called separate subs for each type of tag. I noticed that there is symmetry in the code with the parser and the writer of the data, particularly in the code that has to read the flat file and understand the order of the fields. This same ordering is needed to take the XML field values and put them into the flat file. I was not able to take advantage of this symmetry, so I ended up with code that I feel could be improved somehow. I also ended up with the fields being described in the module documentation, so now I have the order in three places instead of one. Darn!

I used the XPath approach to parse the XML. I had the problem that the flat file data was not available until the closing tags were parsed, so things tended to come out in an order reminiscent of reverse polish notation. I used some local variables to store things so that they could be written out in the correct order once the closing tag was detected. This is analogous and possibly symmetric with the endTag manipulations in the XML writer. Once again, it will cause problems when deeper hierarchy is needed and is an opportunity for removal of redundancy in the code.

The biggest challenge in this project was determining the proper type of calls to use in XML::Twig. There are many to choose from! XML::Writer was much easier. This follows the general principle that it is easier to transmit than to receive.

New Modules and other activities
Installed Spreadsheet::WriteExcel with cpan. The install was okay. I tried a test program from the previous version (0.39); it broke compatibility with gnumeric, so I reported the problem to jmcnamara with a message on perlmonks. I hope he fixes it; I really like both WriteExcel and gnumeric.

Installed Math::SnapTo with cpan. The install was okay, except that I got an old version, so I reinstalled by hand, which worked fine. I tried a bunch of test cases; I wouldn't use this module - it seems to have many bugs.

Posted on problems with a new snippet. Noted that the root cause of the rounding problems was typing lots of digits of pi instead of using 4*atan2(1,1).
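
The trick, for reference: atan2(1,1) is exactly pi/4, so this yields pi at full machine precision with no digits to mistype:

```perl
my $pi = 4 * atan2(1, 1);
printf "%.15f\n", $pi;   # 3.141592653589793
```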

Wednesday December 25, 2002
10:09 PM

Order of tags in XML files

One of the things I look for in an application that stores its data in XML is whether the application fails if the ordering of the tags in the XML is changed. Some applications that use an XML format don't even seem to look at the values inside the tags; they just throw out everything between < and > and depend on the order in the file to determine the meaning of the data. I sure don't want this to happen to any of my applications!

I've been using XML as a file format to represent CAD data. I translate from a proprietary format from a CAD vendor into XML, do something to the data, then write it back out in the proprietary format. I want to make sure that if I happen to reorder the XML tags in this process that I don't create invalid data when I write it back out, because the proprietary format has order-dependency.

I can imagine a few ways to handle this, and as usual there is a speed/memory/program-complexity tradeoff.

I am using the most excellent XML::Twig module, with the online tutorial and the O'Reilly book, Perl & XML. One thing the docs are a little thin on is examples of using the XPath capabilities of XML::Twig. Here is an example that I made:

use strict;
use warnings;
use XML::Twig;

my $fn = 'traces_small.xml';

my ($tag, $att, $value) = ('NET', 'name', '/P5C');

my @example;
push @example, 'TRACES/STFIRST';
push @example, $tag;
push @example, sprintf('%s[@%s]', $tag, $att);
push @example, sprintf('%s[@%s="%s"]', $tag, $att, $value);

foreach my $pattern (@example)
{
  print "Matching XPath expression $pattern\n";
  print "TwigRoots\n";
  my $xml = XML::Twig->new(
    TwigRoots     => { $pattern => 1 },
    error_context => 1,
  );

  $xml->set_pretty_print('indented');
  $xml->parsefile($fn);
  $xml->print;
  print "--------------------\n";
}

foreach my $pattern (@example)
{
  print "Matching XPath expression $pattern\n";
  print "start_tag_handlers, original_string\n";
  my $xml = XML::Twig->new(
    start_tag_handlers => {
      $pattern => sub { print $_[0]->original_string, "\n" },
    },
    error_context => 1,
  );
  $xml->parsefile($fn);
  print "--------------------\n";
}

Now I want to make a module to create my CAD data in a certain order, independent of the order of my input data.

Approach 1
Put tags on the data that say what order they should be in.
I don't like this approach because I want my XML format to work for different CAD tools that have different requirements for the order of their data. One of the main purposes of the XML format is to have it be CAD tool independent. So order properties are right out.

Approach 2
One pass per section
In this approach, I would parse the XML file as many times as needed, each time printing only the next section of the CAD data. Example: if it were HTML, I might have one pass for the header and one pass for the body. This is easy to code and takes a minimal amount of memory, but it is CPU intensive.

Approach 3
One pass, store data in an array
Here I would store the data into elements of an array. At the end of parsing the array would be printed out to the file, and the order of the elements in the array would take care of the ordering of the output file. In the HTML example, $array[0] would hold the header and $array[1] would hold the body. This approach is memory intensive, since I have to store all the CAD data in an array.

Approach 4
One pass, store the data in an array of files
This approach is like the array, except each element of the array is an open file handle, and the data gets printed to the different files. At the end of the processing the files are appended to each other. This approach is file-handle intensive, and is probably not a good idea when there are, say, 100 or so file handles open at a time. This type of approach tends to run into the kernel parameter for the maximum number of open files for a process.

Since the trend these days is to throw RAM at problems, I'll try approach 3 first, and perhaps have an option in my code to use approach 4 or possibly 2.
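
Approach 3 might be sketched like this with XML::Twig handlers that append to numbered buckets, which are printed in order at the end (the tags and the two-section split are invented for illustration):

```perl
use strict;
use warnings;
use XML::Twig;

my @section;   # $section[0] = header output, $section[1] = body output

my $twig = XML::Twig->new(
    twig_handlers => {
        HEADER => sub { push @{ $section[0] }, $_[1]->text;        $_[0]->purge },
        NET    => sub { push @{ $section[1] }, $_[1]->att('name'); $_[0]->purge },
    },
);
$twig->parsefile('traces_small.xml');

# Output order is fixed by the bucket order, not by the input order.
for my $bucket (@section) {
    print "$_\n" for @{ $bucket || [] };
}
```

Switching to approach 4 would mean replacing each array bucket with an open file handle and appending the files at the end.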

02:27 AM

Metadata for toma's journal

I've had one person request that I keep a journal, and I've often wanted to keep one, so I figure I have two potential readers so far.

I've often posted on perlmonks, but I haven't revealed much of what I actually do with perl. So a journal should be handy. I tend to write journal fragments throughout my Mandrake 8.2 system, recording my adventures in coding. I typically investigate about five perl modules per month, and one major open-source package. I'm going to take a shot at recording my adventures here.

Other activity can be found on tomacorp (We're not a corporation), which is mine.

At work I use perl quite a bit. I'm back in the CAD business, trying to make life better and more productive for a few hundred electrical engineers. I have switched between designing hardware and software over the past twenty-some years, and I have recently switched back to software.

At home I use perl as a hobby and I have also been teaching it as a high school course to a very small class.