All the Perl that's Practical to Extract and Report

use Perl

darobin (1316)

http://berjon.com/

Journal of darobin (1316)

Sunday August 11, 2002
11:21 AM

New job

[ #7036 ]

I started working at my new job last Monday. I can't say that it's been a hectic week: this being August, most of France is on holiday. Also, my boss is in Corsica right now, so I don't really have clear short-term assignments.

However I have spent a lot of time learning about all the technologies that are used there, especially our main product which is very cool. It's a system that takes a schema for an XML vocabulary and generates specific encoders and decoders for it using automata. Thanks to that we have XML data that beats gzip'd XML by a fair margin in size, in decoder space, in memory usage, and in speed[1].

Key (buzz)words here are MPEG-7, MPEG-21, XML Schema, RDF, SVG, and BiM. At least half of those I didn't need before, so I'm busy sharpening my knowledge of them. It's amazing how things almost instantly appear in a far better light when you have a solid use case for them. I've always hated XML Schema for many reasons, but after looking at what Expway does it seems much less hateable (mind you, the language of the spec is still beyond human understanding, and it does have very severe faults). I'm also finding out that Xerces 2.0 is a very high-quality tool, and I hope that the Perl binding to it soon implements most if not all of its interfaces.

Fun thing of the week relating to XML Schema: on my first day -- in fact first hour -- of using them seriously (as opposed to just fooling around to see what they're made of) I found a bug in the schema published by the W3C for the two attributes defined in the XML spec! Better still, it's not even a schema bug: the document itself isn't well-formed (it binds the XML namespace to a forbidden prefix)! Now if the XML WG can't use XML properly...

There's more to say, but I'm lazy :)

[1] The principle is simple here: because those are specific {en,de}coders instead of generic ones, they perform much better. XML structure is often very predictable, especially if you have a schema handy. If you know that A must contain B and C in that order, you only need to encode the fact that you have an A to encode it all, etc. Of course, the less structure you have, the less efficient it is.
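
As a rough illustration of the footnote's idea, here is a minimal Python sketch of a schema-aware encoder and decoder. It is a toy under stated assumptions, not the actual Expway or BiM format: the SCHEMA table, the "seq"/"choice"/"leaf" content models, and the encode/decode functions are all hypothetical names made up for this example. A mandatory sequence costs zero bits because the decoder already knows what must follow; only a genuine choice costs anything.

```python
from math import ceil, log2

# Hypothetical content models (an assumption for this sketch):
#   ("seq", [...])    every child appears once, in this order
#   ("choice", [...]) exactly one of the alternatives appears
#   ("leaf",)         no element children
SCHEMA = {
    "A": ("seq", ["B", "C"]),
    "B": ("leaf",),
    "C": ("choice", ["D", "E", "F"]),
    "D": ("leaf",),
    "E": ("leaf",),
    "F": ("leaf",),
}


def choice_width(alternatives):
    """Bits needed to pick one alternative (0 if there is only one)."""
    return ceil(log2(len(alternatives))) if len(alternatives) > 1 else 0


def encode(tree):
    """Encode tree = (name, [child trees]) as a string of '0'/'1'."""
    name, children = tree
    model = SCHEMA[name]
    if model[0] == "leaf":
        return ""
    if model[0] == "seq":
        # Fully predictable structure: nothing is emitted for the sequence.
        return "".join(encode(child) for child in children)
    # Choice: emit the index of the alternative that is present.
    alternatives = model[1]
    child = children[0]
    width = choice_width(alternatives)
    index = alternatives.index(child[0])
    prefix = format(index, "b").zfill(width) if width else ""
    return prefix + encode(child)


def decode(name, bits, pos=0):
    """Rebuild the tree for element `name`; returns (tree, next position)."""
    model = SCHEMA[name]
    if model[0] == "leaf":
        return (name, []), pos
    if model[0] == "seq":
        children = []
        for child_name in model[1]:
            child, pos = decode(child_name, bits, pos)
            children.append(child)
        return (name, children), pos
    alternatives = model[1]
    width = choice_width(alternatives)
    index = int(bits[pos:pos + width], 2) if width else 0
    child, pos = decode(alternatives[index], bits, pos + width)
    return (name, [child]), pos


# <A><B/><C><E/></C></A>
doc = ("A", [("B", []), ("C", [("E", [])])])
bits = encode(doc)
decoded, _ = decode("A", bits)
```

In this toy, the whole document comes out as two bits (the choice inside C), and the decoder rebuilds the same tree from the schema plus those bits. A real system also has to carry text content, attributes, optional and repeated elements, none of which are modelled here.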

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • So, it sounds like you're using XML Schema as a form of BNF for (XML) documents, and generating code based on that grammar. Is that correct?

    If so, then in what ways are you getting better performance than gzip'ed XML? Parsing? Serialization? On-disk representation (using a shorthand vocabulary)? Memory consumption? Inquiring minds want to know. :-)

    Also, if you can discuss it, what kind of XML documents are you dealing with? Are they document-centric or data-centric? I would imagine there's a lot

    • Sorry for the hand-waving, I'm not yet fully familiarised with what in that technology is public and what will become public in the year to come :) Part of my job will be to see how we can open source some things, but I need to know a lot more about what falls into what category to be certain I'm not violating NDAs. But I'll try to give you as much satisfaction as possible here :)

      If you don't mind, I'll start with the end: document vs data centric results. When it comes to compression using this te

      -- Robin Berjon [berjon.com]

      • And the decoding includes both uncompression and reading of the XML in a single step, because those two things are really one.

        OK. If I understand you, then you're repeating what the WML group tried to do in the late '90s, but in a much better and more general fashion. It sounds like what you're producing isn't XML, but a compressed bytestream that can be treated as uncompressed, [schema-]valid XML with the appropriate auto-generated automata.

        Neat. :-)

        FWIW, at the March 1998 XML Conference in Seat

        • Yes, that's precisely it, with the difference that the MPEG folks learnt from the mistakes in WML ;) The compressed bytestream can indeed be treated as uncompressed because the decoder knows of the schema, and thus the infoset is completely there (for values of completely that match the needs of those apps, not the roundtrippability that, say, an XML editor would require). If the schema says that A must contain B followed by C, you only need to encode the presence of A, which can take as little as one bi

          -- Robin Berjon [berjon.com]