potyl (8582)

Journal of potyl (8582)

Tuesday October 07, 2008
04:39 PM

From YAML to XML through a DTD

[ #37618 ]

The Bratislava PM web site now has an RSS feed. The feed is currently generated from a custom-made YAML file that is transformed into RSS with XML::RSS. This approach is simple and quite flexible, but it has some quirks.

First, it's almost impossible to verify that the YAML file follows the expected template without writing our own validation. For instance, if a feed entry is missing the title, the link or the date, there's no built-in mechanism to inform us of these errors.

Secondly, the main content of each feed element is allowed to contain HTML. In fact, all the feed items that we have include HTML. Embedding HTML in a YAML file makes the input file less pleasant to work with, since it now holds two markup languages. Of course, one can argue that YAML Ain't Markup Language (tm); nevertheless, it is weird to embed HTML in YAML.

Finally, converting YAML to XML seems strange. YAML is mainly used to provide data structures, configuration files or data serialization. Using it for content manipulation might be pushing it too far.

For this particular context XML seems more appropriate. It can be validated through a DTD, an XML Schema or RELAX NG. HTML and XML can coexist without problems, especially if XHTML is used. And transforming an XML file into an RSS feed can easily be done through XSLT.

Using XML as the input file provides some interesting advantages. Thanks to a DTD, not only can we validate each feed entry in the input file, but we can also validate the HTML that's embedded in the feed's description.

By using some clever XML and DTD hacks it's possible to create a custom-made feed format that can be validated without too much effort. Let's assume that an RSS feed contains events and that each event has:

  • title
  • link
  • description (can contain HTML)
  • subject
  • creator
  • date
  • id
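To make the list concrete, a single entry in the input file could look like the sketch below. The ba prefix matches the DTD that follows; the namespace URI http://example.org/ba and every field value are placeholders chosen for illustration, not taken from the real feed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input file: the namespace URI and all values are placeholders -->
<ba:events xmlns:ba="http://example.org/ba">
  <ba:event>
    <ba:title>October meeting</ba:title>
    <ba:link>http://example.org/events/2008-10</ba:link>
    <!-- Plain text only at this stage; HTML in the description comes later -->
    <ba:description>Monthly meeting of the group.</ba:description>
    <ba:subject>meeting</ba:subject>
    <ba:creator>Example Author</ba:creator>
    <ba:date>2008-10-07T16:39:00+02:00</ba:date>
    <ba:id>2008-10-meeting</ba:id>
  </ba:event>
</ba:events>
```

The child elements appear in the same order as the list, since a DTD content model written with commas requires a fixed sequence.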

The following DTD describes and validates a feed input file:

<!ELEMENT ba:events      (ba:event*)>
<!ATTLIST ba:events
    xmlns:ba  CDATA   #FIXED ""
>

<!ELEMENT ba:event       (ba:title, ba:link, ba:description, ba:subject, ba:creator, ba:date, ba:id)>

<!ELEMENT ba:title       (#PCDATA)>
<!ELEMENT ba:link        (#PCDATA)>
<!ELEMENT ba:description (#PCDATA)>
<!ELEMENT ba:subject     (#PCDATA)>
<!ELEMENT ba:creator     (#PCDATA)>
<!ELEMENT ba:date        (#PCDATA)>
<!ELEMENT ba:id          (#PCDATA)>

Although this DTD can be used for simple feed elements, it has a problem: it doesn't allow any HTML inside the element ba:description, thus defeating the purpose of replacing YAML with XML. But all is not lost: this is easily fixed by importing the XHTML DTD into our DTD and by redefining the element ba:description so that it accepts any HTML tag that a div accepts:

<!-- Import the XHTML DTD -->
<!ENTITY % xhtml PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<!-- Reference the entity so that its declarations (including Flow) are pulled in -->
%xhtml;

<!ELEMENT ba:events      (ba:event*)>
<!ATTLIST ba:events
    xmlns:ba  CDATA   #FIXED ""
>

<!ELEMENT ba:event       (ba:title, ba:link, ba:description, ba:subject, ba:creator, ba:date, ba:id)>

<!ELEMENT ba:title       (#PCDATA)>
<!ELEMENT ba:link        (#PCDATA)>
<!ELEMENT ba:subject     (#PCDATA)>
<!ELEMENT ba:creator     (#PCDATA)>
<!ELEMENT ba:date        (#PCDATA)>
<!ELEMENT ba:id          (#PCDATA)>

<!-- The definition of 'ba:description' is the same as a 'div' -->
<!ELEMENT ba:description %Flow;>
<!ATTLIST ba:description
    xmlns     CDATA   #FIXED  ""
>

Thanks to this new DTD, the element ba:description can include any HTML element that's allowed within a div element. The DTD performs the validation and ensures that only valid HTML is inside the element. For instance, adding the element body to ba:description will be rejected by the DTD: even though body is a valid HTML element, it's not allowed within a div.
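As a rough illustration, the two fragments below show a description that should pass validation and one that should not. The content is made up, and the fragments are meant to sit inside a ba:event rather than stand alone. http://www.w3.org/1999/xhtml is the standard XHTML namespace; the sketch assumes that this is the value the DTD fixes for xmlns.

```xml
<!-- Accepted: p, em, a and ul are all allowed inside a div -->
<ba:description xmlns="http://www.w3.org/1999/xhtml">
  <p>A talk about <em>XML validation</em>, see the
     <a href="http://example.org/slides">slides</a>.</p>
  <ul>
    <li>DTD basics</li>
    <li>Mixing in XHTML</li>
  </ul>
</ba:description>

<!-- Rejected: body is a valid XHTML element but cannot appear inside a div -->
<ba:description xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Not allowed here.</p></body>
</ba:description>
```

Because a DTD matches elements by their literal names, the XHTML children have to be written without a prefix, exactly as the XHTML DTD declares them.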

The element ba:description is declared in our DTD the same way that the element div is declared in the XHTML DTD. Furthermore, the element is allowed to set the default namespace to XHTML, which makes all child elements of ba:description belong to the XHTML namespace. This is very handy when processing the XML file later on.
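One way the namespace separation could pay off is in an XSLT stylesheet, which can address the ba elements and the XHTML children independently. The XSLT 1.0 sketch below is only an assumption about how such a transformation might look, not the stylesheet used for the real feed: the ba namespace URI, the channel metadata and the mapping of subject, creator and date to Dublin Core elements are all placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical stylesheet: the ba namespace URI must match the input file -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ba="http://example.org/ba"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="ba xhtml">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- The root element becomes the RSS channel; the metadata is placeholder text -->
  <xsl:template match="/ba:events">
    <rss version="2.0">
      <channel>
        <title>Example feed</title>
        <link>http://example.org/</link>
        <description>Placeholder channel description</description>
        <xsl:apply-templates select="ba:event"/>
      </channel>
    </rss>
  </xsl:template>

  <!-- Each event becomes one RSS item -->
  <xsl:template match="ba:event">
    <item>
      <title><xsl:value-of select="ba:title"/></title>
      <link><xsl:value-of select="ba:link"/></link>
      <guid isPermaLink="false"><xsl:value-of select="ba:id"/></guid>
      <dc:subject><xsl:value-of select="ba:subject"/></dc:subject>
      <dc:creator><xsl:value-of select="ba:creator"/></dc:creator>
      <dc:date><xsl:value-of select="ba:date"/></dc:date>
      <!-- Text of the first XHTML paragraph only; a real feed would
           probably want the complete markup, escaped -->
      <description>
        <xsl:value-of select="ba:description/xhtml:p[1]"/>
      </description>
    </item>
  </xsl:template>

</xsl:stylesheet>
```

The xhtml: prefix in the last select works only because the children of ba:description are in the XHTML namespace; without the default namespace declaration they would have to be matched as elements in no namespace.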

It's not difficult to see why the next version of the feed will be generated from an XML file: using XML is quite advantageous here.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Are you aware of Kwalify and Data::Rx, both schema languages for data structures?
    • No, I wasn't aware that they existed; thanks for the pointers. I went quickly over the documentation and they look quite nice.

      In our case, though, the main goal was to mix the validation of our own elements with the ones provided by XHTML. Using Kwalify or Data::Rx would require us to translate (or worse, rewrite) the XHTML DTD into some other language.

      Also, using XML and XHTML has some other advantages that I will describe in future posts.