All the Perl that's Practical to Extract and Report

use Perl

acme (189)

Leon Brocard (aka acme) is an orange-loving Perl eurohacker with many varied contributions to the Perl community, including the GraphViz module on the CPAN. YAPC::Europe was all his fault. He is still looking for a Perl Monger group he can start which begins with the letter 'D'.

Journal of acme (189)

Monday October 06, 2003
11:56 AM


[ #15090 ]
After a lot of discussion about metadata this weekend, I've been playing with RDF and CPAN. Taking all the distributions by authors beginning with 'L', and using DBD::SQLite and RDF::Simple, I now have a lot of triples. I've been adding some Dublin Core information, and I have lots of information yet to add. So who thinks this is a good idea?

<rdf:Description rdf:nodeID="acmeColour017">
    <dc:publisher rdf:resource=""/>
    <dc:identifier rdf:resource=""/>
    <dc:type rdf:resource=""/>
  • It looks like an awfully verbose way of saying some very simple things. And I expect that for it to be useful for users they'll need to do XML voodoo. Which is HARD. I just don't see the point of using an obfuscatory format like RDF/RSS/XML/whatever it's called this week, rather than (eg) the output from Data::Dumper or YAML. Maybe I'm missing something.
    • by hfb (74) on 2003.10.06 13:45 (#24688) Homepage Journal

      RDF may be more suitable and appropriate for aggregation of the various metadata files relating to a single distribution. Much of it will be primarily for PAUSE, for indexers like search.cpan, and for the various tools people already use, so users generally won't ever need to look at the raw metadata unless they really want to.

    • I should probably have explained this a little more. I got really confused and all negative about RDF until recently. The main problem is that it's all in XML and that scares everyone, but RDF is really all about triples: subject, predicate, object. It just so happens that the most common serialisation format at the moment is in XML.

      So an interesting triple would be "LBROCARD" "is the author of" "Acme-Buffy-1.2". Or, in the RDF fragment about Acme-Buffy-1.2: "<cpan:id>LBROCARD</cpan:id>". Noti
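To make the subject/predicate/object idea concrete, here is a minimal sketch that holds a few triples as plain tuples and filters them by pattern. The predicate names `cpan:id` and `cpan:version` are borrowed from fragments elsewhere in this thread; the `match` helper and the tuple layout are assumptions made for illustration, not part of any RDF library.

```python
# Triples stored as (subject, predicate, object) tuples.
# Predicate names are illustrative; no agreed CPAN vocabulary is implied.
triples = [
    ("Acme-Buffy-1.2", "cpan:id", "LBROCARD"),
    ("Acme-Colour-0.17", "cpan:id", "LBROCARD"),
    ("Acme-Colour-0.17", "cpan:version", "0.17"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "Which distributions have LBROCARD as their cpan:id?"
dists_by_lbrocard = [
    s for (s, _, _) in match(triples, predicate="cpan:id", obj="LBROCARD")
]
print(dists_by_lbrocard)
```

The serialisation (XML, N3, YAML) is irrelevant at this level: any format that can be read back into such tuples carries the same assertions.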

      • You still have to make guesses about what a <cpan:id> is, surely? At some point, a human has to decide that LBROCARD is the person who wrote Acme::Buffy, and that it's not some other random identifying feature like an ASCII-fied checksum.
        • That's what RDF vocabularies are for.

          If you stick


          in the RDF declaration, that lets you do something like this:

              <foaf:name>Earle Martin</foaf:name>
              <foaf:mbox_sha1sum>8699ba79a95abf86e0055c133bf5d87ceab921e9</foaf:mbox_sha1sum>
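For readers wondering where a `foaf:mbox_sha1sum` value comes from: FOAF defines it as the SHA-1 digest of the mailbox's `mailto:` URI, which lets a record identify a person without publishing the address. A minimal sketch follows; the address is hypothetical and the function name is made up for this example.

```python
import hashlib


def mbox_sha1sum(email):
    """SHA-1 hex digest of the mailto: URI, as used for foaf:mbox_sha1sum."""
    uri = "mailto:" + email
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


# Hypothetical address, used only to show the shape of the result.
digest = mbox_sha1sum("someone@example.org")
print(digest)
```

The result is always 40 hexadecimal characters, like the value in the fragment above.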

          Of course, there's going to have to be a CPAN vocabula

          • Still needs a human to read, parse and understand the fact that <foo> represents a FOO in the real world, and to write the code to handle FOOs correctly. That is, it requires just as much work as understanding what 'author' means in a structure such as:

            $VAR1 = { 'author' => 'Sheerluck Holmes', 'title' => 'true crimes and how to avoid them' }

            or a YAML equivalent.

            Using XML-ish things does not help to define what your data is, regardless of what it says on the bottle of Kool-aid.

            • Oh, OK, maybe I didn't follow your meaning. I wasn't meaning to imply that using RDF (and in the vocabulary itself, OWL) would actually define what the data is. But yes, isn't that always going to be the case, until we have smart computers? At the moment, the closest thing to "encapsulated meaning" we have is Cyc, and that's a long way off from being the real thing. RDF vocabularies, as you say, are good for defining relationships between things.

              I don't think, though, that RDF was ever intended to be hu

              • I always try to either use something that is explicitly designed to be human-readable, like Data::Dumper (with purity and indent style 2) or more recently YAML; or something which cares not about being human-readable, such as Storable or some other binary format. RDF/RSS/XML, because it's ASCII, looks like it's meant to be human-readable, so I try to read it and get irritated.
  • rdf/cpan (Score:4, Interesting)

    by inkdroid (3294) on 2003.10.06 13:27 (#24687) Homepage Journal

    Wow, I really like this idea. Is the idea to serialize CPAN metadata in a similar way to how the Open Directory Project makes their data available? Speaking as an ex-librarian, your use of RDF and Dublin Core is commendable. People in the library and information science communities have been getting all excited about RDF and Dublin Core for years, and it's very cool to see someone putting it to practical use. I bet the semantic web folks would also be very interested to hear about your experiments.

    On a somewhat related note: while it's kind of eclectic, the Open Archives Initiative has developed a protocol for sharing large sets of metadata. The OAI-PMH provides a very simple framework for building data providers and data harvesters using a set of six verbs over XML/HTTP: Identify(), ListIdentifiers(), GetRecord(), ListMetadataFormats(), ListRecords(), ListSets(). While it might not be of direct use, it could be of interest if you are looking for ideas on how to allow people to update their local copies of CPAN metadata without grabbing the whole lot each time. The OAI-PMH has its roots in the arXiv pre-print server at Los Alamos, and is currently being used by quite a mix of data providers. Oh, and I wrote Net::OAI::Harvester for interacting with repositories :-)
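As a rough illustration of the "update without grabbing the whole lot" idea: an OAI-PMH request is an HTTP query with a `verb` argument, and `ListRecords` accepts a `from` datestamp so a harvester can ask only for records changed since its last run. The sketch below just builds such URLs and does no network I/O. The base URL is hypothetical, and no CPAN endpoint of this kind is implied.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs.
VERBS = {
    "Identify",
    "ListIdentifiers",
    "GetRecord",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request_url(base_url, verb, params=None):
    """Build an OAI-PMH request URL; arguments are emitted in sorted order."""
    if verb not in VERBS:
        raise ValueError("unknown OAI-PMH verb: " + verb)
    query = [("verb", verb)] + sorted((params or {}).items())
    return base_url + "?" + urlencode(query)


# Hypothetical repository; ask for Dublin Core records changed since a date.
url = oai_request_url(
    "http://repository.example.org/oai",
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2003-10-01"},
)
print(url)
```

A harvester would remember the date of its last successful run and pass it as `from` next time, so each update transfers only the changed records.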

  • Using RDF (Score:3, Informative)

    by ziggy (25) on 2003.10.06 22:13 (#24695) Journal
    This snippet doesn't look entirely kosher. The urn::filesize and urn::mimetype elements need to be placed into a proper namespace.

    The RDF format is rather, um, ugly to behold. It's good for interchange between apps, but greatly obfuscates the meaning for wetware parsers. I think the following is a faithful interpretation of the above example in Notation 3:

    @prefix cpan: <>.
    @prefix dc:   <> .
    @prefix misc: <urn:empty>.

        cpan:dist     "Acme-Colour";
        cpan:suffix   "authors/id/L/LB/LBROCARD/Acme-Colour-0.17.tar.gz";
        cpan:version  "0.17";
        dc:date       "2002-04-11T15:54:11";
        dc:format     "application/x-gzip";
        dc:identifier <>;
        dc:publisher  <>;
        dc:type       <>;
        misc:filesize "3151";
        misc:mimetype "application/x-gzip";
    Here are some important elements that are missing but should be trivial to add:
    • Author ID
    • DSLIP values
    • MD5 Checksum
    • Module Prerequisites (as determined by META.yml or whatnot)
    • Minimum Perl version required
    Nevertheless, this snippet of RDF is a very good start. Thanks!
    • It was just a fragment, so it had no namespaces. Thanks for the feedback, it does now. Also I added Author ID and MD5 Checksum. More metadata from CPANTS and META.yml to come soon. I used RDF/XML as it was the simplest thing possible at the time and RDF::Simple was, well, simple. Anyway, you can check it out at: (autrijus is hacking PAUSE so I can replace the file instead of releasing new versions all the time).
  • First off, XML isn't the only possible serialization of RDF. Second, and more importantly, I think it's reasonable for CPAN metadata to be stored/provided as YAML... so long as it can be unambiguously mapped to RDF for those applications that need/want it.
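One way to read "unambiguously mapped" is that every key in the flat record must have exactly one agreed predicate, and anything unmapped is rejected rather than guessed at. A minimal sketch under that assumption follows. The key names and the mapping table are hypothetical, with predicate names borrowed from the N3 example in this thread, and a plain dict stands in for a parsed YAML document.

```python
# Hypothetical mapping from flat metadata keys to RDF-style predicates.
KEY_TO_PREDICATE = {
    "dist": "cpan:dist",
    "version": "cpan:version",
    "date": "dc:date",
}


def record_to_triples(subject, record, mapping=KEY_TO_PREDICATE):
    """Turn a flat key/value record into (subject, predicate, object) triples.

    Keys without an agreed predicate raise KeyError instead of being guessed.
    """
    triples = []
    for key, value in record.items():
        if key not in mapping:
            raise KeyError("no predicate defined for key %r" % key)
        triples.append((subject, mapping[key], str(value)))
    return triples


# Stand-in for a record loaded from YAML.
record = {"dist": "Acme-Colour", "version": "0.17", "date": "2002-04-11T15:54:11"}
triples = record_to_triples("Acme-Colour-0.17", record)
for triple in triples:
    print(triple)
```

The real work lies in agreeing on the mapping table; once that exists, the storage format is a matter of convenience.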
    • I would argue that the world that uses XML/RDF is larger than the world that uses YAML. I have no statistics to back this up; it is just a gut feeling. Safety in numbers is not really a good argument, but I guess the main thing is that the data is *available* (thanks Acme), rather than what format it is in.
    • Actually, I'd argue with equal conviction that CPAN Metadata should be canonically stored in N3.

      My cat would argue even more strongly that we should design a database schema and shove all the data into {SQLite|MySQL|PostgreSQL}. Even Cats can understand third normal form. ;-)

      The one thing we really need is to agree on the triples and the meaning of the assertions that describe CPAN metadata. Everything else is just syntax. Mapping from one syntax or another (or deeming one syntax "preferred") is an e

  • Could you (and would it make sense to) add an rdf:about attribute to the Description tag pointing to the file on either the main CPAN site or the info page?