
All the Perl that's Practical to Extract and Report


Ovid (2709)

AOL IM: ovidperl

Stuff with the Perl Foundation. A couple of patches in the Perl core. A few CPAN modules. That about sums it up.

Journal of Ovid (2709)

Wednesday November 12, 2008
05:39 AM

Gutenberg API

[ #37863 ]

As far as I can tell from reading the archives and checking their Web site, Project Gutenberg does not appear to have an API. The closest I've found is an RSS feed and an RDF document. These don't really constitute an API, but the latter can be parsed for adding to an SQLite database. Still trying to figure this out, though. Trying to grab one version of their catalog in RDF format:

gutenberg $ tar -xjf catalog.rdf.bz2
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers
tar: Error exit delayed from previous errors

I was able to unzip their .zip version of the same file, but I was disappointed to learn that their Perl examples are rather old and no longer appear to properly parse the data.
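For anyone who wants to try the RDF-to-SQLite route, here is a minimal Python sketch, not a tested importer for the real catalog. It assumes the catalog is a bare bzip2 stream (which would explain the tar error above) and that each book is a pgterms:etext element with an rdf:ID attribute and dc:title / dc:creator children. The namespace URIs, element names, and the books table layout are all assumptions to check against the actual file.

```python
import bz2
import sqlite3
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions about the catalog.rdf layout; verify them
# against the real file before relying on this.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pgterms": "http://www.gutenberg.org/rdfterms/",
}
RDF_ID = "{%s}ID" % NS["rdf"]


def read_catalog(path):
    """Return the catalog as bytes. A .bz2 file is decompressed directly,
    since it is a plain bzip2 stream rather than a tar archive."""
    if path.endswith(".bz2"):
        with bz2.open(path, "rb") as fh:
            return fh.read()
    with open(path, "rb") as fh:
        return fh.read()


def _text(element):
    """Flatten an element's text content, or return None if it is missing."""
    if element is None:
        return None
    return "".join(element.itertext()).strip()


def parse_catalog(xml_bytes):
    """Yield (etext_id, title, creator) for each assumed pgterms:etext element."""
    root = ET.fromstring(xml_bytes)
    for etext in root.iterfind("pgterms:etext", NS):
        yield (
            etext.get(RDF_ID),
            _text(etext.find("dc:title", NS)),
            _text(etext.find("dc:creator", NS)),
        )


def load_into_sqlite(conn, rows):
    """Store parsed rows in a hypothetical 'books' table, replacing duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books "
        "(etext_id TEXT PRIMARY KEY, title TEXT, creator TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO books VALUES (?, ?, ?)", rows)
    conn.commit()
```

Usage would be something like load_into_sqlite(sqlite3.connect("gutenberg.db"), parse_catalog(read_catalog("catalog.rdf.bz2"))), where gutenberg.db is a made-up file name. This reads the whole document into memory; for a large catalog, ET.iterparse would probably be the better fit.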

But why would you care? Because I think I want to make this happen:

gutenberg --read "Art of War"
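A rough Python sketch of how that command line might resolve a title, assuming the catalog has already been loaded into an SQLite table. The books table, its etext_id and title columns, and the gutenberg.db file name are all hypothetical, and the "reading" step is only a placeholder print.

```python
import argparse
import sqlite3


def find_book(conn, title):
    """Return (etext_id, title) for the first substring match, or None.

    SQLite's LIKE is case-insensitive for ASCII by default. Any % or _ in
    the user's input is treated as a wildcard here.
    """
    return conn.execute(
        "SELECT etext_id, title FROM books WHERE title LIKE ? "
        "ORDER BY title LIMIT 1",
        ("%" + title + "%",),
    ).fetchone()


def main(argv=None, conn=None):
    parser = argparse.ArgumentParser(prog="gutenberg")
    parser.add_argument("--read", metavar="TITLE", required=True,
                        help="title (or part of one) to open")
    args = parser.parse_args(argv)
    if conn is None:
        conn = sqlite3.connect("gutenberg.db")  # hypothetical database file
    book = find_book(conn, args.read)
    if book is None:
        print("No match for %r" % args.read)
        return 1
    # Placeholder: a real tool would fetch and page through the text here.
    print("Would open %s: %s" % book)
    return 0
```

Calling main(["--read", "Art of War"]) mirrors the command above. Taking the first match is a simplification; a real tool would likely need to disambiguate editions and translations.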

You know, sometimes I worry about posting neat ideas to use.perl for fear that someone would jump the gun and Just Do It. I realize now that this is foolish for two reasons. First, they Won't Just Do It. Second, if they did, I'd be happy just to have the project done :)

Suggestions welcome. There needs to be an easy way to update the database, track what a user has read, allow them to "bookmark" a book (or better yet, "annotate" a document), etc. I've never used an eReader. I never gave a damn about them, really, because I like the feeling of a book in my hands. Still, this seems worthwhile.
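One possible shape for the tracking side, sketched in Python with SQLite. The table and column names are invented for illustration, "position" is assumed to be a character offset into the text, and a bookmark is modelled as an annotation with no note. It is a starting point under those assumptions, not a finished design.

```python
import sqlite3

# Hypothetical schema: one row of progress per book, many annotations per book.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reading_progress (
    etext_id   TEXT PRIMARY KEY,
    position   INTEGER NOT NULL DEFAULT 0,
    finished   INTEGER NOT NULL DEFAULT 0,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS annotations (
    id       INTEGER PRIMARY KEY,
    etext_id TEXT NOT NULL,
    position INTEGER NOT NULL,
    note     TEXT
);
"""


def open_store(path=":memory:"):
    """Open (or create) the tracking database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_position(conn, etext_id, position, finished=False):
    """Record where the reader stopped, overwriting any earlier position."""
    conn.execute(
        "INSERT OR REPLACE INTO reading_progress (etext_id, position, finished) "
        "VALUES (?, ?, ?)",
        (etext_id, position, int(finished)),
    )
    conn.commit()


def resume_position(conn, etext_id):
    """Return the saved position, or 0 for a book that was never opened."""
    row = conn.execute(
        "SELECT position FROM reading_progress WHERE etext_id = ?", (etext_id,)
    ).fetchone()
    return row[0] if row else 0


def annotate(conn, etext_id, position, note=None):
    """Add an annotation; with note=None it acts as a plain bookmark."""
    cur = conn.execute(
        "INSERT INTO annotations (etext_id, position, note) VALUES (?, ?, ?)",
        (etext_id, position, note),
    )
    conn.commit()
    return cur.lastrowid


def annotations_for(conn, etext_id):
    """List (position, note) pairs for a book in reading order."""
    return conn.execute(
        "SELECT position, note FROM annotations WHERE etext_id = ? "
        "ORDER BY position, id",
        (etext_id,),
    ).fetchall()
```

Keeping this in a separate table set from the catalog means the catalog can be dropped and rebuilt on update without losing what the user has read.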

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • You may be interested in IndexData's OpenContent Index Web Service.

    The service is somewhat un-RESTful, but it's still a useful API for searching for Gutenberg texts, as well as available titles from the Open Content Alliance and more.

    Oh, and a while back I wrote a CPAN module for talking SRU, which is the protocol the service uses. SRU is a little bit like OpenSearch on crack. It's not difficult to craft the URLs yourself, so maybe just using LWP::Simple or something would work better and hide less :-)

  • Surely you just need 'bzip2 -d catalog.rdf.bz2' to uncompress the file.
    -- Ed Avis
  • A web site is an API. :-) And newsfeeds are a widely supported subset of that.

    If you think otherwise, you’re thinking in terms of implementation, not in terms of interface. The web’s architectural goal is to make it not matter whether the document you receive is served from a static file, generated dynamically from an SQL database, served statically from a store other than the filesystem, or… whatever else. In the end there’s just documents with links you can follow, and that’s

    • Except that a documented API at least implies that if it's not static, the designers will at least try to minimize changes (that is, if the designers are aware of the issues involved). A Web site makes no such claims, in general. If they had something on their site which said "go ahead and scrape us, baby, it won't hurt!", then I'd be less worried. They don't say that, so the scraping route is, er, fragile at best.

      • Ah, that is what you actually meant. (Stability is not the first thing I associate with the term “API” – loose coupling makes the web work at all.)

        The Gutenbergsters should have a mailing list, do they not? Seems like a good idea to ask them if they’re willing to commit to permanent support of whatever they’d be willing to, and to state so publicly.

  • I had looked at packaging Gutenberg texts in dotReader with links to the chapters, etc. IIRC, it was going to be quite a mess at that point, but perhaps their editing and organization have improved.

  • This is something I've wanted ever since I've had a PDA (now a Blackberry) given to me by work.

    I haven't liked any of the mobile reader options I've seen, and since modern smartphones have a workable browser, I see little reason to build a Java app you have to port to different architectures.

    Anyway, I built a simple web app based on the very same catalog.rdf, and it's optimized for mobile browsing, by which I mean it has a very compact and minimal interface.

    You can sea