For a number of years I've been interested in the idea of open source data.
One of the lessons companies have learned over the last decade of the internet is that data has an interesting value property.
As you linearly increase the amount of data you hold, the value of that data grows superlinearly: doubling the amount of data more than doubles its value.
One of the earliest people (a venture capitalist) to appear on the "Google Speaker Series" videos was talking about this exact situation.
His opinion on the main business challenge for Google was that they needed to find ways to enhance and maintain the trust of their users, so that their users would be willing to continue to upload pretty much all their information into the vast Google databases.
Google's mission statement, "to organize the world's information and make it universally accessible and useful", is the user-facing side of this challenge.
To achieve this mission, Google needs to own (or at least have a copy of) pretty much all the data there is in the world. Google's profits come from their ability to store and process this huge data pile, producing products and services that (because of the scale of the source data) are more efficient than anything their competitors can make.
In the open source world (and in particular the CPAN world) this idea of having as much data as possible takes on a different dimension. We're generating data, but we're generating that data in silos, which is significantly limiting the value of that data.
In trying to convince people (and I'm looking here at PAUSE, rt.cpan.org and CPAN Ra(n)tings in particular) to export their data in a way that is immediately useful without the need for a separate parsing module, I think I might be running into the effort problem. The kind of people who own the data silos are the same kind of people that are too busy to come up with an export process.
With SQLite emerging as the "standard" method of exporting databases, and DBD::SQLite now rock solid, I've decided to try to fix this exporting problem.
The concept behind Xtract is to take any arbitrary DBI connection, identify the subset of data you want to export, automatically generate a SQLite schema to hold that data, and then package up the SQLite file for distribution.
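Xtract itself is written in Perl on top of DBI, but the overall pipeline (connect, select a subset, create the schema, fill the tables, package the file) can be sketched in a few lines using Python's built-in sqlite3 module. Everything here, including the function name `xtract_sketch`, is my own illustration and not Xtract's actual code.

```python
import gzip
import sqlite3

def xtract_sketch(source_path, dest_path, query, table):
    """Illustrative sketch of the Xtract pipeline, not the real tool."""
    # 1. Connect to the source and select the subset of data to export.
    src = sqlite3.connect(source_path)
    cur = src.execute(query)
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    src.close()
    # 2. Create the destination schema and fill it.
    #    (Column types omitted here; SQLite accepts untyped columns.)
    dst = sqlite3.connect(dest_path)
    dst.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    placeholders = ", ".join("?" * len(cols))
    dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    dst.commit()
    dst.close()
    # 3. Package the finished file for distribution.
    with open(dest_path, "rb") as f, gzip.open(dest_path + ".gz", "wb") as gz:
        gz.write(f.read())
```

The real tool does considerably more at step 2 (typed schemas, indexes, pragma tuning), which the rest of this post walks through.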
My plan for this project is to produce a standalone tool (similar to ack) that will be useful to a wide range of people beyond the Perl world.
The first of these tools is starting to take shape. A trivial command to export everything from one SQLite database into another SQLite database currently looks something like the following.
> xtract --from DBI:ORDB-CPANTS.sqlite --to foo.sqlite --index
Connecting to data source DBI:SQLite:ORDB-CPANTS.sqlite
Creating SQLite database foo.sqlite
Configuring SQLite database
Publishing table author (7346 rows)
Publishing table dist (16602 rows)
Publishing table kwalitee (16601 rows)
Publishing table modules (64397 rows)
Publishing table prereq (49631 rows)
Publishing table uses (155102 rows)
Indexing table author
Indexing table dist
Indexing table kwalitee
Indexing table modules
Indexing table prereq
Indexing table uses
Cleaning SQLite database
Creating gzip archive
Creating bzip2 archive
Extraction completed in 126 seconds
Of course, this doesn't do much of anything. It's really just an extremely expensive implementation of a file copy.
It does, however, do all the right things you might want to do when exporting a database. The SQLite database gets all the right pragma speed and size tweaks, and gets built in the legacy file format (compatible back to 3.0.0).
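The kind of pragma tuning described above can be sketched with Python's built-in sqlite3 module (Xtract itself is Perl, and the specific pragma values below are my own illustrative assumptions, not Xtract's exact settings).

```python
import sqlite3

def create_export_db(path):
    """Create an export database with speed/size pragmas applied.
    The pragma choices are illustrative, not Xtract's actual list."""
    db = sqlite3.connect(path)
    # Write the legacy file format so old SQLite 3.x clients can read it.
    db.execute("PRAGMA legacy_file_format = ON")
    # Trade durability for speed; fine for a throwaway export build.
    db.execute("PRAGMA synchronous = OFF")
    db.execute("PRAGMA journal_mode = OFF")
    # A larger page cache speeds up bulk inserts and index builds.
    db.execute("PRAGMA cache_size = 10000")
    return db

db = create_export_db(":memory:")
db.execute("CREATE TABLE author (id INTEGER, name TEXT)")
db.execute("INSERT INTO author VALUES (1, 'ADAMK')")
print(db.execute("SELECT name FROM author").fetchone()[0])  # ADAMK
```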
If you expect end users to run queries directly against the exported database (similar to how the ORLite::Mirror-based modules like ORDB::CPANTS consume SQLite exports), the --index option instructs Xtract to generate an index on every column of every table in advance.
For now this is an option you have to enable explicitly, because generating all of these indexes appears to double the size of the database. In a future release, I'll have Xtract optionally generate multiple variants of the database (with and without indexes).
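What --index does can be sketched as: walk every user table, read its column list, and emit one CREATE INDEX statement per column. Again this is an illustrative Python/sqlite3 sketch with names of my own invention, not Xtract's Perl implementation.

```python
import sqlite3

def index_all_columns(db):
    """Create one index per column on every user table (a sketch of --index)."""
    tables = [r[0] for r in db.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk).
        for _, column, *_ in db.execute(f"PRAGMA table_info({table})"):
            db.execute(
                f"CREATE INDEX IF NOT EXISTS idx_{table}_{column} "
                f"ON {table} ({column})")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE author (id INTEGER, name TEXT)")
index_all_columns(db)
```

Since every index is a full copy of one column's data plus row pointers, it is easy to see why indexing every column roughly doubles the file size.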
A more interesting feature of Xtract is auto-typing. This lets you export (literally) any DBI statement. Provide an export table name and any SQL query, and Xtract will do a two-pass export of the query. The first pass scans the data returned by the query, identifying the type of data in each column, checking for null vs not null, and finding the best SQLite column type to fit the data. Then it takes a second pass over the query to do the actual table filling.
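The first pass of that two-pass export can be sketched as follows: scan every row, widen each column's candidate type as needed (INTEGER to REAL to TEXT), and note whether any NULLs were seen. This is a simplified Python illustration of the idea, not Xtract's actual inference code.

```python
def infer_column_types(rows):
    """First pass of a two-pass export: pick the narrowest SQLite type
    (INTEGER < REAL < TEXT) that fits each column, tracking NULL-ability.
    A sketch of the auto-typing idea, not Xtract's real implementation."""
    types = None
    nullable = None
    for row in rows:
        if types is None:
            types = ["INTEGER"] * len(row)
            nullable = [False] * len(row)
        for i, value in enumerate(row):
            if value is None:
                nullable[i] = True
            elif isinstance(value, int):
                pass  # still fits the current (or any wider) type
            elif isinstance(value, float):
                if types[i] == "INTEGER":
                    types[i] = "REAL"  # widen INTEGER to REAL
            else:
                types[i] = "TEXT"  # anything else falls back to TEXT
    return [t + ("" if n else " NOT NULL") for t, n in zip(types, nullable)]

print(infer_column_types([(1, 1.5, "a"), (2, None, "b")]))
# ['INTEGER NOT NULL', 'REAL', 'TEXT NOT NULL']
```

The second pass then simply re-runs the query and fills a table declared with these inferred types.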
This wasn't needed in the demonstration case above, because when you export an entire table Xtract will first try to use the DBI APIs to fetch the table metadata directly, rather than doing data sampling.
My short term goals for Xtract are to get table metadata support implemented for MySQL, Postgres and Oracle sources, so that I can deliver something for Andreas to use to export the main PAUSE database.
I'm also hoping to use OpenAustralia.org as a non-Perl demonstration project, because I've been looking for an Australia-specific project for a long time. These guys track everything that is said in the two houses of parliament, but it's currently locked up in their internal MySQL database. It's also very, very large, which makes it a great test case for optimisation work.
If you have an interest in data munging or ETL, and you are interested in helping improve Xtract, I'd love your help.
You can check out a copy of the current code at http://svn.ali.as/cpan/trunk/Xtract/.