Journal of mugwumpjism (1871)

Tuesday August 07, 2007
01:49 AM

Scriptalicious 1.11

Apparently people use Scriptalicious; more than I thought. I have fixed a couple of the longstanding niggles with this module and made a release. See shortlog for a complete list of changes or the bundled Changes POD.
Thursday August 02, 2007
02:57 AM

Pumpkings, past and present

As I retrace the steps and patchwork of the early pumpkings, I have had the pleasure of meeting and discussing the work with two of them recently, Tim Bunce and Gurusamy Sarathy. Tim was pumpking for the 5.004_* maintenance series and many _50+ releases; Gurusamy looked after a few after Malcolm Beattie. They gave me some hints and tips, particularly explaining what the goodies in Porting/ are about and clarifying the attribution styles.

It turns out that the Perforce backing store might be easy enough to trace through, making for a smooth, clean conversion. The extra information will be accompanied by a Catalyst application. This is in addition to a third-party extraction of the information performed by John Peacock - all credit due to him for performing that conversion in something like record time. More data and conversions are good, especially when they can be used to cross-check each other.

Wednesday August 01, 2007
03:26 AM

I has a loverly bunch of Catalysts

Here they are all sitting on my servers.

Beware, they're not such pretty Catalysts

rebasing soon with good committers!

Once I see if it's in the repo anywhere :). And other good things like all branches visible. See also a BAST import. Couldn't see DBIx::Class there - what happened? Well, if anyone wants to help muck in with the conversion, grab the source data, and tell me where it needs to go. You can reply on this comment if you like. It's a bit like a bug tracker, except that you can more easily ignore foolish requests. I can give out logins to that machine as required.

Anyone brave enough to add support to git-svn to mirror the root path of projects and mesh together a superproject using git-submodule? This would be a useful final audit. Also looking for people who might be keen to set up a repo on the UTSL network.

It's unfinished - some repos are yet to be copied. No claims of accuracy or completeness yet, but they're coming soon.

Wednesday September 06, 2006
10:12 AM

cpan6 - moving forward

Mark Overmeer gave a talk, with support from myself, at YAPC::Europe 2006 about cpan6; the design so far is the result of a collaboration between Mark and myself. The talk was generally well received, and during the conference we heard many more people's concerns. The good news is that there were no new requirements that didn't fit cleanly into the design; in fact, it gave some people lots of ideas. I think I can say that we have support for the general direction of things, and now we can open up the debate widely and start implementing pieces.

I invite people to join either the pause6 mailing list (for infrastructure discussions) or the cpan6 tools list (for client-side installers and upload tools).

The earliest task will be looking at the big picture, and seeing which pieces are the low-hanging fruit that we can write tests for straight away. I'll start the ball rolling after we have a few subscriptions.

Wednesday August 23, 2006
07:27 AM

Database - Slave or Master? 3 of 3 - Integration

This story begins with an effort to store Moose classes in a Tangram store. Specifically, converting from Moose::Meta::Class objects to a Tangram::Schema structure.

The structures are already quite similar. In the Tangram schema, you have a per-class map of (type, name, (details...)). In Moose::Meta::Class, you have a map of (attribute, (details...)), where the details include a type constraint. Based on the type constraint, you can guess a reasonable type. Well, not quite. The next thing you really need is Higher Order Types on your type constraints (called parametric roles in the Perl 6 canon). In a nutshell, that's not just saying there's an Array somewhere, but saying there's an Array of something. Then you can make sure that you put an actual foreign key or link table at that point in the schema, rather than the oid+type pair that you get with Tangram when you use a ref column (and, in recent versions, without specifying a class). Getting parametric roles working in Moose is still an open question, but certainly one I hope to find time for.

So, during this deep contemplation, I thought, well, what would Tangram be adding? I mean, other than the obvious elitism and other associated baggage? Why not just tie the schema to the Moose meta-model, and start a new persistence system from scratch? Or use DBIx::Class for all the bits I couldn't be bothered re-writing?

In principle, there are reasons why you might want the storage schema and the object metamodel to differ. You might not want to map all object properties to database columns, for instance. Or you might want to use your own special mapping for them - not just the default.

Then I thought, how often did I do that? I added a transient type in Class::Tangram for columns that were not mapped, but only rarely used it, and never for data that I couldn't derive from the formal columns or some other truly transient source. I only used the idbif mapping type for classes when I didn't have the time to describe their entire model. So, perhaps a storage system that just ties these two things together would be enough of a good start that the rest wouldn't matter.

The Evil Plan to NOT refactor Tangram using DBIx::Class

Ok, so the plan is basically this. Take the Tangram API, and make the core bits that I remember using into thin wrappers around DBIx::Class and friends. Then, all of the stuff under the hood that was a headache working with, I'll conveniently forget to port. That way, it won't be a source compatible refactoring, just enough to let people who liked the Tangram API do similar sorts of things with DBIx::Class.

The first thing I remember using is a schema object for the connection, if only because of acme's reaction when I say "schema". In a talk I'd use a UML diagram at this point, but given <img> tags are banned, instead let's use Moose code.

package DBIx::Moose::Schema;
use Moose;
has '$.classes' => (is => 'ro',
                     isa => 'Set of Moose::Meta::Class',
                     );

Alright. So, we have a schema which is composed of Moose Classes. The next thing we need is a Storage object that has the bits we want;

package DBIx::Moose::Storage;
use Moose;
use Set::Object qw(weak_set);
has '$:db' => (is => 'ro', isa => "DBIx::Class::Schema");
has '$:objects' => (is => 'rw', isa => "Set::Object",
                     default => sub { weak_set() } );
has '$.schema' => (is => 'ro',
                    isa => "DBIx::Moose::Schema");

That weak_set is a little bit of magic I cooked up for nothingmuch recently. All we're doing is keeping references to the objects we've already loaded from the database, primarily for transactional consistency. Actually, Tangram uses a hash there, mapping each oid to a weak reference to the member with that oid, but I think that oids suck. In Perl memory, the refaddr can be the oid.
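
To make that behaviour concrete, here is a minimal sketch (not from any of the modules above; the class is a throwaway) of how a weak set acts as an identity map without keeping objects alive:

use Set::Object qw(weak_set);
use Scalar::Util qw(refaddr);

# a weak set holds its members by weak reference: once the last
# "real" reference to an object goes away, it silently drops out
my $live = weak_set();

{
    my $obj = bless {}, 'My::Throwaway';
    $live->insert($obj);
    print $live->includes($obj), "\n";   # 1 - still referenced here
    print refaddr($obj), "\n";           # the in-memory "oid"
}                                        # $obj goes out of scope...

print $live->size, "\n";                 # 0 - the set did not keep it alive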

And we'd need an overloaded query interface;

package DBIx::Moose::Remote;
use Moose;
has '$._storage' => (is => 'ro', weak => 1,
                     isa => "DBIx::Moose::Storage");
has '$._class' => (is => 'ro',
                   isa => "Moose::Meta::Class");
has '$._resultset' => (is => 'ro',
                       isa => "DBIx::Class::ResultSet",
                       default => \&_rs_default,
                       );
sub _rs_default {
     my $self = shift;
     $self->_storage->resultset($self->_class);
}

So, hopefully, the DBIx::Class::ResultSet API will be rich enough to be able to deal with all the things I did with Tangram, or at least it will given enough TH^HLC.

There will be a bit of double-handling of objects involved. Basically, the objects that we get back from DBIx::Class will be freed very soon after loading, their values passed to a schema-specified constructor (probably just Class->new), and then the slots that contain collections which are not already loaded will be set up to lazy load the referent collections on access. This happens already in Tangram; the intermediate rows are the arrayrefs returned by DBI::fetchrow_arrayref(). So there will be lots of classes, perhaps under DBIx::Moose::DB::, that mirror the objects in the schema. Perhaps we don't need that, but it should be a good enough starting point, and if it can be eliminated entirely later on, then all the better. (Update: Matt has kindly pointed me to the part of the API that deals with this; this shouldn't be a problem at all)
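
A rough sketch of the inflation step I have in mind - none of this is real DBIx::Moose code, and the CD/Track domain classes are invented for illustration:

# take a DBIx::Class row, build the domain object, and defer the
# 'tracks' collection until somebody actually asks for it
sub inflate_cd {
    my ($row) = @_;                      # a DBIx::Class row object

    my %cols = $row->get_columns;        # plain column => value pairs
    my $cd   = CD->new(%cols);           # hypothetical domain class

    $cd->{_tracks_loader} = sub {        # stash a closure instead of the rows
        [ map { Track->new($_->get_columns) }
              $row->related_resultset('tracks')->all ];
    };
    return $cd;
}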

Mapping the Index from the Class

One of the nice things about a database index is that it's basically a performance 'hack' only (because databases are too dumb to know what to index themselves), and does not actually affect the operation of the database. So, for the most part, we can ignore mapping indices and claim we are doing the 'correct' thing ;).

That is, unless the index happens to be a unique index or a primary key. What those add is a uniqueness constraint, which does affect the way that the object behaves. So, what of that?

Interestingly, Perl 6 has the concept of a special .id property. If two object references have the same .id property, then they are considered to be the same object. This has some interesting implications.

After all, isn't this;

class Book;
has Str $.isbn where { .chars < 255 };
method id {
     $.isbn;
}

The same thing as this?

CREATE TABLE Book (
    isbn VARCHAR(255),
    PRIMARY KEY (isbn)
);

So, we can perhaps map this in Perl 6 code, at least map one uniqueness constraint per type. Generalising this to multiple uniqueness constraints is probably something best left to our Great Benevolent Navel-Gazers. In the short term, we'll need to come up with some other way of specifying this per class; probably a Moose::Util::UniquenessConstraint or somesuch.
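
As a placeholder for whatever that ends up looking like, here is a minimal sketch - the Book class and the _unique_on hook are purely hypothetical, not an existing Moose or Tangram API; the idea is just that the mapper reads the hook when generating DDL, rather than Moose enforcing anything itself:

package Book;
use Moose;

has 'isbn'  => (is => 'ro', isa => 'Str', required => 1);
has 'title' => (is => 'rw', isa => 'Str');

# hypothetical hook: which attributes form a uniqueness constraint,
# for the mapper to turn into a UNIQUE index or primary key
sub _unique_on { qw(isbn) }

1;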

Mapping Inheritance

Alright, so we still have inheritance to deal with. But wait! We've got a bigger, brighter picture with Moose. We've now got roles.

Fortunately, this is OK. The Tangram type column was only ever used (conceptually, anyway) to derive a bitmap of associated (ie, sharing a primary key) tables that we expect to find rows in for a particular tuple. So, if we map the role's properties to columns, then we only have to "duplicate" columns for particular roles, if those roles are composed into classes that don't share a common primary key.

The other features

Well, there may be other important features that I'll remember when the time comes, but for now I think there's enough ideas here to form a core roadmap, or at least provide a starting point for discussion.

Monday August 21, 2006
05:13 PM

Database - Slave or Master? 2 of 3 - Object Persistence

One of the coolest things about "Object Persistence" is that it has the word "Object" in it, which of course means better than anything that was around before "Object Oriented Programming" was decided to be flavour of the decade. Even better than that, it even has the word "Persistence" in it, which sounds much more sophisticated and modern than "Database" or "Relational".

Then there are shiny tools in this space, like Hibernate for Java. Using Hibernate, you can make an Enterprisey Macchiato Java Servlet that blends your database to a set of Java objects, and then provides Soaped-up Beany Hyper-galactic XML Web services for other parts of your Enterprisey code base to access. It's fantastic - you end up with a set of tables (all with surrogate IDs, of course) that you are guaranteed not to be able to write to safely from anything except the Java server. This puts the Java developer in control. Which is the way (s)he likes it. Maybe Hibernate doesn't have to work like this, but (s)he prefers it because it means that all the changes to the objects have to go through the same central set of functions. Otherwise, the development investment is wasted. And we can't have that, not at the price it cost.

Anyway, Tangram is not quite so antisocial as that. It at least behaves transactionally, given appropriate use of $storage->unload_all (also ->recycle) and distribution of clue. But it is currently anti-social in other ways, such as the surrogate ID requirement.

Wait a minute - the database has .oids, too

Postgres and Oracle both have a concept of .rowid; all tables except for 'index organised tables' have them by nature. I have observed that in the vast majority of code that uses the Tangram API, I never need to use or refer to this .id; in fact, when storing an object in multiple stores, its .id will vary across those stores. In light of this, while I consider surrogate IDs a design flaw, it's not a tragic one - it's consistent with what the database does anyway, and it has allowed for interesting patterns to be built in the meantime while better ideas come forth. For a more detailed analysis of what I think is wrong with Tangram, see the bugs POD file in the Tangram distribution, especially the section on surrogate type columns (actually I've just tidied those up, so if you're reading this before I make a release then read the fresh one here).

What defines "Object Persistence"?

Again, hazarding a set of common features of object persistence tools that could plausibly form part of a definition;

  1. They normally do have requirements of the database; usually not all valid DDL models can be mapped to a set of objects.
  2. They will map features of objects not usually considered relational concepts such as inheritance and Perl structures like Arrays and Hashes.

What's so cool about Tangram

The key design feature of Tangram is what is frequently referred to as being orthogonal - it is normally non-intrusive on the objects being stored. A given object may even exist in multiple stores simultaneously (but be represented by the same Perl object). The result? Classes do not need to be aware of their storage, any more than a tuple needs to know it's being stored in a table space.

This is implemented with Lazy Loading. The in-memory data structure is considered equivalent to the database form; via types such as Tangram::Type::Set::FromOne, it is possible to follow joins between tables by just walking Perl objects with visitor iterators like Data::Dumper.
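
For example, something like the following sketch - the class and field names are invented, and the remote/select calls are the same ones shown in the query section below:

# a minimal sketch: following a lazy-loaded relation by plain traversal
my ($r_artist) = $storage->remote('Artist');
my ($artist)   = $storage->select(
    $r_artist,
    $r_artist->{name} eq 'The Black Seeds',
);

# the CD collection is only pulled from the database when we look at it;
# this assumes a Set::Object-backed Tangram set collection
for my $cd ($artist->{cds}->members) {
    print $cd->{name}, "\n";
}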

Tangram Querying

For the cases where you have specific questions for your data model, and you are not just following adjacent relations between objects, lazy loading is not enough. We still need some form of query syntax.

For this, Tangram uses Tangram::Expr objects that represent database objects - and they use overload so that you can write your query expressions using standard perl operators (as far as overload allows). Depending on your inclination, you either "run screaming" from this syntax or love it.

In my experience, Tangram's query syntax makes some previously hard queries easy, and some "impossibly difficult" queries easy. You can build intricate joins with a consistent notation. For example, process a form, make a list of Tangram::Expr fragments, and then combine them into a filter that can be used for multiple queries.

  # List::Util's reduce is needed below to combine the filters
  use List::Util qw(reduce);

  # get the table aliases
  my ($r_artist, $r_cd, $r_track)
      = $storage->remote(qw(Artist CD Track));

  # build a set of filter expressions - some of these
  # represent joins.
  my @filters =
      ( ( $r_artist->{name} eq "The Black Seeds" ),
        ( $r_cd->{artist} == $r_artist ),
        ( $r_cd->{tracks}->includes($r_track) ),
        ( $r_track->{name} eq "Heavy Mono E" )      );

  # AND them all together
  my $filter = reduce { $a & $b } @filters;

  # then use them for queries
  my (@cds)    = $storage->select( $r_cd, $filter );
  my (@tracks) = $storage->select( $r_track, $filter );

The query there is already getting reasonably impressive; the first ->select() maps to:

SELECT
    t1.id,
    t1.type,
    t1.artist_id,
    t1.name
FROM
    CD t1,
    Artist t2,
    Track t3
WHERE
    t1.artist_id = t2.id         AND
    t2.name = "The Black Seeds"  AND
    t3.cd = t1.id                AND
    t3.name = "Heavy Mono E"

This is a simple example, and I have found that there are very few real queries on a well-designed schema that do not map well to this syntax. That being said, sub-selects require an undocumented syntax, and while I have some sympathy for the notion that you should be able to write sub-selects as joins most of the time, it's certainly a sign that the API hasn't been extended in all directions yet.

Tangram Maps Inheritance

There are those who would say inheritance is about as relational a concept as an .mdb file, but I think that there is adequate justification for its use in data modelling.

A good question to ask when checking that a relational schema is in normal form is "what does this relation mean?" or "what fact is being represented by this tuple?". We can ask this question for all tables - and the basic answer is "there exists an object with these values"¹. The fact is that the object exists. Better answers can be made for individual tables; consider that answer a template - ask a meta-question, get a meta-answer.

This is where the argument for inheritance stems. The relations still describe existence of an object, but certain types of objects will have extra items in their tuple - relations to the extra properties bestowed upon them by their sub-classes.

In the CD store schema, for instance, 'Artist', 'Person' (perhaps better called 'Musician') and 'Band' are related like this. The justification is, that an artist can be either a musician or a band, but if we are relating to something in its capacity as an artist (ie, from the CD table, to say who released it), there also exists by association a relationship between the CD and all of the properties of the artist in its capacity of a musician or a band.

Tangram short-cuts the query overhead of this situation using a 'type' column. The type column is an index into the schema, and is used to derive a bitmap of which extra tables associated with a base class are expected to have rows for this primary key. This is a de-normalization of data, so technically a hack - as noted in Tangram::Sucks, it should be possible to detect the type using the presence or absence of tuples in the corresponding tables (or, somewhat equivalently, NULLs when using "Horizontal" mapping - see Tangram::Relational::Mappings for a description of these terms). I'm told that David Wheeler's Object::Relation can work like this.

But what about the Schema?

Having a schema structure that is free from side effects can be quite useful. Tangram has this down well; its input is a plain set of Perl hashes and arrays, with no side effects. If you want to use the pure objects to create code, you can still pass them to Class::Tangram::Generator. If you want to connect to storage, pass them to Tangram::Storage. T2 was my attempt at making a set of objects that can both describe the Tangram model relationally and itself be stored in a Tangram database. This is useful for building rapid application development / Computer-Aided Software Engineering (RAD / CASE) tools. Consider Umbrello; it could not compile classes as the objects were being manipulated, otherwise you might override internal behaviour and break the running program!

You don't have to write comprehensive schemata any more

Consider the package manager, Fink. Whilst using Storable for persistence can make applications like Fink faster by reducing parse time to load their entire state at start-up, it is still not as fast as a Berkeley, ISAM or SQLite-style database which is loaded on demand for small accesses.

The general approach is not making the whole schema relational in one go, but instead cherry-picking out the columns that you think are useful enough to be indexed, and throwing the rest into a single column that contains a Storable, Data::Dumper or even YAML data field which is used to construct the rest of the object. Tangram::Type::Dump::Any is built for this. I wrote a Tangram schema for Fink that does this, which is lurking here.
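
A rough sketch of what such a schema fragment can look like - the class and field names are invented, the field-spec syntax is from memory rather than copied from the Fink schema, and the exact spelling of the dump-column spec should be checked against the Tangram::Type::Dump::Any documentation:

use Tangram;

my $schema = Tangram::Schema->new( {
    classes => {
        'Fink::Package' => {
            fields => {
                # only the columns worth querying and indexing are mapped
                string => [ qw(name version) ],
                # the rest of the object would be frozen into a single
                # Tangram::Type::Dump::Any ('idbif') column declared here
            },
        },
    },
} );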

You end up with a data source which can be queried on all mapped columns, and almost all code that was written for the old, non-Tangram system works too - because previously, the only option was to follow Perl references, and we've made sure they all get lazy loaded.

Where Object Persistence Wins

RAD-developed, and imported models
In the RAD case, the model for your program is developed with a tool; the relational mapping is then derived by mutating the generated model.
In the imported case, it comes from the metamodel of another module, such as Class::Tangram or Class::Meta.
In both of these cases, a general form of translation is "all that is required" - write a few rules about how to convert from one metamodel to another, and you have automatic Object Persistence. Sadly this "all that is required" part can get quite difficult to understand and debug.
retro-fitting storage around existing objects
This works out best when you have code that already stores via something like Storable, and hasn't been written relationally in the first place, just like Fink.

Yes, I know this is another absurdly long post in a multipart series. In this case that's mostly because I have more to say about it, rather than being a particular endorsement of the approach. But more on what I will endorse in the next part.

Footnotes:

  1. Yes, I know there is a widely circulating school of thought saying "that's not The Right Way™ to do object-relational mapping, you should be using object values as columns and tuples as object relations". The former isn't available in current databases, and the latter is done using classes that consist only of foreign keys (Tangram::Type::*::FromOne relations).
Saturday August 19, 2006
11:53 AM

Database - Slave or Master? 1 of 3 - Database Abstraction

After the ACID revolution of the 1960s, Relational Database Design was the next big thing during the late '60s and '70s. It marked an evolutionary step forward from the Hierarchical models of early ACID-conformant systems; for after all, it included the hierarchical model, as any hierarchy can be expressed as relations¹, yet transcended it by expressing structures that didn't fit hierarchies.

And it has some solid theory behind it as well - the relational model has strong roots in mathematics and logic, and so you can expect that University-goers will be poring over it with a bit more scrutiny and peer review than your average use.perl.org columnist.

Through all this, we have a decent set of experience for looking at data management problems through the goggles of the Relational Model, of which modern Relational Database Management Systems (RDBMS's) provide a reasonable approximation². We have built it up logically with key concepts such as constraints, sequences, triggers, joins, views, cursors, etc, and well-known performance hacks such as indices, partitioning or materialized views. And this logical layering is what allows us to build complex RDBMS's and database applications that do not violate the strict requirements of ACID. Well, some of us, some of the time. I won't say it's easy to do it without making mistakes.

We have a set of rules that let you decide whether data in the model is normalized - that is, not duplicating or aggregating any other information in the database - or de-normalized. We should be able to look at a table and decide whether that auto_increment primary ID key is actually a normalized and valid member of the model (such as a customer or invoice number), or whether it is just a surrogate ID thrown on the row so that the programmer doesn't have to guess whether table.id exists or not - one that does not actually mean anything in terms of the data model.

We have a graphical language of notation, called crowfoot diagrams (example). And this is a very good common language.

We even have Relational abuses such as stored procedures and NULL values².

But we want a common language for writing Perl components, not just for talking to DBAs or writing database schemas. We cannot write entire applications in SQL. And nor do we want to.

What defines "Database Abstraction"?

For the heritage of this term, we can look to Dave Rolsky's POOP Comparison document. POOP stands for Perl Object-Oriented Persistence, and stands out as one of the worst acronyms for a user group ever.

So, "Database Abstraction" is my own refactoring of the term "RDBMS/OO Mapper" from the above document. Modules such as DBIx::Class and Dave's Alzabo clearly fit into this category.

Allow me to hazard some key characteristics of modules strictly in this category;

  1. they (in principle) do not have particular requirements on table layout, such as surrogate IDs or type indicators
  2. they do not try to represent or provide concepts not described by orthodox relational model literature, such as inheritance

Perhaps I'll think of some others as time progresses; I'll try to add them here if I do.

What's so cool about DBIx::Class

In a nutshell, it does the Database Abstraction part very well, with a clean modular implementation via Class::C3 - which isn't quite as chic as Moose, but close enough that it's probably not worth re-writing DBIx::Class in the near future. It has active maintainers, it has releases, it has users, it has mailing lists and IRC and all those other indicators of projects which are "succeeding".

One thing I particularly like about its API is DBIx::Class::ResultSet. In particular, the way that you don't get tables from your schema, you get result sets that happen to cover all objects. What's more, they don't actually run the query until you use them, which makes for easy piecemeal building of simple-ish queries.
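
For instance, something along these lines - the schema, result source and column names are invented for illustration:

# nothing hits the database yet; each search() just refines the resultset
my $cds      = $schema->resultset('CD');
my $recent   = $cds->search({ year => { '>=' => 2000 } });
my $by_seeds = $recent->search(
    { 'artist.name' => 'The Black Seeds' },
    { join => 'artist' },
);

# the SQL is only built and run when we actually ask for rows
my @rows  = $by_seeds->all;
my $count = $by_seeds->count;   # issues a separate COUNT query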

Driving the Perl Objects from the Database Schema

One of the most popular DBIx::Class extensions, which I also think is pretty nifty, is DBIx::Class::Schema::Loader. This actually connects to a database, uses DBI's various interfaces for querying the table structure in about as DB-agnostic a way as you could imagine a tool of its class doing, and then calls the relevant DBIx::Class hooks to create classes which are a reasonable representation of what it found in the database.
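
Current releases look roughly like this - the DSN, credentials and schema name are placeholders, and the exact loader options vary between Loader versions:

package My::Schema;
use base 'DBIx::Class::Schema::Loader';

# introspect the database and generate one result class per table
__PACKAGE__->loader_options( relationships => 1 );

package main;
my $schema = My::Schema->connect('dbi:Pg:dbname=mydb', 'user', 'pass');

# the generated classes are now available as ordinary result sources
print join("\n", $schema->sources), "\n";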

For those people who are adamant that best practices be strictly followed, and normalization guidelines honoured, this works very well - and it sure is a delight when you have an application with a database clean enough for this to work without tweaking the schema. Then again, those developing applications from scratch might prefer writing in DBIx::Class directly.

What's the model of your model?

In all of the above scenarios, but particularly with the Loader, the model (ie, schema) of your database has a meta-model (ie, governing schema form). It is a very close relative of the Data Definition Language, DDL - CREATE TABLE statements and so-on that tell the database what to do. And that is perhaps key to the success of DBIx::Class and perhaps all other modules that work like this - they piggy back on the success of the relational model.

It should be noted that the DBIx::Class meta-model is currently implicit; there is, for instance, a DBIx::Class::Schema module that lets you create objects for a model, but they just go ahead and make the classes immediately rather than as a separate step. The closest thing I could find to a pure set of data structures that represent the schema was probably DBIx::Class::Schema::Base, but even that had the "side effect" of compiling the classes into Perl as the schema is constructed.

But that's not necessarily a harsh critique of a real problem. As an exercise, and for a RAD (Rapid Application Development) tool I was writing at the time to procrastinate from building a real application for a VC project, I developed a set of modules for Tangram called T2 that described the Tangram meta-model using the Class::Tangram meta-model. I later found myself wanting to do the same thing to Class::Tangram itself - that is, have Class::Tangram re-entrantly be its own meta-model. Other people have tried this sort of thing, too - Kurt Stephens' UMMF, David Wheeler's Class::Meta, etc. Metamodelling really amounts to the data modeller's equivalent of navel gazing - ie, fruitful but only with good practice and a clear mind. I admire Stevan Little's accomplishment with Class::MOP in this regard, which is why I didn't cut my Moose talk.

But I digress. Why don't I summarise the usage scenarios where I think the Database Abstraction approach really wins.

Summary - Where Database Abstraction Wins

There we go, large heading and everything. I have observed Database Abstraction to be effective, both in my own practice but more in others, in two situations:

Well designed models
If the information has been modelled well using classical set theory notions, those notions are adequate for the task at hand, and there is little denormalization present in the data, then any approach that ends up getting to DBIx::Class classes will work well.
retro-fitting existing models
The DBIx::Class::Schema::Loader wins here. You already have a set of tables, you've defined your foreign keys properly using constraints and what-not, and it's not just a bunch of integer id keyed data dumping grounds, so just go ahead and load it all using a set of clearly-defined conventions.

Right, time to collect a free meal for my delayed flight, then I'll have a crack at part 2.

Footnotes:

  1. Yes, querying hierarchies in SQL sucks and usually relies on vendor-specific extensions which are inflexible and not portable. We will get to this a bit more in part 2, hopefully.
  2. Insert long rant about NULL values and duplicate rows here.
09:55 AM

International Transit Lounges, what fun

Well, here I am sitting at one of the handy power and internet outlets in Changi Airport in Singapore, hoping the paranoia caused by missing an international flight the last time I lost track of time sitting here will prevent the same from happening again. Checking in at the transit check-in desk, they informed me of a slight delay in my outgoing flight to Amsterdam, of the order of 5-7 hours. So, I've got another all-nighter to pull through - I wonder if I'll be ID'd at 4am by any assault-rifle-clad security staff this time around. On the bright side, that means I should be taking off between 8am and 10am in my home time zone. So, if I sleep-deprive myself now I'll hopefully get some sleep on the 10+ hour leg over the continent, and also hopefully not miss a boarding call for my delayed flight being brought forward. Fun, fun, fun.

What better thing to do when sleep deprived but write talks, or in this case, the second set of rants I'm passing off as substitutes for my withdrawn YAPC talks. This second talk I really hated withdrawing; but sadly, I had some crazy things happen to me in the 11th hour of preparation, and when you're a hard core procrastinator like me, that can really throw a spanner in the works because that last hour is where most of the work gets done. So, I'll put the material and 'argument' here, and hopefully still be in the position to turn it into a good talk with slides and examples for the Australasian Perl conference in December (OSDC). Much, much kudos to my employer, Catalyst IT, for sending me to speak at such an insane number of international conferences this year (OSDC will be my third).

In case anyone missed it, this was the advertised talk topic:

Database - Slave or Master? DBIC vs Tangram

Whilst the DBI may be an excellent provider of database driver independence, just about every programmer who starts using the DBI ends up either building their own abstractions to its interface, or using somebody else's. As a result there are a multitude of modules in this space with significant overlap in functionality.

This talk compares two major categories of database management libraries - "Database Abstraction" (DDL-driven) and "Object Persistence" (metaclass-driven). DBIx::Class (a module with some design roots in Class::DBI) and Tangram (a prevayler-style persistence system) are examined as mature examples of each of these styles of access.

The plan at this point is to break it into three logical chunks. In the first part, I will put across my thoughts about the traditional approach of Database Abstraction used by DBIx::Class and other modules, where the database is considered to be the centre of the information. If nothing else, that should help clarify things like terminology and make sure that readers of the later parts are on the same page as me. In the second part, I will discuss the alternate approach used by Tangram, as well as its key advantages and failings. In the third part I will outline how I think this schism can be closed without losing the benefits of either, or having to rebuild your applications from scratch (again).

Let the rambling begin.

Monday August 14, 2006
01:48 AM

What people love about their VCS - Part 4 of 4. darcs

With the shining review of git just posted, it seems there would be little ground left for other tools to show distinction.

However I respect and admire darcs on several grounds, and there are still clear and useful development interactions for which darcs has an advantage over all current git porcelain¹.

It's also properly distributed

Firstly, it should be noted that almost all of the distributed development advantages of git also apply to darcs. darcs also uses a single directory for its repository, so 'grep -r' is ok from sub-directories, and like git, it keeps these repositories with the checkouts so you can freely move and copy your checkout directories without worrying about using special commands or updating some mapping file in an obscure location in your dotfiles.

darcs has not been scaled to massive projects, instead focusing on smaller ones (say, a few thousand commits), where the extra functionality is considered more important than speed. That said, newer darcs repositories show the first traces of content hashing, which has already made drastic improvements - and could eventually render git's performance edge marginal.

The (in)famous Patch Calculus

Patch Calculus has to be one of the most frighteningly named terms used in revision control systems today. It screams "Maths to University Level required to understand".

But let's throw away the awful term and describe it in plain Geek. Basically it's all about ways of determining, from a set of possibly unrelated patches, which extra patches are required for any given "cherry pick". I much prefer terms like Patch Dependency to refer to this set of concepts. Even darcs' term patch commuting could be better called patch re-ordering.

The theory goes like this. If you are trying to get a specific change from a tree, then quickly work out by examining the history which other changes are required first, and so add all of those patches to your tree.

The general finding from this technology is that it is useful, but it opens a big can of worms. In essence, the version control system is tracking not only the history that you recorded, but also all the different paths through the patches you have made along which history might have successfully progressed. And on any code base, simple metrics such as "does this patch apply cleanly" cannot be relied upon to verify whether or not two changes are actually interdependent.

So, what some developers do is manually mark which patches are predecessors to the next patch that they make. Even more enlightened developers use metrics such as whether or not the changed code still compiles successfully, or even passes the regression test suite, to consider changes dependent.

Whether patch dependency works or not in practice depends on whether or not developers create commits of a high enough standard that they co-operate with this feature.

Interactive Commit

I didn't talk about this much in the SVK section despite SVK having this feature, mainly because darcs is where the feature came from in the first place.

Basically, the way it works is that when you record changes, you are presented with a summary of the changes, then asked to group each change, hunk by hunk, into bundles which become darcs patches.

This is largely how it is possible for the patch calculus to work so well - if changes to a single file are combined into a single commit, as so frequently happens with file-grained commits in other VCSes, it entwines the two features being worked on and makes them co-dependent. The better the tool is at keeping changes small and tidy, the better the result - but if changes are too small, the reverse happens: every feature is considered to be its own independent change.

¹ - And now darcs is a git porcelain, too

With the arrival on the scene of darcs-git, a git porcelain with the UI of darcs, I have access to the interactive commit interface of darcs already.

I don't miss patch dependency, because it is easily - and I would add, less confusingly - performed with git using topic branches (making a new branch for each new feature or stream of development), and the powerful tools of rebasing and cherry picking.

01:33 AM

What people Love about their VCS - Part 3 of 4. git

It is clear that the earlier posts in this series are light on detail, little more than teasers, whereas this post goes into much detail on each new feature. For this bias I offer no apology. There is no mistaking that within the period of one year, I have gone from being an outspoken SVK advocate to extolling the virtues of the content filesystem, git. And I am not alone.

Content Addressable Filesystem

There are many good reasons that super-massive projects like the Linux Kernel, XFree86, Mozilla, Gentoo, etc are switching to git. This is not just a short term fad, git brings a genuinely new (well, stolen from monotone) concept to the table - that of the content addressable filesystem.

In this model, files, trees of files, and revisions are all hashed with a specially pre-seeded SHA1 to yield object identifiers that uniquely identify (to the strength of the hashing algorithm) the type and contents of the object. The full ramifications of this take some time to realise, but include more efficient delta compression¹, algorithmically faster merging, less error-prone file history detection², but chiefly, much better identification of revisions. All of a sudden, it does not matter which repository a revision comes from - if the SHA1 object ID matches, you have the same object, so the system distributes naturally, with no requirement for URIs or surrogate repository UUIDs and revision numbers.
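
To make the hashing concrete, here is a small Perl sketch of how a blob's object ID is derived - the "pre-seeding" is just the object's type and size prefixed to its content before hashing:

use Digest::SHA qw(sha1_hex);

# compute the git object ID for a blob: SHA1 of "blob <size>\0<content>";
# assumes byte-oriented content (no wide characters)
sub git_blob_id {
    my ($content) = @_;
    return sha1_hex('blob ' . length($content) . "\0" . $content);
}

# the empty blob - git reports the same well-known ID for it
print git_blob_id(''), "\n";   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391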

Being content-keyed also means you are naturally transaction-safe. In terms of the core repository, you are only ever adding new objects. So, if two processes try to write to the same file, this will succeed because it means that they are writing the same contents.

It also makes cryptography and authentication easy - you can sign an entire project and its revision history just by signing text including a commit ID. And if you recompute the object identifiers using a stronger hash, you have a stronger guarantee.

The matter of speed

The design of the git-core implementation makes very efficient use of the operating system. People might scoff at this as a key feature, but consider this performance comparison;

SVK:

wilber:~/src/git$ time svk sync -t 11111 /pugs/openfoundry
Syncing http://svn.openfoundry.org/pugs
Retrieving log information from 1 to 11111
Committed revision 2 from revision 1.
Committed revision 3 from revision 2.
Committed revision 4 from revision 3.
  [...]
Committed revision 11110 from revision 11109.
Committed revision 11111 from revision 11110.
Committed revision 11112 from revision 11111.

real    227m36.096s
user    3m47.281s
sys     5m0.577s

That's 13,656 seconds to mirror 11,111 revisions.

Compare that to git:

wilber:~/src$ time git-clone git://git.kernel.org/pub/scm/git/git.git git.git
Checking files out...
100% (688/688) done

real    1m54.932s
user    0m2.825s
sys     0m0.468s

That was 115 seconds to mirror 6,511 revisions. The key bottleneck was the network, which was saturated for almost all of the command's execution time - not a laborious, revision-by-revision dialogue due to a server protocol that just didn't seem to think people might want to copy entire repositories³. The server protocol simply exchanged a few object IDs, then, using the merge base algorithm to figure out which new objects are required, it generated a delta-compressed pack that gives you just the new objects you need. So, git does not suffer from high-latency networks in the same way that SVN::Mirror does.

But it's not just the server protocol which is orders of magnitude faster. Git commands overall execute in incredibly short periods of time. The reason for this speed isn't (just) because "it's written in C". It's mainly due to the programming style - files are used as iterators, and iterator functions are combined together by way of process pipelines. As the computations for these iterator functions are all completely independent, they naturally distribute the processing, and UNIX with its pipe buffers was always designed to make mincemeat of this kind of highly parallel processing task.

There is a lot to be learnt from this style of programming; generally the habit has been to try to avoid using unnecessary IPC in programs in order to make best use of traditional straight line CPU performance, where task switching is a penalty. Combining iterative programming with real operating system filehandles can bring the potential of this speed enhancement to adequately built iterative programs. I expect it will only be a matter of time before someone will produce a module for Perl 6 that will automatically auto-thread many iterative programs to use this trick. Perhaps one day, it will even be automatic.
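
In Perl terms, the same trick is as simple as opening a pipe and reading it lazily; a minimal sketch, assuming a git repository in the current directory:

# stream revision IDs from git without slurping the whole history;
# the child process runs in parallel and the pipe buffer does the pacing
open my $revs, '-|', 'git', 'rev-list', 'HEAD'
    or die "can't run git rev-list: $!";

my $count = 0;
while (my $sha1 = <$revs>) {
    chomp $sha1;
    $count++;                  # or feed each ID to the next stage here
}
close $revs;
print "$count revisions\n";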

But that aside, we have yet to touch on some of the further ramifications of the content filesystem.

Branching is more natural

Branches are much more natural - instead of telling the repository ahead of time when you are branching, you simply commit. Your commit can never be invalid (there is no "transaction out of date" error) - if you commit in a different way to somebody else, then you have just branched the development.

Branches are therefore observed, not declared. This is an important distinction, but is actually nothing new - it is the paradigm shift that was so loudly touted by those arch folk who irritatingly kept suggesting that systems like CVS were fundamentally flawed. Beneath the bile of their arguments, there was a key point of decentralisation that was entirely missed by the Subversion design. Most of the new version control systems out there - bazaar-NG, mercurial, codeville, etc have this property.

Also, the repository itself is normally kept alongside the checkout, in a single directory at the top called .git (or wherever you point the magic environment variable GIT_DIR at - so you can get your 'spotless' checkouts, if you need them). As the files are saved in the repository compressed via gzip and/or delta compressed into a pack file, with filenames that are essentially SHA1 hashes, the 'grep -r' problem that Subversion and CVS suffered from is gone.

It means that you can explain that to make a branch, you can just copy the entire checkout+repository:

$ cp -r myproject myproject.test

Not only that, but you can combine repositories back together just by copying their objects directories over each other.

$ cp -ru myproject.test/.git/objects myproject/.git/
$ git-fsck-objects
dangling commit deadbeef...
$ git-update-ref refs/heads/test deadbeef

Now, that's crude and illustrative only, but these sorts of characteristics make repository hacks more accessible. Normally you would just fetch those revisions:

$ git-fetch ../myproject.test test:refs/heads/test

Merges are truly merges

Unlike in Subversion, the repository itself tracks key information about merges. When you use `svn merge', you are actually copying changes from one place in the repository to another. Git does support this, but calls it "pulling" changes from one branch to another. The difference is that a merge (by default) creates a special type of commit - a merge commit that has two parents (a "parent" is just a SHA1 identifier to the previous commit). Thus, the two branches are truly converged, and if the maintainer of the other branch then pulls from the merged branch, they're not just identical - they are the same branch. Merge base calculations can just look at two commit structures, and find the earliest commits that the two branches have in common.

To compare the model of branching and merging to databases and transactional models, the Subversion model is like auto-commit, whereas distributed SCM such as git provides is akin to transactions, with the diverged branch's commits being like SQL savepoints, and merges being like full "commit" points.

"Best of" merging - cherry picking

There is also the concept of piecemeal merging via cherry picking. One by one, you can pluck out individual changes that you want instead of just merging in all of the changes from the other branch. If you later pull the entire branch, the commits which were cherry picked are easily spotted by matching commit IDs, and do not need to be merged again.

The plethora of tools

Another name for git is the Stupid content tracker. This is a reference to the fact that the git-core tools are really just a set of the small "iterator functions" that allow you to build 'real' SCMs atop it. So, instead of using the git-core - the "plumbing" - directly, you will probably be using a "porcelain" such as Cogito, (h)gct, QGit, Darcs-Git, Stacked Git, IsiSetup, etc. Instead of using git-log to view revision history, you'll crank up Gitk, GitView or the curses-based tig.

The huge list of tools which interface with git already are a product of the massive following that it has received in its very short lifetime.

The matter of scaling

The scalability of git can be grasped by browsing the many Linux trees visible on http://kernel.org/git/. In fact, if you were to combine all of the trees on kernel.org into one git repository, you would measure that the project as a whole has anywhere between 1,000 and 4,000 commits every month. Junio's OLS git presentation contains this and more.

In fact, for a laugh, I tried this out. First, I cloned the mainstream linux-2.6 tree. This took about 35 minutes to download the 140MB or so of packfile. Then I went through the list of trees, and used 'git fetch' to copy all extra revisions in those trees into the same repository. It worked, taking between a second and 8 minutes for each additional branch - and while I write this, it has happily downloaded over 200 heads so far - leaving me with a repository with over 40,000 revisions that packs down to only 200MB. (Update: Chris Wedgwood writes that he has a revision history of the Linux kernel dating all the way back to 2.4.0, with almost 97,000 commits, which is only 380MB)

Frequently, scalability is reached through distribution of bottlenecks, and if the design of the system itself eliminates bottlenecks, there is much less scope for overloaded central servers like Debian's alioth or the OSSF's svn.openfoundry.org to slow you down. While Subversion and SVK support "Star" and "Tree" (or hierarchical) developer team patterns, systems such as git can truly, both in principle and in practice, be said to support meshes of development teams. And this is always going to be more scalable.

Revising patches, and uncommit

The ability to undo, and thus completely forget, commits is sometimes scorned, as if it were "wrong" - that version control systems Should Not support such a bad practice, and therefore that having no way to support it is not a flaw but a feature. "Just revert", they will say, and demand to know why you would ever want such a hackish feature as uncommit.

There is a point to their argument - if you publish a revision then subsequently withdraw that revision from the history without explicitly reverting it, people who are tracking your repository may also have to remove those revisions from their branches before applying your changes.

However, this is not an insurmountable problem when your revision numbers uniquely and unmistakably identify their history - and when you are working on a set of patches for later submission, it is actually what you want. In the name of hubris, you only care to share the changes once you've made each of them able to withstand the hordes of Linux Kernel Mailing List reviewers (or wherever you are sending your changes, even to an upstream Subversion repository via git-svn).

In fact, the success of Linux kernel development can also be attributed in part to its approach of only committing to the mainline kernel, patches that have been reviewed and tested in other trees, don't break the compile or add temporary bugs, etc. As they are refined, the changes themselves are modified before they are eventually cleared for inclusion in the mainline kernel. This stringent policy allows them to do things such as bisect revisions to perform a binary search between two starting points to locate the exact patch that caused a bug.

Before git arrived, there were tools such as Quilt that managed the use case of revising patches, but they were not integrated with the source control management system. These days, Patchy Git and Stacked Git layer this atop of git itself, using a technique that amounts to being commit reversal. In fact, the reversed commits still exist - it's just nothing refers to them - they can still be seen by git-fsck-objects before the next time the maintenance command git-prune is run.

So, Stacked Git has a command called uncommit that takes a commit from the head and moves it to your patch stack, refresh to update the current patch once it has been suitably revised, a pair of commands push and pop to wind the patch stack, a pick command to pluck individual patches from another branch, and a pull command that picks entire stacks of patches, which is called "rebasing" the patch series. And of course, being a porcelain only, you can mix and match the use of stgit with other git porcelain.

Far from being "so 20th century", patches are a clean way to represent proposed changes to a code base that have stood the test of time - and a practice of reviewing and revising patches encourages debate of the implementation and makes for a tidier and more tracable project history.

The polar opposite to reviewing every patch - a single head that anyone can commit to - is more like a Wiki, and an open-commit-policy Subversion server suits this style of collaboration well enough. There is no "better" or "more modern" between these two choices of development styles - each will suit certain people and projects better than others.

Of course, those tools that made distributed development a key tenet of their design make the distributed pattern more natural, and yet it is just as easy for them to support the Wiki-style development pattern of Subversion.

In fact there are no use cases for which I can recommend Subversion over git any more. In my opinion, those that attack it on the grounds of "simplicity" (usually on the topic of the long, though able to be abbreviated, revision numbers) have not grasped the beauty of the core model of git.

Footnotes:

Many people, especially those with time, effort and ego invested in their own VCS, judged the features of git in very early days. Without being able to see where it would be today, they each gave excuses as to why this new VCS git offered their users less functionality. So, a lot of FUD exists, a few points of which I address here.

  1. git does do delta compression to save space (as a separate step)
  2. git can track renames of files, though it does not record this in the meta-data, and pragmatically the observation is that this is, generally speaking, just as good as, if not better than, tracking them with meta-data.
  3. git is not forced to hold the entire project history, it is quite possible to have partial repositories using grafts, though this feature is still relatively new and initial check-outs cannot easily be made grafts. Patches welcome ;-).