All the Perl that's Practical to Extract and Report

use Perl

Alias
http://ali.as/

Journal of Alias (5735)

Thursday July 02, 2009
01:18 AM

The impending death of BZip2

adam@svn:~/svn.ali.as/db$ ls -l
total 30884
-rw-r--r-- 1 adam adam 9558294 Jul  2 03:45 cpandb.gz
-rw-r--r-- 1 adam adam 8538979 Jul  2 03:45 cpandb.bz2
-rw-r--r-- 1 adam adam 5960155 Jul  2 03:45 cpandb.lz
-rw-r--r-- 1 adam adam 3014480 Jun 30 06:46 cpanmeta.gz
-rw-r--r-- 1 adam adam 2658756 Jun 30 06:46 cpanmeta.bz2
-rw-r--r-- 1 adam adam 1825600 Jun 30 06:46 cpanmeta.lz

Wednesday July 01, 2009
12:17 PM

CPANDB 0.02 - Now we're starting to get somewhere

On the back of my improved and high-coverage CPAN::Mini::Visit and Archive::* fixes, I've finally managed to build a complete-coverage version of CPANDB.

CPANDB is a merged and cleaned up schema that combines the CPAN index, the "CPAN Uploads" database (for PAUSE upload dates), and both class and distribution level dependency information held in META.yml files (replacing the CPANTS dependency graph).

To take a look at it, you can grab a copy of the SQLite database directly from the following URL.

http://svn.ali.as/db/cpandb.gz

The data sources used to generate it are not perfectly time-synced yet, so I expect to see a few minor flaws for another release or two. But compared to everything else available (from pretty much everybody) this should be a significant improvement.

As well as clearing up the last tiny data quality issues, I'm also yet to merge in the rt.cpan.org database (which is almost ready) and the CPAN Ratings database (which is a text file I really don't want to have to parse).

But don't let this stop you trying it out now (I've appended the schema to the bottom of this post so you can get a clearer idea of what's in there).

As usual, feedback is welcome.

CREATE TABLE author (
        author TEXT NOT NULL PRIMARY KEY,
        name TEXT NOT NULL
);

CREATE TABLE distribution (
        distribution TEXT NOT NULL PRIMARY KEY,
        version TEXT NULL,
        author TEXT NOT NULL,
        release TEXT NOT NULL,
        uploaded TEXT NOT NULL,
        FOREIGN KEY ( author ) REFERENCES author ( author )
);

CREATE TABLE module (
        module TEXT NOT NULL PRIMARY KEY,
        version TEXT NULL,
        distribution TEXT NOT NULL,
        FOREIGN KEY ( distribution ) REFERENCES distribution ( distribution )
);

CREATE TABLE requires (
        distribution TEXT NOT NULL,
        module TEXT NOT NULL,
        version TEXT NULL,
        phase TEXT NOT NULL,
        PRIMARY KEY ( distribution, module, phase ),
        FOREIGN KEY ( distribution ) REFERENCES distribution ( distribution ),
        FOREIGN KEY ( module ) REFERENCES module ( module )
);

CREATE TABLE dependency (
        distribution TEXT NOT NULL,
        dependency TEXT NOT NULL,
        phase TEXT NOT NULL,
        PRIMARY KEY ( distribution, dependency, phase ),
        FOREIGN KEY ( distribution ) REFERENCES distribution ( distribution ),
        FOREIGN KEY ( dependency ) REFERENCES distribution ( distribution )
);
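
As a rough idea of how the schema might be queried from Perl, the sketch below asks which distributions directly depend on a given one. It assumes DBI and DBD::SQLite are installed and that the downloaded file has been gunzipped. The helper name direct_dependents is invented for the illustration, and the schema above does not say which values the phase column actually holds, so the optional phase filter is an assumption.

```perl
use strict;
use warnings;
use DBI ();

# Hypothetical helper: list the distributions that directly depend on $dist.
# $phase is optional; the values it can take are not stated in the schema.
sub direct_dependents {
    my ( $dbh, $dist, $phase ) = @_;

    my $sql  = 'SELECT DISTINCT distribution FROM dependency WHERE dependency = ?';
    my @bind = ($dist);
    if ( defined $phase ) {
        $sql .= ' AND phase = ?';
        push @bind, $phase;
    }
    $sql .= ' ORDER BY distribution';

    # selectcol_arrayref returns the first column of every row
    return @{ $dbh->selectcol_arrayref( $sql, {}, @bind ) };
}
```

Usage would look something like DBI->connect( 'dbi:SQLite:dbname=cpandb.sqlite', '', '', { RaiseError => 1 } ) followed by direct_dependents( $dbh, 'DBD-mysql' ), where both the filename and the exact distribution name are guesses.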

Thursday June 25, 2009
12:19 AM

The Top 100 website identifies its first OMGIBROKECPAN event

Because I've been distracted writing my new CPAN dependency graph generator (to repair the main flaw in the CPAN Top 100 website) I haven't been paying the website a whole lot of attention in the last week or so.

I just run the website data generator once or twice a week, and other than that, I'm focusing on creating the next iteration of the support software.

After updating the data file today, imagine my surprise when a new module absolutely roared ahead of the previous #1 highest score, posting a new record of over 500,000 FAILure points.

It would appear that the June 18th release of DBD::mysql was something of a disaster, and the high number of modules that depend on MySQL means that the impact on end users is very high.

It's good to see that when something like this happens, it is highlighted on the list very quickly and very obviously.

However, in retrospect I'm somewhat disappointed that it took this long to highlight it (because I don't run the updater often enough). What if it only took a while because it needed a few weeks to gradually climb to the top of the list, fighting against high-dependency modules with one or two rare failures that aren't growing so fast?

I wonder what might be done to spot these dramatic failures faster, within a few days of their release, rather than having to wait a week or so.

Even starting to do something like that will probably require a serious improvement to the update rate of the data that feeds the website, and changes to teach the analyser about the concept of time.

Food for thought, certainly.

The question now, of course, is how high a really big OMGIBROKECPAN event might score. Something in the vicinity of several million failure points certainly seems possible, especially if the number of CPAN Testers installs grows.

Wednesday June 24, 2009
07:58 PM

The Swarm - Adding collaborative editing support to Padre

On #padre the other night, a discussion broke out on the subject of collaborative editing (à la MoonEdit, SubEthaEdit, etc.). A number of people had never even heard of it before.

Collaborative Editing, as it is typically implemented, involves "connecting" your editor to someone else's editor. Instead of opening the files for local editing, your typing events and such are transmitted to the original host in real-time.

The effect is that two or more people can edit the same document in real-time, everyone with their own individual syntax highlighting, key bindings, preferences and plugins active.

After establishing that all the standalone libraries for doing collab (like libgobby) weren't going to be appropriate, we started pondering how we might implement it ourselves, and what features OTHER than pure collaborative editing might be handy.

Some of the things you could probably do quite easily are to "steal" a copy of an open unsaved file from someone else's Padre, open a file for remote viewing only (think live remote code reviews), or allow remote users to take actions for code that can only be run on one physical system (due to exotic hardware etc).

Then there's the whole social aspect to explore. Putting your Padre into "Conference Mode" might enable all sorts of curious effects (like automatically notifying you if someone else on the local network is editing a CPAN distribution that you also have a checkout of) and let you do different things compared to collaborating with someone who is truly physically remote. Perhaps it might allow other conference attendees to hijack your editor so they can run module test suites on platforms they don't have... the possibilities for sharing development effort in a more intimate environment are quite extensive.

There are all sorts of different things that you might do, and all sorts of different methodologies you could apply to make them happen. Multicast, broadcast, UDP, things like Bonjour.

And just as useful for Padre is that doing collaborative editing would require significant improvements in the cleanliness of Padre's internals. Before we could make half this stuff work, Padre would need to have a much more robust and sophisticated understanding of the ways in which concepts align and mix together across the Main Window, Editor Panels, Documents and Projects.

Having permanently running clients and servers would drive improvements in Padre's Task API and background processing capabilities, and would require improvements in threading and IPC.

Until this notional day in the future when we actually know how to implement all of this in Padre safely and sanely, I've created a new Padre::Plugin::Swarm plugin to serve as a vessel for all the insane test cases.

"Padre Swarm" is intended to be a container for experimental work and shiny demonstrations. It will intentionally ignore stuff like robustness, scalability and security.

Swarm gives people a chance to show off the kinds of shiny toys that might be possible when you don't have to care that implementing it means allowing arbitrary remote code execution by the entire world, and that it only works on multicast-capable networks, and that in the process the Single Instance server doesn't work any more and Padre crashes if you try to refresh the Outline panel.

As usual, if you find this interesting or exciting, I strongly encourage you to drop in at #padre and have a chat. If you like what is happening, we can get you commit access and everything else from there.

(As an aside, Padre is now sitting at around 20 active contributors. We now only need another 50% increase in contributors to catch and exceed the number of active contributors on Emacs) :)

Wednesday June 17, 2009
12:02 AM

Request for Assistance - The most important toolchain bug

Now that my blog is listed on the Ironman planet, Matt Trout has loaned me his chainsaw and suggested that asking for help here is likely to garner a response, because he says so.

So consider this an official request for assistance to help some overloaded developers fix what I consider to be the most important bug in the toolchain right now.

It's a bug in Archive::Extract, and it's probably not that much work, but neither Jos Boumans nor I have the free time right now to fix it.

The bug in question is that when Archive::Extract uses Archive::Tar to unroll a tarball, it uses the wrong API. Instead of using the memory-efficient streamed extraction API to roll the whole tarball out to disk directly, it instead loads the whole thing into memory and unpacks it from there.

It should probably use code similar to the implementation of Ivor's Archive::Tar::Streamed instead.

http://search.cpan.org/~ivorw/Archive-Tar-Streamed-0.03/
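
For anyone thinking of picking this up, the streamed approach can be sketched roughly as follows. This is only an illustration, not the actual fix: it assumes an Archive::Tar recent enough to document the iter class method, and extract_streamed is a made-up helper name.

```perl
use strict;
use warnings;
use Archive::Tar ();
use Cwd          ();
use File::Spec   ();

# Hypothetical helper: unpack $tarball into $dest one entry at a time,
# instead of loading the whole archive into memory first.
sub extract_streamed {
    my ( $tarball, $dest, $compressed ) = @_;

    # Resolve the path up front, because we chdir before extracting
    $tarball = File::Spec->rel2abs($tarball);

    # iter() returns a closure that yields one Archive::Tar::File per call
    my $next = Archive::Tar->iter( $tarball, $compressed )
        or die "Cannot read $tarball: " . Archive::Tar->error;

    my $cwd = Cwd::getcwd();
    chdir $dest or die "Cannot chdir to $dest: $!";

    my $count = 0;
    while ( my $file = $next->() ) {
        if ( $file->extract ) {
            $count++;
        }
        else {
            warn "Failed to extract " . $file->full_path . "\n";
        }
    }

    chdir $cwd or die "Cannot chdir back to $cwd: $!";
    return $count;
}
```

Calling extract_streamed( 'Some-Dist-1.00.tar.gz', '/tmp/unpack', 1 ) would then write each entry to disk as it is read. Whether Archive::Extract should do it this way, or along the lines of Archive::Tar::Streamed, is a design decision for whoever takes on the bug.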

This is a big problem because once all the memory inflation and memory copying needed for this loading has happened, a couple of big pathological distributions on CPAN consume almost the entire 2gig memory limit of the (32-bit) process.

This bug is making the performance of CPAN on Win32 much worse and memory-bloaty, but worse still is that it takes CPAN::Mini::Visit over the process limit and crashes it. That means this bug is currently blocking work on the GreyPAN scanner experiments (Perl::Metrics2), the META.yml database ORDB::CPANMeta, the permissions-aware replacement for the rather unreliable CPANTS dependency graph (CPANTS::Weight), my unified CPANDB SQLite index, and the sorely-needed accuracy fixes for the Top 100 website.

Improving almost all these things requires both accurate and 100% complete coverage of minicpan in order to give answers that are good enough to swap out the original first-generation implementations, and this one relatively approachable bug is preventing us from reliably reaching 100% coverage.

Because this bug also disproportionately impacts Win32 and is in a core module, it is also very important for the July release of Strawberry Perl, as well as the Perl 5.10.1 release.

If anyone out there has a few hours to attack this bug and fix it, your efforts will have a huge knock-on effect on the quality of many other parts of the CPAN ecosystem.

If you are able to help us out, you can find Jos (kane), myself (Alias), or others who can point you in the right direction in #toolchain on irc.perl.org.

Tuesday June 16, 2009
01:47 AM

Reducing your State when you aren't as smart as Yuval

Yuval Kogman is one of those people who are not just smart, but smart in the crazy math way that I simply don't connect with very easily.
Like Audrey Tang and the other math-smart lambdacamels, when they speak on a topic they inevitably end up using language that non-math people like me have trouble relating to (my skills lie more in pattern recognition and iterative/emergence stuff).

So when he talks about immutability I certainly agree in principle. It's just that when the wisdom is laden with "functional purity", "monad", "zipper", "Functor" and "STM" (that's Software Transactional Memory for the non-lambdas) understanding the issue generally (and how to apply it to your CURRENT practices) can be challenging.

Hopefully Yuval will forgive me if I try to summarise his three posts in one sentence for the ordinary humans among us :)

State Is The Enemy

State is any data that persists in place over time, is referenced from outside the context it was created, and can change in the future.

That summarised description was stolen from one of the creators of Erlang, who realised very early in the language design process that State is what kills reliability, and that State is what kills concurrency. In creating Erlang, they specifically set out to kill off State in a way that was practical for long-running (years without shutdown) real-world applications (something Haskell seems to struggle to achieve, being written almost entirely by mathematicians).

Erlang and Haskell take the fight against State to the extreme, but if you compromise productivity for purity you will fail. You simply cannot fight economics and win in the long term, so the fight against State needs to be tempered by your ability to actually get things done. For most people, compromise is a necessity and true immutability is a luxury.

That said, there are a number of cheap and easy ways you can alter your existing practices.

1. All accessors should be readonly by default

One of the things that DBI got right was that the default way to use variables was the safe way (placeholders). Everything was documented that way, and everyone was expected to follow those rules. PHP defaulted in the opposite direction (you had to do extra work to be safe). The difference in default behaviour creates a massive difference in the safety of typical Perl SQL code and typical PHP code.

In contrast, the thing that Class::Accessor (and all its derivatives) got wrong was to make all your accessors readwrite by default, regardless of whether or not that was actually safe. You had to do extra work (discover, learn about and then apply mk_ro_accessors instead) to go with the low-state option.

If you took a Class::Accessor object as a parameter to some function, you not only had to check it WAS an object of the type you wanted, but to be safe you also had to validate that the individual properties were safe values.

Being hard work, most people don't do it properly, resulting in buggy code (or bloaty code if you actually do it properly).

By making ALL the accessors for your objects readonly by default, you are forced to do all your consistency checks at construction time. If any accessor needs to be writable, you have to take an extra step to make it writable and then write the extra code to ensure that the change to the attribute does not send the object into an illegal state.

By forcing yourself to take an extra step to allow an attribute to change, you automatically gain a very powerful guarantee.

2. Every object is a legal object.

Anything sub-classing your code, or taking your objects as parameters, has only to validate that it IS an object of that type. They can have complete trust that the object will do what it is supposed to, and even if they don't understand the need to trust (or just forget to check) their inputs, they are safe anyway.

And because objects are ALWAYS correct, you can stuff them into a Storable, send them over the network, hand them off to secondary processes, and nowhere in any of this do you have to do validity checking (unless State rears its head and the code is a different version).

The simplicity and safety you gain by making these simple changes also means that your code is smaller and runs faster, which is why Object::Tiny was able to be smaller in size and significantly faster than all other object builders (and why Object::Tiny::XS continues to be faster). Everything is readonly and all objects are correct, and so all the extra work needed to deal with state just falls away.
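
To make points 1 and 2 concrete, here is a minimal hand-rolled sketch. The My::Range class and its min/max attributes are invented purely for illustration, and this is not how Object::Tiny or any other particular module is implemented.

```perl
use strict;
use warnings;

{
    package My::Range;    # hypothetical example class

    use Carp         ();
    use Scalar::Util ();

    # All consistency checks happen once, at construction time
    sub new {
        my ( $class, %args ) = @_;
        for my $name (qw{ min max }) {
            Carp::croak("'$name' must be a number")
                unless Scalar::Util::looks_like_number( $args{$name} );
        }
        Carp::croak("'min' must not exceed 'max'")
            if $args{min} > $args{max};
        return bless { min => $args{min}, max => $args{max} }, $class;
    }

    # Readonly accessors: any argument passed in is simply ignored
    sub min { $_[0]->{min} }
    sub max { $_[0]->{max} }

    # A "change" builds a new object, which is validated like any other
    sub widen {
        my ( $self, $by ) = @_;
        return ref($self)->new(
            min => $self->{min} - $by,
            max => $self->{max} + $by,
        );
    }
}
```

Code that receives a My::Range only has to check that it IS a My::Range. Since nothing can change min or max after construction, min <= max holds for every object that exists (short of someone poking at the hash directly).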

3. Reduce your global variables to a minimum

Every global is State, and since State is the enemy it's important to try and find ways to remove them.

This does NOT mean you convert them to package variables and put setter methods around them. That's just a global variable with validated content.

Instead you need to do one of the following.

a) Move the State into your instance objects (or even your singleton/default object) so that the State is localised inside the encapsulation and State accessible from outside the encapsulation is reduced by one.

b) If the global is only for test script usage, leave it undocumented so that the impact of the State is reduced and largely contained to the test scripts.

c) Lock in the value and make it immutable at compile time.

This last option is my favourite trick for debugging and for hacks that exist specifically to support the test scripts.

You use code that looks something like this.

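package Foo;
 
use vars qw{$DEBUG};
BEGIN {
    # Respect a value set before this module was loaded, otherwise default to off
    $DEBUG = 0 unless defined $DEBUG;
}
# Freeze the value at compile time so "if DEBUG" code can be optimised away
use constant DEBUG => !! $DEBUG;
 
sub foo {
    debug('In sub foo') if DEBUG;
 
    ...
}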

If nobody else has a better name for this trick, I'll happily take the name "Compiled Global" to describe it.

Now you can have full debugging support in your module for privileged consumers (the author or test scripts) but ONLY if you set $Foo::DEBUG before the module is loaded.

For everyone else, the state is removed (because after compile time $DEBUG is not State, just a meaningless junk variable with no impact) and as a bonus all the debugging code is compiled out, making the run-time speed and memory cost of your debugging code zero.
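
The load-order requirement can be seen in a minimal single-file sketch. The debug helper and the @LOG array are stand-ins invented for the illustration (the snippet above does not define them), and in real use Foo would live in its own file and be loaded with "use Foo" after the BEGIN block.

```perl
use strict;
use warnings;

# A privileged consumer (a test script, say) sets the flag at compile
# time, before the Foo code below has been compiled.
BEGIN { $Foo::DEBUG = 1 }

{
    package Foo;

    use vars qw{$DEBUG @LOG};
    BEGIN {
        $DEBUG = 0 unless defined $DEBUG;
    }
    use constant DEBUG => !! $DEBUG;

    # Hypothetical helper: collect messages instead of printing them
    sub debug { push @LOG, $_[0] }

    sub foo {
        debug('In sub foo') if DEBUG;
        return 1;
    }
}

# Too late: the constant was locked in at compile time
$Foo::DEBUG = 0;
Foo::foo();
```

After this runs, @Foo::LOG holds the one message, even though $Foo::DEBUG was reset to 0 before the call. Remove the first BEGIN block and the debug call is compiled out entirely.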

The use of the Readonly module I consider evil, because while it does prevent the variable being written to, it doesn't (as far as I'm aware) give information to the compiler so it can take advantage of the immutability (I could be wrong here).

Sunday June 14, 2009
10:29 PM

Politics::AU::Geo and the Nerds for Democracy Hackathon

After many years of being less than progressive, it looks like Government in Australia is finally going through the process of being Nerdified.

This is a much needed change, because of Australia's highly participatory democracy.

We have no equivalent of the US Bill of Rights, and for many issues like freedom of speech, assembly, and so on there is nothing in our Constitution in the same way as there is in other countries.

This means that issues around very contentious subjects are usually resolved by the political process, rather than by lawyers in funny wigs. There is always a lot of lobbying of the government by non-profits and community groups, and the lobbying often has a sense of urgency and immediacy.

And the size of our country (21 million) puts us in a range high enough to expect quality governance, but just small enough that it is reasonably plausible that a common person who takes up a particular subject with drive can be heard and provide input on a subject.

Once data is exposed by the various governments, this wide-ranging hunger for information creates fertile ground for creating software for tracking, integrating and analysing the process of government. The Australian mentality of getting things done in low-cost and low-headcount ways helps feed this further.

This community appears to be making the phase change from scattered efforts to an organised community, centred around the community-based Open Australia website.

On Saturday I attended the first ever Open Australia Hackathon (i.e. "Nerds for Democracy") which was as much about letting different parts of the community meet each other as it was about actually getting anything done on the Open Australia website itself.

My good friend and former business partner Jeffery Candiloro took the opportunity to release his pet project http://myrepresentatives.org/. This website is a demonstration of combining geo-coding with politics.

You can put in any address in Australia (including some of the more obscure island territories) and the website resolves the address to a Geo point (using Google Maps), applies the Geo point to a polygon database to find the electorates you are part of, and then uses the list of electorates to find the people that represent each of the electorates for your address.

The result is pretty awesome.

What I've been helping him with is taking this first-pass implementation and converting the code into proper CPAN distributions, so that it can be integrated into other people's websites as well.

The result is a new Politics:: top level namespace, and a new Politics::AU::Geo module (which Jeffery should own, but I ended up doing the first release of) that you can use yourself.

By reusing all the things I learned with the ORDB:: modules, this small module implements the Geo resolver itself, but fetches (and caches) the polygon data as a SQLite database from the myrepresentatives.org website as needed.
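
The "apply the Geo point to a polygon database" step is the interesting part. A generic way to do that kind of lookup is a ray-casting point-in-polygon test, sketched below. This is not the Politics::AU::Geo code or API: the function names and the shape of the %electorates data are assumptions made for the example.

```perl
use strict;
use warnings;

# Ray casting: count how many polygon edges a horizontal ray from the
# point crosses. An odd count means the point is inside.
# $polygon is a reference to a list of [ x, y ] vertex pairs.
sub point_in_polygon {
    my ( $x, $y, $polygon ) = @_;
    my $n      = @$polygon;
    my $inside = 0;
    for my $i ( 0 .. $n - 1 ) {
        my ( $x1, $y1 ) = @{ $polygon->[$i] };
        my ( $x2, $y2 ) = @{ $polygon->[ ( $i + 1 ) % $n ] };

        # Skip edges that do not straddle the point's y coordinate
        next if ( $y1 > $y ? 1 : 0 ) == ( $y2 > $y ? 1 : 0 );

        # x position where the edge crosses the horizontal line through y
        my $cross = $x1 + ( $y - $y1 ) * ( $x2 - $x1 ) / ( $y2 - $y1 );
        $inside = !$inside if $x < $cross;
    }
    return $inside ? 1 : 0;
}

# Hypothetical lookup: names of all electorates containing the point,
# given a hash of electorate name => polygon.
sub electorates_for_point {
    my ( $x, $y, $electorates ) = @_;
    return sort grep { point_in_polygon( $x, $y, $electorates->{$_} ) }
        keys %$electorates;
}
```

Treating longitude and latitude as flat x and y is a simplification that is usually acceptable at electorate scale, and a real polygon database would also need bounding-box or spatial indexing to avoid testing every polygon.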

The data is rather crude at the moment, and the real challenge we are facing is that there are no universal identifiers for any of the concepts involved.

I'm hoping to kick off a conversation amongst all the different parts of the community on the creation of RDF-quality identifiers for all the major concepts (houses, electorates, members, et al) that all the disparate code (in several different languages) can operate on.

Thursday June 11, 2009
10:45 AM

Source code should look professional...

Friday June 05, 2009
10:40 AM

Padre Single Instance now correctly foregrounds on Win32

Of course, I'm not at all proud of what I had to do to get there, but after around 20 minutes per line of code trawling through MSDN, I finally managed to find a combination that works.

The single instance server now immediately sends the PID on connect (leading space padded to specifically 10 bytes) and the client does the following.

my $pid  = '';
my $read = $socket->sysread( $pid, 10 );
if ( defined $read and $read == 10 ) {
    # Got the single instance PID; strip the leading space padding
    $pid =~ s/^\s+//;
    if ( Padre::Util::WIN32 ) {
        # Allow the process with that PID to bring its window to the foreground
        require Win32::API;
        Win32::API->new(
            'User32.dll',
            'AllowSetForegroundWindow',
            'N', 'L',
        )->Call($pid);
    }
}
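
The fixed-width PID field itself is easy to get subtly wrong, so here is a small standalone sketch of the encoding and decoding halves. The encode_pid and decode_pid names are invented for the illustration and are not part of Padre.

```perl
use strict;
use warnings;

# Server side: left-pad the PID with spaces to exactly 10 bytes
sub encode_pid {
    my $pid = shift;
    return sprintf( '%10s', $pid );
}

# Client side: return the PID, or undef if the field is malformed
sub decode_pid {
    my $field = shift;
    return undef unless defined $field and length($field) == 10;
    ( my $pid = $field ) =~ s/^\s+//;
    return $pid =~ /^\d+$/ ? $pid : undef;
}
```

A fixed-width field means the client knows exactly how many bytes to ask sysread for, without needing a delimiter or a length prefix.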

Tuesday June 02, 2009
10:49 PM

Releasing CPANDB 0.01 as my 200th module

http://svn.ali.as/cpan/releases/CPANDB-0.01.tar.gz

I've just uploaded a first iteration of a unified CPAN database and matching ORM, based on the index.

According to the unofficial CPAN Leaderboard, this will be my 200th module (either written or taken over, minus abandoned and given away modules).

As nice as it is to prevent ZOFFIX making it there first, I have to say that the number does induce a certain feeling of dread.

Fortunately, it hopefully shouldn't climb much further for a while. My current SQLite release splurge is starting to reach its logical conclusion, and I'm going to try harder to delete or merge older packages (several previous attempts at things like CPANDB can probably start to die now, once they've had any remaining features harvested from them).