Purdy (2383)

Purdy
  {jason} {at} {purdy.info}
http://purdy.info/
AOL IM: EmeraldWarp
Yahoo! ID: jpurdy2

Bleh - not feeling creative right now. You can check me out on PerlMonks [perlmonks.org].

Journal of Purdy (2383)

Friday March 28, 2008
05:00 PM

Robots.txt Tip + Webinale.de

Knocking off some of the dust here... wanted to share two quickies:

I finally figured this out, and it may be beneath you, but let's say you have a web document root that's shared between ports 80 and 443 (iow, http and https go to the same place). Then your site gets spidered by the search engines and they put a bunch of your stuff in a supplemental index because it's redundant. Since there's only one robots.txt file, you can't easily say IF https, then go away w/o saying the same thing to the http version. So what do you do? Create a robots-ssl.txt file and then in your ssl apache configuration, use Alias:

Alias /robots.txt /path/to/robots-ssl.txt

Then http://www.example.com/robots.txt and https://www.example.com/robots.txt have two different contents, while sharing the same web root directory!
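
For the copy-and-paste crowd, here's a minimal sketch of how the pieces might fit together. The ServerName and paths are placeholders, and the robots-ssl.txt contents assume you want crawlers to skip the https copy entirely; adjust to taste.

```apache
# robots-ssl.txt would hold just these two lines, telling every crawler
# to stay away from the whole (https) site:
#   User-agent: *
#   Disallow: /

<VirtualHost *:443>
    ServerName www.example.com
    DocumentRoot /path/to/docroot
    SSLEngine on
    # (certificate directives omitted)

    # Only this vhost remaps robots.txt; the port 80 vhost keeps
    # serving the regular robots.txt from the shared web root.
    Alias /robots.txt /path/to/robots-ssl.txt
</VirtualHost>
```

If robots-ssl.txt lives outside the document root, Apache will likely also need a Directory or Files block granting access to it.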

You probably already knew this, didn't you?

Ok, next up ... in more exciting news, I'm speaking at webinale.de! I submitted 6 talks, 5 technical and 1 marketing ... and wouldn't you know it? The marketing talk is approved. So I'll be speaking on SEO. I've been listening & learning German via the Deutsche Welle podcasts and I'm lurking on #perlde to pick up reading & writing German, too. Thankfully, I'll be able to do my presentation in English (I think it would be torture to listen to my German ;)).

Peace,

Jason

PS: How ironic (ok, English nerds, coincidental ;)) is it that I'm listening to Daft Punk atm?
Thursday November 29, 2007
10:39 AM

The next thing CPAN needs...

I had a recent experience that gave me pause (pun intended!) and inspiration for a helpful CPAN tool: a categorical tabulation of module popularity.

My example: about a year ago (maybe more), I first dabbled into AJAX and went looking for a module that would import/export native Perl objects in JSON format. So a CPAN search pointed me to the JSON module.

Fast forward to yesterday, where I'm pointed to JSON::Syck, which fits my simple needs, but more importantly (and objectively), is faster & more memory efficient.
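
For anyone who hasn't done the Perl-to-JSON round trip before, here's a minimal sketch of what it looks like. Note that it uses JSON::PP (bundled with recent Perls) rather than either module mentioned above, simply so it runs without installing anything; the data structure is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts hash keys, so the output is stable between runs
my $json = JSON::PP->new->canonical;

# a native Perl structure, like what a web app might hand to the browser
my $data = {
    module => 'JSON::Syck',
    tags   => [ 'json', 'ajax' ],
};

# Perl -> JSON text
my $text = $json->encode( $data );
print "$text\n";

# JSON text -> Perl
my $copy = $json->decode( $text );
print "first tag: $copy->{tags}[0]\n";
```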

So I wish there was something where I could search for JSON, CSV, DBI, CGI, etc and then search.cpan.org would recognize that as a category and present objective data that would better direct me (and other developers) to the most popular (and often best) selection.

An initial objection would be flamewars, but if we kept it to pure numbers and tied it to BitCard logins, the numbers would be objective themselves. Of course, people will probably want to make comments and that's where it could get awkward.

Another objection might be that something already exists, whether it be the rating system or this wiki page, but neither fits this need, IMO, basically because a sense of popularity isn't thrown behind the modules.

Another objection might be upstart modules would find it harder to be adopted, but if we allow people to change their "allegiance", upstart modules' new votes would become more substantial quickly.

Maybe even simpler would be a download tracker in CPAN to tell how many times a CPAN module has been downloaded/installed. Then put those #'s in the search results.

Now who's going to put the perspiration behind the inspiration? ;)

Peace,

Jason

PS: I highly recommend Daft Punk Alive 2007 - score it for $9 in MP3 format. It's awesome coding music!
Thursday September 13, 2007
10:11 AM

Math is hard...

especially when it involves dates & times. This random musing comes from upgrading CAP::Session and remembering the pain of writing tests for session_delete.

At this point in our lives, time is intuitive. Computers don't come pre-installed with the human experience, much like children. My oldest daughter (age 3) is currently stuck on everything in the past having happened "yesterday", regardless of whether it actually was yesterday or last month or two minutes ago.

Major props to those CPAN modules that get it right (DateTime being my current favorite) ... it's harder than you think.

I don't suggest those of you who are childless to rush out and get a child, but children do bring a real life analogy to what your computer is like. Now I just need to finish this Potty_Training 2.0 upgrade.

Peace,

Jason
Wednesday January 24, 2007
02:37 PM

This post brought to you by ...

ActiveState has released version 4.0 of their Komodo IDE, which supports multiple languages (Perl, PHP, etc), including TemplateToolkit. It has tons of other features, including vi key bindings and extensive configuration options to make it bend to your will. I've been playing around with the betas for the last few months and the new excitement for me is the capability to edit files remotely over SSH.

They also have XPI support, so we could develop add-ons much like those for Firefox & Thunderbird.

I was surprised that my fellow folks on #cgiapp haven't heard about it, so I wanted to share its goodness here. They also have a free version with some functionality stripped out. If you're using some other editor (Crimson Editor, jEdit, etc), Komodo (Edit) is definitely worth your replacement consideration.

Also to share my latest webdev bounty, make sure you have Firebug. There was a great article in Dr. Dobb's about how it can be used and I've found it priceless when trying to debug javascript & css/layout stuff.

Anyway, enough shilling for now. ;)

Peace,

Jason
Tuesday October 10, 2006
09:38 AM

Perl needs (more) evangelism

I was having lunch with a programmer friend of mine, who does his work in .NET (C#, I believe) and we got into another 'Why Perl? Why .NET?' diatribe, which really went nowhere[1]. The sticking point to me was that while Perl is a great language/platform to immerse yourself in, the cool/new stuff leaves Perl behind.

This idea was reinforced by a recent Slashdot story, where an aspiring student picked great programmers to ask questions, but Larry Wall didn't make the cut. Not that Larry isn't great, but that Perl doesn't have the mindshare such that it made the student's list. Hopefully, Larry didn't get the email & ignore it. ;)

Topcoder is a neat site where programmers can compete, but they only support Java, .NET and C++, not Perl.

Google has code competitions which include Python, but not Perl. They have a neat Desktop system you can develop on, but not in Perl.

You can develop extensions for Firefox/Thunderbird, but not in Perl.

I'm probably not saying anything that hasn't already been said, but I'm worried about being the guy scrounging for jobs when I'm 50 and too set in my ways to learn yet another language, when all these cool/new things are the now/then standard.

We need to get Perl embedded into these cool/new things so that we never have to leave the comfy confines of the language to not only get the job done, but do some cool stuff, too.

Peace,

Jason

[1]: This leads me to yet another lesson I've learned - you learn more from listening than talking. There is no real truth that can win an emotional/instinctual/behavioral/spiritual argument. Watching Pudge & Ovid go at it reinforces that lesson. ;)
Thursday September 28, 2006
09:45 AM

Free lesson for you...

I have been working pretty hard on a work project for about a month now (off & on and for the last week, mostly on) and I've come to a realization that perhaps most of you already have.

For the impatient, here's the lesson up-front:

When tasked with importing data from an external source, consider importing it into a separate db/table and then building/extending the necessary functionality on top of that (versus importing the data right into your existing data).

We run circulation data for two magazines, both on systems we built ourselves. It was decided to outsource circulation of one of these to another vendor. Several months later, I was asked to import their data for some functionality that the vendor doesn't provide. As it turns out, there are some similarities between the schemas of their data and ours, but mostly differences. A big one is that when a user renews their subscription, I treat that as a separate subscription and they treat it as an extension of the existing subscription.

Anyway, like I said, I've been working hard and feel like I'm currently at 85 or 90% completion, but to nail the final non-conformities would require user-specific code, which makes my eyes bleed when I'm already facing some messed-up code (5 or 6 main IF branches and a few places that could be re-factored).

Perhaps this is when I should move to logic programming vs. functional programming, but I still don't have my head around that one. :(

Perhaps also, I'm just at a point that all programmers reach when they've invested too much time in a project and question "WHY?" and I should just buckle down and knock out the remaining 10-15% cases.

Cheers,

Jason
Tuesday August 15, 2006
10:14 AM

Codestorm

I recently had a situation come up where I had to whip up some code to split up a huge (1 GB) mbox file. I KNOW I should be using Maildir, but c'mon, people ... mbox is what Debian does by default and I don't spend my time sysadmin'ing stuff. In looking around, I couldn't believe others hadn't already done this (perhaps they have and my Google-fu just wasn't adequate). There was a promising git-mailsplit program, but I couldn't find it in Debian.

So I whipped this up - feel free to use/tweak this for your own use:


#!/usr/bin/perl -wT

# Process:
# 1) cp /var/mail/person /var/mail/person.bak
# 2) Run this script
# 3) chmod/chown the INBOX.GigSplitNN files
# chown person:users /home/person/INBOX.GigSplit*
# chmod 0600 /home/person/INBOX.GigSplit*
# 4) mv /var/mail/person /var/mail/person.prerm
# 5) mail the person and see if the /var/mail/person gets setup right
# 6) diff /var/mail/person.bak and /var/mail/person.prerm and put that in /var/mail/person
# i ended up just tailing the file with the right number of differing lines
# and >>'ing that into /var/mail/person
# b/c diff'ing two 1GB files takes WAY too long!

use strict;

open( MBOX, '/var/mail/person.bak' ) || die "Cannot open person.bak: $!";

# go through the mbox file
my $message = '';
my $line_count = 0;
my $message_count = 0;
my $file_base = '/home/person/INBOX.GigSplit';
my $file_i = 1;
my $line_count_limit = 580000; # this ends up with ~40MB files, which are more tolerable
my $need_to_write_init = 1;

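while( <MBOX> ) {
        $line_count++;
        if ( /^From / ) {
                # a new "From " line means the buffered message is complete
                flush_message() if length( $message ) > 0;
                $message = $_;
        } else {
                $message .= $_;
        }
}

# the loop only writes a message once it sees the NEXT "From " line,
# so the last message in the mbox still has to be written out here
flush_message() if length( $message ) > 0;

close( MBOX );

print "All done!\n";

# append the buffered message to the current split file and roll over
# to the next split file once the line count limit has been passed
sub flush_message {
        $message_count++;
        my $file = $file_base . sprintf( "%02d", $file_i );
        print "Got message # $message_count - appending to $file ...\n";
        if ( $need_to_write_init ) {
                write_initial_msg( $file );
                $need_to_write_init = 0;
        }
        open( SPLIT, '>>', $file ) || die "Cannot append to $file: $!";
        print SPLIT $message;
        close( SPLIT );
        if ( $line_count > $line_count_limit ) {
                print "Line Count exceeded $line_count_limit, so incrementing \$file_i...\n";
                $file_i++;
                $line_count = 0;
                $need_to_write_init = 1;
        }
}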

sub write_initial_msg {
        my $file = shift;
        open( FILE, ">$file" ) || die "Cannot open $file to put in initial msg: $!";
        print FILE <<"_EOF_";
From MAILER-DAEMON Mon Aug 14 13:00:31 2006
Date: 14 Aug 2006 13:00:31 -0400
From: Mail System Internal Data <MAILER-DAEMON\@mail.example.com>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
Message-ID: <1155574831\@mail.example.com>
X-IMAP: 1134739889 0000025473
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.

_EOF_
        close( FILE );
}


So that will create INBOX.GigSplit01 ... INBOX.GigSplitNN, which my user could manage with Squirrelmail (I had to hack /home/person/.mailboxlist to add those new folders). Since the problem stemmed from a checkbox in her email client keeping old messages on the server and not removing them, she could simply delete a lot of the stuff as redundant and just look at the more recent messages for stuff she missed. Remotely accessing a 1GB mbox file tends to time out. ;)

Yes, I KNOW that could be optimized and probably even done in one line (go for it, golfers!) ... it was something I had to do and it wasn't too painful to run (6700 msgs in 2 minutes).
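
If you want some reassurance that nothing got lost in the split, a quick sanity check is to count the "From " separator lines in the original and in the split files. This is just a sketch: the script name count_mbox.pl is made up, and it assumes every split file starts with the one MAILER-DAEMON placeholder message that the script above writes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count the messages in an mbox file by counting its "From " separator lines
sub count_messages {
    my $path = shift;
    open( my $fh, '<', $path ) || die "Cannot open $path: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++ if $line =~ /^From /;
    }
    close( $fh );
    return $count;
}

# first argument is the original mbox, the rest are the split files
if ( @ARGV ) {
    my ( $original, @splits ) = @ARGV;
    my $expected = count_messages( $original );
    my $total    = 0;
    for my $split ( @splits ) {
        my $n = count_messages( $split );
        # don't count the placeholder message at the top of each split file
        $total += $n - 1;
        print "$split: $n\n";
    }
    print "original: $expected, splits (minus placeholders): $total\n";
}
```

Run it as something like: perl count_mbox.pl /var/mail/person.bak /home/person/INBOX.GigSplit* and compare the two numbers on the last line.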

That's just the way I roll!* ;)

Speaking of coding, Google has their Code Jam going on, but where's the love for Perl? You can program in C++, C#, Java, Python and VB.NET, but not Perl. It probably has to do with what TopCoder supports, but something should really be done to get Perl in that list, for longevity sake.

Peace,

Jason

* = My new favorite saying
Thursday May 04, 2006
09:25 AM

Competition & AJAX

So I was playing around with the Discussion2 stuff, which is pretty neat, but it got me thinking about the whole AJAX stuff and competition in general.

Competition

Slashdot (and Slashcode) seems to be really evolving lately, with tagging, CSS and an improved commenting system. I don't know this for sure, but I gotta think that it's because Digg has put on some competitive pressure.

I always thought Slashdot was the 2-ton gorilla in the room that no one could mess with, but it goes to show that there's always a way to topple the giant.

I'm not saying that Slashdot is dead - there's a place for editorial control (save for April 1st), but they've certainly lost a lot of power to Digg, at least IMHO.

AJAX Thoughts

Some people refer to the onslaught of AJAX as AJAXturbation, which is crude, but seems to really get at the heart of current approaches.

Opening up your web application to AJAX techniques dramatically (and exponentially) increases the amount of traffic between the user and your server(s). So while we've saved bandwidth by converting from tables to CSS, we're going back with these little bursts of requests and responses as the user is on one page.

Another random thought is web analytics and statistics - do these AJAX requests/responses affect the stats ... should they? i.e. Does Digg tell their advertisers that a person landing on their homepage and digging two stories is 1 page view or 3?

Also, while AJAX is way cool to work with and that alone is a factor for so much of it out there, how will this affect JavaScript's presence and dependence and are developers really thinking out the logic of using it?

For example (and this is probably not the best example), pudge mentioned you can click on the 'read further' link and voilà! AJAX will bring the rest of the comment into view without the fuss of going through a page refresh. My point is that this will lead to a user playing around with it more and thinking less of the "cost" of clicking those links. So I (the user) play around with hiding/showing comments with less concern, devaluing the content and at the same time, hammering the server with these tiny requests.

Probably a better example would be the Wall St. Journal's recent right-click search. That will add to a lot of playing around and at the same time, it's annoying to have two right-click context menus.

Let's hope with all these requests flying all over the place that the Net Neutrality Bill passes (and that the DRM/Broadcast Flag people don't try to slip in some of their wishes).

But maybe that's the way the Web (2.0!) has to go, in order to become the next OS... what do you think?

Peace,

Jason
Tuesday March 28, 2006
04:29 PM

FF Extension & V

It's been quite a while since my last posting. Wanted to write about two different things and save catching up with the other stuff until later.

Firefox Extension

Back in January, I had an itch to scratch where I thought it would be cool to have some sort of statistical monitor of Firefox's cache in the statusbar. So I dived into extension development and with the help of docs & the #extdev IRC channel, cranked out the Cache Status extension.

Firefox extensions are merely XML and JavaScript (oh, and a manifest file). They could interface with webapps (written in Perl, of course) to pull down information from the Web somewhere.

I will tell you that writing an extension is not for the faint of heart. There are lots of little gotchas in the development cycle. You also get wrapped up into how cool it is and then come the negative people with their own baggage, dissing your work. This gives me insight into what it means to be an open source developer – it's not all roses; the heckles & negativity can seem to outweigh any praise you may garner. Have you appreciated your open source developer today?

Perhaps we should establish a new (inter)national holiday: “Open Source Developer Appreciation Day.” I'm not saying this for me; before this experience and other maturation aspects, I, too, was guilty of heaping on the negativity. Either learn the lesson or walk a mile (1k lines of code should qualify) in OS shoes and you'll be appreciating OSD's out there everywhere, especially for those projects you use.

One neat aspect of doing this type of project is that your work is readily translated. I have 10 different languages already along with submitted work for a few more, when I get around to it. That's pretty cool!

V for Vendetta

We just came back from NYC and since we were sans-Meredith (staying with the grandparents), we opted to see “V for Vendetta” at a movie theater on opening day! I don't know when we last did something like that as parents, so temper my rating with the excitement of actually going to a movie. We both thought it was a great movie.

It also got me to thinking about our government in the US and how scary the current situation is. The movie paints a picture of where things lie at the end of the slippery slope that we seem to be on.

Where will the US be in 50 years? With the Patriot Act and current government infractions of civil liberties as well as other bills in the pipes that threaten other freedoms, it's too easy to glimpse reality from V's fictional future.

I don't have a solution – I know what would be ideal, but I believe we're just too fat & lazy (figuratively) to get involved unless we face a major threat (and yeah, I include myself in that).

Peace,

Jason

Tuesday September 27, 2005
04:47 PM

Goal Update - New CPAN Module

Got four more books down:

I'm in the middle of two more books (Big Bad Wolf on audio CD for my daily commute and a signed first edition copy of the Trudeau Vector). I'm currently at 13 books (and I'll be at 15 prolly next week) and I will make my goal of 20.

Looking back at those goals, I've nailed the weight (still @ 202) and books. I've given up the MythTV boxen - just don't have that type of disposable income lying around.

I've also unveiled my most weighty CPAN module to date: CGI::Application::Plugin::MessageStack. This module/plugin works with cgiapp and gives you a place to push error or informational messages, which will then be automatically inserted into your runmodes.

It has over 60 tests and I developed it using a test-driven methodology (I have a testplan.txt in my 't' dir). Wrote an RFC, the docs through a Wiki, the tests and then the code. It was a really neat process.

I've also been polishing my other modules with a motivation towards improving my Kwalitee. There's just that pesky CGI::MxScreen I adopted that would require a lot of work to improve and I'm not convinced it would be worth it. I guess it will always be my albatross.

I've also been working on a freelance project to address one of my other goals (publishing tools) - taking the code and making it work for other publishers. With this example, I took our OneSource application code and extended it such that the Duke TIP EOG program can allow their listings to be collected online from their advertisers directly. It's coming along nicely and I've learned a few things along the way (CSS, client management, sans-serif vs. serif come to mind). Good lessons and formative experience to apply when I strike out on my own.

Peace,

Jason