Journal of sheriff_p (1577)

Friday March 06, 2009
01:44 AM

ORA's Safari Search sucks; the rant I just sent them...

To whom it may concern,

Knowing that Safari has several books concerning Objective C, today I tried to find them using the search. After several tries, I gave up, used Google to find the books, and ended up typing in the author names.

This is obviously sub-optimal. Here are the three biggest problems with the search.

Problem the first: inappropriate stemming of search terms. If you're searching a corpus you don't know much about, then sure, some stemming is fine. If you're indexing technical books, then stemming 'Objective' to 'Object' is a poor choice. If you're going to insist on doing stemming, why not return results that contain the term unstemmed higher in the rankings?

There must be a huge archive of previous customer searches. Tokenize these into words, and build a stop list of common terms that should never be stemmed.
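As a sketch only (all names and the protected-terms list are hypothetical), the idea above amounts to running a stemmer with an exception list mined from past queries, so a term like "Objective" is never collapsed into "Object":

```python
# Hypothetical sketch: crude suffix stemmer plus a protected-terms list
# built from common past search queries.

PROTECTED = {"objective"}  # mined from the archive of customer searches


def naive_stem(word):
    """Stand-in for a real stemmer: strip a few common suffixes."""
    w = word.lower()
    for suffix in ("ive", "ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w


def index_term(word):
    """Stem for indexing, unless the term is on the protected list."""
    w = word.lower()
    return w if w in PROTECTED else naive_stem(w)


print(naive_stem("Objective"))  # "object" -- the behaviour complained about
print(index_term("Objective"))  # "objective" -- left intact
```

The same `index_term` function would be applied at query time, so protected terms match exactly while everything else still benefits from stemming.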

Problem the second: not distinguishing between different result types. Again, if you're Google, you don't have much choice but to return URLs (although even they make a fair stab at returning different objects: videos, images, products, etc.). If you're a book publisher, you have much more scope for returning dramatically more useful results.

Specify your use-cases here. The search form actually allows people to search individual result objects, ish. There's a drop-down option for Authors, Titles, etc. However, it's an unlabelled form element, and the selected option there is "Entire Site". I'd suggest the user expectation is that clicking on it will offer sub-site search specializations, rather than meta-data specializations. But this is a bit of a tangent:

As a member of the search public, what am I looking for? One of: whole books based on title; whole books based on category; whole books based on author; individual sections from books on topic. Returning these mixed together makes the results confusing, and largely irrelevant. Split them out. Let's see: First three categories that match your search (more on this in a minute); First three authors that match your search; First three titles that match your search; First three chapters that match your search. This covers all your bases, whatever your user was searching for.
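The split proposed above can be sketched in a few lines. This is an illustration only (the records are made up): bucket raw search hits by the kind of object they represent, and surface the first three of each bucket instead of one mixed list.

```python
# Sketch: group search hits by object type and cap each group, rather
# than mixing categories, authors, titles, and chapters together.


def grouped_results(hits, per_group=3):
    groups = {}
    for hit in hits:
        groups.setdefault(hit["type"], []).append(hit)
    return {kind: items[:per_group] for kind, items in groups.items()}


hits = [
    {"type": "category", "name": "Objective C"},
    {"type": "title",    "name": "Programming in Objective-C"},
    {"type": "title",    "name": "Learning Cocoa"},
    {"type": "author",   "name": "Stephen Kochan"},
    {"type": "chapter",  "name": "Objects and Classes"},
]

for kind, items in grouped_results(hits).items():
    print(kind, [h["name"] for h in items])
```

Each group keeps its own relevance ordering; the cap just stops one noisy object type from drowning out the others.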

Problem the third: not indexing categories. There's a category called "Objective C". It didn't come up when searching for Objective C. It's also not obviously reachable from the Safari homepage without performing a different search first, and then clicking through the returned nav on the left (where it isn't highlighted, but other categories are). If I "Browse the Safari Library", I can't drill down. What kind of browsing is that?

People may well be searching for a category, so why not index the category names themselves, rather than just returning the categories the top books come from? Combined with the stemming issues mentioned above, this means you may as well not have the category classifications at all.
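Concretely, this just means treating category names as indexable documents in their own right. A toy sketch (the index layout and sample documents are invented for illustration):

```python
# Toy inverted index in which a category name is a first-class document,
# so a search for "Objective C" can return the category directly.

index = {}  # term -> set of (object_type, name) pairs


def add_document(object_type, name, text):
    for term in text.lower().split():
        index.setdefault(term, set()).add((object_type, name))


add_document("category", "Objective C", "Objective C")
add_document("title", "Programming in Objective C", "Programming in Objective C")


def search(query):
    per_term = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*per_term) if per_term else set()


print(search("Objective C"))  # the category itself is among the hits
```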

There are a relatively small number of books on Safari. Using a naive word-matching search when you know so much about your content already is far from ideal. You have obvious distinct objects that people are searching for - ignoring this and treating it like you're searching flat and homogeneous content is the reason the search is so totally broken.


Monday January 16, 2006
04:35 PM

Catalyst in 20 minutes

I've been spamming everywhere else, so why not here?

'The purpose of this tutorial is to teach you enough Catalyst to be dangerous, as quickly as possible. It should take less than an hour to complete. Dangerous, in this case, means "able to make use of the core documentation".'


Monday October 10, 2005
03:30 AM

Javascript Goodness

I've been doing some pretty funky stuff with Javascript recently - I had a two week gig to produce a front end to an XMLRPC interface to a client's admin system - if you're a Bytemark customer you can have a play at:

I really like how Javascript makes me think differently about some aspects of programming - the inheritance system is ... different, but kinda funky, and I've been relying on closures more than I've had to before. So if anyone knows anyone who needs some Javascript contracting done ... :-)

Wednesday May 04, 2005
04:04 AM

Filtering non-ASCII characters with procmail and mutt

Almost a year since my last entry, awesome... :-)

So here's a little snippet of my .procmailrc to remove characters I can't understand anyway from the Subject and From lines, as mutt was sending them straight to my terminal and messing up my display:

# Rewrite the subject and sender to remove foreign characters
OLDSUBJECT=`/usr/local/bin/formail -xSubject:`
NEWSUBJECT=`echo $OLDSUBJECT | /usr/bin/tr -cs '\11\12\40-\176' 'Z'`

OLDSENDER=`/usr/local/bin/formail -xFrom:`
NEWSENDER=`echo $OLDSENDER | /usr/bin/tr -cs '\11\12\40-\176' 'Z'`

:0 fhw
|/usr/local/bin/formail -i "Subject: $NEWSUBJECT"

:0 fhw
|/usr/local/bin/formail -i "From: $NEWSENDER"
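For anyone curious what that tr invocation actually does, here's the same transformation sketched in Python (an illustration, not part of the recipe): any run of bytes outside tab (\11), newline (\12), and printable ASCII (\40-\176) is squashed to a single 'Z', which is what tr's -c (complement) and -s (squeeze) flags combine to do.

```python
import re


def scrub(raw: bytes) -> bytes:
    # Equivalent of: tr -cs '\11\12\40-\176' 'Z'
    # Replace each run of bytes outside tab/newline/printable ASCII
    # with a single 'Z'.
    return re.sub(rb"[^\t\n\x20-\x7e]+", b"Z", raw)


print(scrub("H\u00e9llo w\u00f6rld".encode("utf-8")))  # b'HZllo wZrld'
```

Note it operates on bytes, so a multi-byte UTF-8 character becomes one 'Z', not several.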

Thursday June 03, 2004
08:00 AM

mod_perl 2 guide


I'm putting together a very rudimentary mod_perl 2 tutorial / guide. The HTML sucks etc, but content suggestions are welcome ...


Thursday April 10, 2003
04:58 AM

Bug Bonanza

My bug challenge has proven to be quite popular, and the approach has picked up a couple of fans...

So here's an idea: Why Doesn't Someone(tm) create an automated system where module authors can offer bug bounties in a central place? Bounties per bug can be set as high as authors want, in a sort of auction fashion... Sadly, as TorgoX will no doubt point out, I'm a bear of little action, and even less time, so I think this should become someone else's baby.

Monday April 07, 2003
06:09 AM


I released RTF::Tokenizer v1.0 last night. All hail me. I used FIGLET for the README file again... It's a massive improvement. I stole the best parts from RTF::Parser and the old RTF::Tokenizer, and got a huge speed improvement. There are over three thousand tests too. The plan is now to build RTF::Reader using it, and also rewrite RTF::Parser to use it, which is easier said than done, but, done it shall be.