
All the Perl that's Practical to Extract and Report


acme (189)

http://www.astray.com/

Leon Brocard (aka acme) is an orange-loving Perl eurohacker with many varied contributions to the Perl community, including the GraphViz module on the CPAN. YAPC::Europe was all his fault. He is still looking for a Perl Monger group he can start which begins with the letter 'D'.

Journal of acme (189)

Tuesday May 02, 2006
08:47 AM

Xapian

[ #29490 ]

As I mentioned recently on the mightyv blog, I've added full-text searching to mightyv. This enables you to find programmes containing pies. I've been toying with all these full-text search engines and recently decided upon the Xapian project. While I've toyed with doing this in Perl-space in the past, you really want to do this in C-space so that it is lightning-fast.

Rather kindly, the Xapian project supplies Debian and Ubuntu packages for the latest version and there is the rather under-documented Search::Xapian module as an interface to it.

Playing with Xapian, I've found that it creates small indexes and is really very fast indeed. It is best to use the Flint backend ($ENV{XAPIAN_PREFER_FLINT} = 1;) and I like the stemming code. For example, the xapian-compact-ed index for title and categories for 180k recipes is 37M and I can search for "killer salsa" in 7ms. Creating and updating the index is a little tricky (but you can update while reading from it, unlike Plucene), so after a little more experience I might well release a Search::Xapian::Simple which will just do the right thing for the common case.
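To make the index-then-search cycle concrete, here is a minimal sketch using Search::Xapian. It is an illustration under assumptions, not the mightyv code: the index path, the sample recipe titles, and the helper names terms_for, index_text and search_text are all made up for this example, and the tokenizing is deliberately naive.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Search::Xapian qw(:ops :db);

# From the post: prefer the Flint backend when a new database is created.
$ENV{XAPIAN_PREFER_FLINT} = 1;

my $stemmer = Search::Xapian::Stem->new('english');

# Lowercase the text, split on non-word characters and stem each word.
sub terms_for {
    my ($text) = @_;
    return map { $stemmer->stem_word($_) } grep { length } split /\W+/, lc $text;
}

# Add one document; $db is a Search::Xapian::WritableDatabase.
sub index_text {
    my ($db, $text) = @_;
    my $doc = Search::Xapian::Document->new;
    $doc->set_data($text);    # stored text, returned with search results
    my $pos = 0;
    $doc->add_posting($_, $pos++) for terms_for($text);
    return $db->add_document($doc);
}

# Return the stored text of up to $max documents matching all query words.
sub search_text {
    my ($db, $query_string, $max) = @_;
    my $query   = Search::Xapian::Query->new(OP_AND, terms_for($query_string));
    my $enquire = $db->enquire($query);
    return map { $_->get_document->get_data } $enquire->matches(0, $max);
}

# Hypothetical index path and recipe titles, for illustration only.
my $path   = 'recipes.db';
my $writer = Search::Xapian::WritableDatabase->new($path, DB_CREATE_OR_OPEN);
index_text($writer, $_) for 'Killer Salsa', 'Apple Pie', 'Mild Salsa Verde';
$writer->flush;    # make the new documents visible to readers

# A separate read-only handle, opened while the writer is still open.
my $reader = Search::Xapian::Database->new($path);
print "$_\n" for search_text($reader, 'killer salsa', 10);
```

The same stemmer is applied to both the indexed text and the query, so the stemmed terms line up. Running the script twice adds the sample documents again; a real indexer would need some way to replace existing documents.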

Basically, it's fast and neat. What do you use for full-text searches?

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • We've been using swish-e (http://www.swish-e.org/). It's easy to set up and install, and it's fast.
    • It's hard to tell, but it looks like swish-e is only set up to index files. I don't have files!
      • The last time I used swish-e, you could call an external program to 'fake' files. Something like swish-e -S prog.
      • Last time I used Swish-E I was indexing files, but they included things like MS Word and PDF documents so we used an indexing script to filter the files through X_to_text programs and feed the results to the indexer. There's no reason why your indexing script couldn't get its data from DBI or similar rather than files.

        The other thing I liked about the Swish-E indexing process was that you could feed arbitrary metadata fields to the indexer. This allowed you to get things like author name, publication da

        • Right, Xapian allows you to store arbitrary metadata and works fine under incremental indexing, which I consider key.
    • Is Swish-E working for you with Unicode? We've found it unsatisfactory once non-ASCII characters are used in the data being searched (and in the search terms and results).

      Smylers

  • We liked what Lucene had to offer, but Plucene left much to be desired. So we ended up creating a Java servlet so we could use Lucene proper as a web service (lucene-ws.net).

    There's a Perl client in the SVN repository, though it requires an as-yet-unreleased version of WWW::OpenSearch. Indexing is a bit slow, mostly due to the HTTP overhead, but searching is pretty slick and it now includes search suggestions.

    We'd like to replace it, eventually, with something more native to Perl. KinoSearch [rectangular.com] is relatively

  • HyperEstraier [sf.net] with a little help from Search::Estraier [cpan.org] fits my needs quite nicely.

    I started using search engines with swish-e (which I still use quite a bit), but there is also another very interesting project: KinoSearch [cpan.org], which looks very promising when full control from Perl is required (it somewhat reminds me of WAIT, which powered CPAN).
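
The "swish-e -S prog" approach mentioned in the comments above can be sketched roughly as follows: the indexer runs an external program and reads documents from its standard output, each one as a few header lines, a blank line, and then the content. This is only a sketch under assumptions. The rows are hard-coded stand-ins for the result of a DBI query, and the helper name swish_record, the recipe/ID path scheme and the timestamps are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Format one record for swish-e's "prog" input method:
# header lines, a blank line, then exactly Content-Length bytes of content.
sub swish_record {
    my ($path, $content, $mtime) = @_;
    my $bytes = encode_utf8($content);    # the length must be counted in bytes
    return "Path-Name: $path\n"
         . "Content-Length: " . length($bytes) . "\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $bytes;
}

# Hypothetical rows standing in for the result of a DBI query.
my @rows = (
    { id => 1, title => 'Killer Salsa', updated => 1146559620 },
    { id => 2, title => 'Apple Pie',    updated => 1146559620 },
);

binmode STDOUT;    # write raw bytes so the output matches the byte counts
print swish_record("recipe/$_->{id}", $_->{title}, $_->{updated}) for @rows;
```

Saved under a hypothetical name such as feed.pl and made executable, it would be run by the indexer with something like swish-e -S prog -i ./feed.pl, so no files need to exist on disk.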