
Journal of jonswar (7880)

Friday June 26, 2009
08:57 AM

Leaving use.perl for openswartz.com

This blog has been moved to openswartz.com. Hope to see you there!
Sunday April 26, 2009
09:45 PM

Auto-wrapping subclass methods

Back in Feb I asked on various lists how I could auto-wrap CHI driver methods, but didn't get any completely satisfying answers:

CHI drivers implement methods like remove() and clear(). If you call $cache->remove(), it goes directly to the driver subclass.

The problem is that there are now legitimate reasons to "wrap" these methods at the CHI/Driver.pm superclass level (meaning, do something before and/or after the method). For example, I want to add an optional generic size-awareness feature (the cache can keep track of its own size), which means that we have to adjust size whenever remove() and clear() are called. And I want to log remove() calls the way we currently log get() and set().

So one solution is to define remove() and clear() in CHI/Driver.pm, and have them call _remove() and _clear() in the driver subclasses. But this kind of change makes me uneasy for a couple of reasons:

  • It changes the driver API, i.e. all existing drivers out there have to be modified. And we might have to change it again as we identify new methods to wrap.
  • The list of 'normal' versus 'underscore' methods becomes rather arbitrary - it's "whatever we've needed to wrap so far".

I thought about using regular wrapping modules like Sub::Prepend or Hook::LexWrap, but this approach fails when you have subclasses more than one level deep, e.g.:

CHI::Driver -> CHI::Driver::Foo -> CHI::Driver::Foo::Bar

Now if you call CHI::Driver::Foo::Bar::remove(), the wrapping code will get called twice, once for each subclass. I only want it to be called once regardless of how deep the subclass is.
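
To see the problem concretely, here's a minimal sketch (plain Perl with hypothetical package names, not CHI code) of what happens when a wrapper that delegates up the chain is installed into every subclass package:

use strict;
use warnings;

package Base;
sub new    { bless {}, shift }
sub remove { print "real remove\n" }

package Mid;
our @ISA = ('Base');

package Leaf;
our @ISA = ('Mid');

package main;

# Install a "run the hook, then call up the chain" wrapper into each subclass,
# the way a per-class wrapping scheme would.
for my $class (qw(Mid Leaf)) {
    my $parent = $class eq 'Leaf' ? 'Mid' : 'Base';
    no strict 'refs';
    *{"${class}::remove"} = sub {
        my $self = shift;
        print "hook fired in $class\n";
        my $up = "${parent}::remove";
        return $self->$up(@_);    # fully-qualified call to the next level up
    };
}

Leaf->new->remove;
# Prints "hook fired in Leaf", "hook fired in Mid", then "real remove" --
# the hook ran twice for a single call, once per level of subclassing.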

Here's how I solved this in CHI-0.2. When each CHI driver is used for the first time, e.g. CHI::Driver::Memory:

my $cache = CHI->new('Memory');

CHI autogenerates a new class called CHI::Wrapped::CHI::Driver::Memory, which inherits from

('CHI::Driver::Wrapper', 'CHI::Driver::Memory')

then blesses the actual cache object (and future cache objects of this driver) as CHI::Wrapped::CHI::Driver::Memory.

Now, when someone calls a method like $cache->get() or $cache->remove(), CHI::Driver::Wrapper has an opportunity to handle it first and then pass control to CHI::Driver::Memory; if it doesn't define the method, the call goes directly to CHI::Driver::Memory.

I was unable to find this solution on CPAN, even though I feel like I must be reinventing the wheel. If someone knows of a distribution that encapsulates this technique, please let me know.

Here's the code from CHI::Driver::Wrapper that creates the wrapper class:

sub create_wrapped_driver_class {
    my ( $proto, $driver_class ) = @_;
    carp "internal class method" if ref($proto);

    # %wrapped_driver_classes (declared elsewhere in this module) memoizes the generated wrapper classes
    if ( !$wrapped_driver_classes{$driver_class} ) {
        my $wrapped_driver_class      = "CHI::Wrapped::$driver_class";
        my $wrapped_driver_class_decl = join( "\n",
            "package $wrapped_driver_class;",
            "use strict;",
            "use warnings;",
            "use base qw(CHI::Driver::Wrapper $driver_class);",
            "sub driver_class { '$driver_class' }",
            "1;" );
        eval($wrapped_driver_class_decl);    ## no critic ProhibitStringyEval
        die $@ if $@;                        ## no critic RequireCarping
        $wrapped_driver_classes{$driver_class} = $wrapped_driver_class;
    }
    return $wrapped_driver_classes{$driver_class};
}

And here's the first application of auto-wrapping: when certain methods are called on a cache, automatically call them on the subcaches, if any.

# Call these methods first on the main cache, then on any subcaches.
#
foreach my $method (qw(remove expire expire_if clear purge)) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$method" } = sub {
        my $self = shift;
        my $retval = $self->call_native_driver( $method, @_ );
        $self->call_method_on_subcaches( $method, @_ );
        return $retval;
    };
}

# Call the specified $method on the native driver class, e.g. CHI::Driver::Memory.  SUPER
# cannot be used because it refers to the superclass(es) of the current package and not to
# the superclass(es) of the object - see perlobj.
#
sub call_native_driver {
    my $self                 = shift;
    my $method               = shift;
    my $native_driver_method = join( "::", $self->driver_class, $method );
    $self->$native_driver_method(@_);
}
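
As a quick illustration of the effect (paths and values are hypothetical, not from the post): with an L1 subcache configured, a wrapped method like remove() now touches both caches.

use CHI;

my $cache = CHI->new(
    driver   => 'File',
    root_dir => '/tmp/chi-demo',
    l1_cache => { driver => 'Memory' },
);

$cache->set( 'key', 42 );
$cache->remove('key');    # wrapped: remove() is also called on the L1 subcache

print "gone from the L1 subcache too\n"
    unless defined $cache->l1_cache->get('key');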

Saturday April 25, 2009
11:19 PM

CHI 0.2 - Subcaches

I've just released CHI 0.2. The main visible change is that multi-level caches have been fleshed out and made easier to use.

There are two kinds of multi-level relationships that I wanted to be able to express easily with CHI:

  • L1 (level 1) cache: Sits in front of the primary cache to provide faster access to commonly accessed cache entries, i.e. a cache for your cache.
  • Mirror cache: Sits behind the primary cache and, over time, mirrors its contents. Useful for migrating from one cache to another without a sudden performance hit.

Initially CHI had a Multilevel driver that would let you place two or more caches inside a container cache object. The problem was that adding an L1 cache to an existing cache required changing it to a Multilevel cache, causing existing driver-specific calls to fail. (e.g. If I change a File cache to a Multilevel cache, File-specific methods will no longer be handled correctly.)

In 0.2 I switched to a primary cache / subcache model, which seems more appropriate. Now the File cache has an L1 subcache, and File-specific methods (as well as many ancillary methods for which the L1 relationship has no clear meaning) simply go to the primary cache.

The usage is also simpler. Here we place an in-process Memory cache in front of a Memcached cache:

    my $cache = CHI->new(
        driver   => 'Memcached',
        servers  => [ "10.0.0.15:11211", "10.0.0.15:11212" ],
        l1_cache => { driver => 'Memory' }
    );

Note that there isn't a way yet to specify a size limit for the memory cache, which would make this a lot more self-maintaining. :) That's coming soon. In the meantime, I'm planning to use this for an unlimited request-based cache, clearing it manually at the end of each web request:

    $cache->l1_cache->clear();
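
For example (a hypothetical PSGI sketch, not from the post; any equivalent end-of-request hook would do), the clear() call could live in a thin wrapper around the application, reusing the $cache built above:

    my $app = sub {
        my $env  = shift;
        my $body = $cache->get('greeting');
        if ( !defined $body ) {
            $body = "hello at " . localtime() . "\n";
            $cache->set( 'greeting', $body, '5 minutes' );
        }
        return [ 200, [ 'Content-Type' => 'text/plain' ], [$body] ];
    };

    my $wrapped_app = sub {
        my $env = shift;
        my $res = $app->($env);
        $cache->l1_cache->clear();    # drop the request-scoped L1 contents
        return $res;
    };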

Here we prepare to migrate from an old to a new cache directory:

    my $cache = CHI->new(
        driver       => 'File',
        root_dir     => '/old/cache/root',
        mirror_cache => { driver => 'File', root_dir => '/new/cache/root' },
    );

We leave this running for a few hours (or as needed), then replace it with

    my $cache = CHI->new(
        driver   => 'File',
        root_dir => '/new/cache/root'
    );

More details in the Subcaches section of the CHI 0.2 documentation.

Monday February 04, 2008
08:43 PM

2 problems with typical Perl cache usage (and 5 solutions)

A typical cache usage pattern looks like this:

   # Try to get the value from the cache
   my $result = $cache->get('key');

   # If it isn't there, compute and set in the cache
   if (!defined($result)) {
      $result = do_something_to_compute_result();
      $cache->set('key', $result, $expiration_time);
   }

This pattern is popular because it is easy to wrap around existing code, and easy to understand.

Unfortunately, it suffers from two problems:

  • Miss Stampedes: When a cache item expires at its specified time, every process that tries to get it will start to recompute it. If it is a popular cache item and recomputation is expensive, you may get many recomputations of the same item, which is at best wasteful and at worst can bog down a server.

    I originally recognized this problem while working on Mason's busy locking, but am obviously not alone in experiencing it. The term "miss stampede" comes from this memcached list discussion - definitely worth a read.

  • Recomputation Latency: When a cache item is recomputed, the client (whether that's a browser, a command-line script, or something else) has to wait for the computation to complete. Since caching keeps average latencies down, there is a tendency to ignore the unfortunate customer who gets stuck with one or more cache misses.

Here are some ways of tweaking the usage pattern above to address one or both of these problems. I've added the initials of the problems that each one addresses, and mentioned relevant features from CHI, if any.

  • Probabilistic expiration (MS)

    Instead of specifying a single expiration time, specify a range of time during which expiration might occur. Then each cache get makes an independent probabilistic decision as to whether the item has expired. The probability starts out low at the beginning of the range and increases to 1.0 at the end of the range. For popular cache items, this means that most likely only one or a handful of gets will see the item as expired at around the same time.

    CHI supports this with the expires_variance parameter, which may be passed to individual set commands or set as a default for all sets. Personally, I plan to default it to 0.2 or so in almost all my caches (see the sketch after this list).

    Drawbacks: Since this is probabilistic, you get no guarantee of how well stampedes will be avoided (if at all), and you have to try to guess the right variance to use.

  • Busy locks (MS)

    When a cache item expires, flag the item for a short time, either by upping its expiration time or by setting an associated value in the cache. Subsequent misses will see the flag and return the old value instead of duplicating the recompute effort.

    CHI supports this with the busy_lock parameter, stolen from Mason. It works by temporarily setting the expiration time forward by the specified amount of time.

    Drawbacks: Setting a busy lock involves a separate write, so if you use this feature liberally, you'll double the number of write operations you do. Some backends will also suffer from a race condition: a small window of time in which many processes may decide to recompute before the first lock has been successfully set.

  • Background recomputation (RL)

    When a cache item expires, return the old value immediately, then kick off a recomputation in the background. This spares the client from the cost of the recompute.

    This requires a non-traditional usage pattern, since the get and set are effectively happening as part of one operation. In CHI it will look like this:

        my $result =
          $cache->compute( 'key', sub { do_something_to_compute_result() },
            $expiration_time );

    CHI already has a working compute API, but doesn't yet know how to run things in the background. Coming soon.

    Drawbacks: Requires a non-traditional and somewhat ugly code pattern; background processes are harder to track and debug.

  • External recomputation (MS + RL)

    Recompute cache items entirely from an external process, either when change events occur or when items approach their expiration time. Items never actually expire as the result of a client request. This is the most efficient and client-friendly solution, if you can manage it.

    Drawbacks: Requires extra external processes (more moving parts). Code to recompute caches must be available from the external process, which can result in some unwanted code separation, API contortions, or repetition. It is also difficult to know which items to keep repopulating, and when exactly to recompute them.

  • Externally initiated recomputation (MS + RL)

    Use a periodic external process to trigger events that will naturally utilize your caches (e.g. write a cron job that hits common pages on your website), but pass a special flag making items more likely to expire. This makes it less likely that expiration will occur during a real client request.

    This is not yet supported in CHI, but the idea would be to add some kind of easily-accessible lever to temporarily view all expiration times as reduced. e.g.

        # Reduction ends when $lex goes out of scope
        my $lex = CHI->reduced_expirations(0.5);

    Drawbacks: Requires extra external processes (more moving parts). Triggers and their run frequencies must be carefully chosen.
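
Putting the first two techniques together in code, here's a minimal sketch (not from the post; parameter names come from the CHI docs, the values are illustrative, and a reasonably recent CHI is assumed) of the usual get/set pattern with a default expires_variance plus a busy_lock:

    use CHI;

    my $cache = CHI->new(
        driver           => 'Memory',
        global           => 1,             # shared in-process datastore
        expires_variance => 0.2,           # expire probabilistically in the last 20% of each item's lifetime
        busy_lock        => '30 seconds',  # on expiration, push the item forward while one process recomputes
    );

    sub do_something_to_compute_result { return scalar localtime }

    my $result = $cache->get('key');
    if ( !defined $result ) {
        $result = do_something_to_compute_result();
        $cache->set( 'key', $result, '10 minutes' );
    }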

What other techniques have you used, and what success/failures have you had with them?

Wednesday January 23, 2008
06:57 PM

CHI: Cache Interface for Perl

CHI, a module I've been working on for a few months, has made it to CPAN:

   file: $CPAN/authors/id/J/JS/JSWARTZ/CHI-0.03.tar.gz
   size: 62313 bytes
    md5: ec828f2466ba266e11cd6d1dd5ca2913

CHI provides a unified caching API, designed to assist a developer in persisting data for a specified period of time. It is intended as an evolution of DeWitt Clinton's Cache::Cache package, adhering to the basic Cache API but adding new features and addressing limitations in the Cache::Cache implementation.

You might think of it as a fledgling "DBI for caching".

Driver classes already exist for in-process memory, plain files, memory-mapped files and memcached. Other drivers such as BerkeleyDB and DBI will be coming soon. Fortunately, implementing drivers is fairly easy, on the order of creating a TIE interface to your data store.
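
As a quick taste of the unified API (a sketch with hypothetical keys and paths, not from the announcement), the same get/set code works against any of these drivers; only the constructor arguments change:

    use CHI;

    my $cache = CHI->new( driver => 'File', root_dir => '/tmp/chi-demo' );

    $cache->set( 'user:42', { name => 'Jon' }, '10 minutes' );
    my $user = $cache->get('user:42');    # hashref restored from the cache, or undef if missing/expired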

Special thanks to the Hearst Digital Media group, where CHI was first designed and developed, for blessing the open source release of this code.

There's lots more in store for this module, so stay tuned! Feedback welcome here or on the Perl cache mailing list.

Thursday September 06, 2007
04:52 PM

A standard logging API

It seems as if every CPAN module has its own way of logging debug information and error conditions. For example:
  • LWP - activate by use'ing LWP::Debug; outputs to STDERR
  • DBI - activate by calling DBI->trace(); outputs to STDERR or a file
  • Rose::DB - activate by setting various $Debug package variables; outputs to STDERR
  • Encode::* - activate by modifying various DEBUG subroutines to return 1; outputs using warn()
  • Apache::* - activate by setting the Apache log level and restarting; outputs to the Apache logs

In addition, there must be CPAN modules that have interesting things to say but choose not to log at all, because they don't want to invent another logging mechanism or become dependent on an existing one.

This situation is pretty much the opposite of what I want when developing a large application. I want a single way to turn logging on and off, and to control where logs get sent, for all of the modules I'm using.

This being Perl, there are many fine logging frameworks available: Log::Log4perl, Log::Dispatch, Log::Handler, Log::Agent, Log::Trivial, etc. So why do CPAN modules eschew the use of these and invent their own mechanisms that are almost guaranteed to be less powerful?

  • The very existence of so many logging modules means that there is no one standard that a CPAN author would feel comfortable binding their users to. As usual, TMTOWTDI is a double-edged sword.
  • A logging framework can be a significant dependency for a module to have, easily dwarfing the size of the module itself. For small modules that want to minimize dependencies, depending on Log4perl (for example) is a non-starter.

A Common Log API

One thing to notice is that while the logging frameworks all differ in their configuration and activation APIs and in the set of features they support, the API to log messages is generally quite simple. At its core it consists of:

  • A set of valid log levels, e.g. debug, info, warn, error, fatal
  • Methods to log a message at a particular level, e.g. $log->debug()
  • Methods to determine if a particular level is activated, e.g. $log->is_debug()

I expect most CPAN modules would happily stick to this API, and let the application worry about configuring what's getting logged and where it's going. Therefore...

Proposed Module: Log::Any

I propose a small module called Log::Any that provides this API, with no dependencies and no logging implementation of its own. Log::Any would be designed to be linked by the main application to an existing logging framework.

A CPAN module would use it like this:

    package Foo;
    use Data::Dumper;
    use Log::Any;
    my $log = Log::Any->get_logger(category => __PACKAGE__);

    $log->error("an error occurred");

    $log->debug("arguments are: " . Dumper(\@_))
        if $log->is_debug();

By default, methods like $log->debug would be no-ops, and methods like $log->is_debug() would return false.

As a convenient shorthand, you can use

    package Foo;
    use Log::Any qw($log);

to create the logger, which is equivalent to the first example except that $log is (necessarily) a package-scoped rather than lexical variable.

How does an application activate logging? The low-level way is to call Log::Any->set_logger_factory (better name pending) with a single argument: a subroutine that takes a log category and returns a logger object implementing the standard logging API above. The log category is typically the class doing the logging, and it may be ignored.

For example, to link with Log::Log4perl:

    use Log::Any;
    use Log::Log4perl;

    Log::Log4perl->init("log.conf");
    Log::Any->set_logger_factory
       (sub { Log::Log4perl->get_logger(@_) });

To link with Log::Dispatch, with all categories going to the screen:

    use Log::Any;
    use Log::Dispatch::Screen;

    my $dispatcher = Log::Dispatch::Screen->new(...);
    Log::Any->set_logger_factory(sub { $dispatcher });

To link with Log::Dispatch, with different categories going to different dispatchers:

    use Log::Any;
    use Log::Dispatch::Screen;
    use Log::Dispatch::File;

    my $dispatcher_screen = Log::Dispatch::Screen->new(...);
    my $dispatcher_file   = Log::Dispatch::File->new(...);

    sub choose_dispatcher {
        my $category = shift;
        $category =~ /DBI|LWP/ ? $dispatcher_file : $dispatcher_screen;
    }
    Log::Any->set_logger_factory(\&choose_dispatcher);

This API is a little awkward for the average user. One solution is for logging frameworks themselves to provide more convenient mixins, e.g.:

   use Log::Dispatch;   # this also defines Log::Any::use_log_dispatch
   my $d = Log::Dispatch::File->new(...);
   Log::Any->use_log_dispatch($d);  # calls set_logger_factory for you

   use Log::Log4perl;   # this also defines Log::Any::use_log4perl
   Log::Any->use_log4perl();        # calls set_logger_factory for you

set_logger_factory would be implemented so as to take effect on all existing as well as future loggers. Any $log objects already created inside modules will automatically be switched when set_logger_factory is called. (i.e. $log will probably be a thin proxy object.) This means that Log::Any need not be initialized by the time it is used in CPAN modules, and it allows set_logger_factory to be called more than once per application.
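
To make the "thin proxy" idea concrete, here's a rough sketch (purely illustrative; the name and details are not the actual Log::Any implementation) of a proxy whose methods look up the current factory at call time:

    package Log::Any::ProxySketch;    # hypothetical package name
    use strict;
    use warnings;

    my $factory = sub { undef };      # no backend until the application installs one

    sub set_logger_factory {
        my ( $class, $code ) = @_;
        $factory = $code;
    }

    sub get_logger {
        my ( $class, %params ) = @_;
        my $category = defined $params{category} ? $params{category} : scalar caller();
        return bless { category => $category }, $class;
    }

    our $AUTOLOAD;

    sub AUTOLOAD {
        my $self = shift;
        ( my $method = $AUTOLOAD ) =~ s/.*:://;
        return if $method eq 'DESTROY';

        # Look up the real logger at call time, so a factory installed (or
        # replaced) after this proxy was created still takes effect.
        my $logger = $factory->( $self->{category} );
        return ( $method =~ /^is_/ ? 0 : undef ) unless $logger;
        return $logger->$method(@_);
    }

    1;

With a shape like this, the set_logger_factory examples above work unchanged, and replacing the factory at runtime retargets every $log that modules have already created.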

Promoting Use

For Log::Any to be useful, a substantial number of modules - especially major modules - would have to adopt its use. Fortunately, with its minimal footprint and standalone nature, authors should not find Log::Any a difficult dependency to add. Existing logging mechanisms, such as LWP::Debug and $DBI::tfh, could easily be converted to write *both* to their existing output streams and to Log::Any. This would preserve backward compatibility for existing applications, but allow new applications to benefit from more powerful logging. I would be willing to submit such patches to major module authors to get things going.

Feedback welcome. Thanks!