
Journal of jonswar (7880)

Monday February 04, 2008
09:43 PM

2 problems with typical Perl cache usage (and 5 solutions)

A typical cache usage pattern looks like this:

   # Try to get the value from the cache
   my $result = $cache->get('key');

   # If it isn't there, compute and set in the cache
   if (!defined($result)) {
      $result = do_something_to_compute_result();
      $cache->set('key', $result, $expiration_time);
   }

This pattern is popular because it is easy to wrap around existing code, and easy to understand.

Unfortunately, it suffers from two problems:

  • Miss Stampedes: When a cache item expires at the specified time, any processes trying to get it will start to recompute it. If it is a popular cache item and if recomputation is expensive, you may get many recomputations for the same item, which is at best wasteful and at worst can bog down a server.

    I originally recognized this problem while working on Mason's busy locking, but am obviously not alone in experiencing it. The term "miss stampede" comes from this memcached list discussion - definitely worth a read.

  • Recomputation Latency: When a cache item is recomputed, the client (whether that be a browser, a command-line tool, or anything else) has to wait for the computation to complete. Since caching keeps average latencies down, there is a tendency to ignore the unfortunate customer who gets stuck with one or more cache misses.

Here are some ways of tweaking the usage pattern above to address one or both of these problems. I've added the initials of the problems that each one addresses, and mentioned relevant features from CHI, if any.

  • Probabilistic expiration (MS)

    Instead of specifying a single expiration time, specify a range of time during which expiration might occur. Then each cache get makes an independent probabilistic decision as to whether the item has expired. The probability starts out low at the beginning of the range and increases to 1.0 at the end of the range. For popular cache items, this means that most likely only one or a handful of gets will treat the item as expired at the same time, rather than all of them at once.

    CHI supports this with the expires_variance parameter. It may be passed to individual set commands or as a default for all sets. Personally, I plan to default it to 0.2 or so in almost all my caches.

    Drawbacks: Since this is probabilistic, you get no guarantee of how well stampedes will be avoided (if at all), and you have to try to guess the right variance to use.
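
    The decision each get makes can be sketched as follows. This is a minimal illustration, not CHI's actual code: the is_expired helper and its arguments are hypothetical, and it assumes the probability rises linearly across the window.

```perl
use strict;
use warnings;

# Hypothetical helper, not CHI's implementation: decide whether an item
# should be treated as expired at time $now.
#   $created_at - when the item was set (epoch seconds)
#   $expires_at - the hard expiration time (epoch seconds)
#   $variance   - fraction (0..1) of the lifetime in which early
#                 expiration may occur
sub is_expired {
    my ( $now, $created_at, $expires_at, $variance ) = @_;

    # Start of the window in which early expiration is possible
    my $early_expires_at =
      $expires_at - ( $expires_at - $created_at ) * $variance;

    return 1 if $now >= $expires_at;          # past the end: always expired
    return 0 if $now <= $early_expires_at;    # before the window: never expired

    # Inside the window: probability rises linearly from 0 to 1
    my $probability =
      ( $now - $early_expires_at ) / ( $expires_at - $early_expires_at );
    return rand() < $probability ? 1 : 0;
}

# Item set at t=0 with a 100-second lifetime and a variance of 0.2,
# so early expiration can happen between t=80 and t=100.
for my $t ( 50, 80, 90, 100 ) {
    printf "t=%3d expired=%d\n", $t, is_expired( $t, 0, 100, 0.2 );
}
```

    With these numbers, a get at t=90 has roughly an even chance of seeing the item as expired, so its result varies from run to run.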

  • Busy locks (MS)

    When a cache item expires, flag the item for a short time, either by upping its expiration time or by setting an associated value in the cache. Subsequent misses will see the flag and return the old value instead of duplicating the recompute effort.

    CHI supports this with the busy_lock parameter, stolen from Mason. It works by temporarily setting the expiration time forward by the specified amount of time.

    Drawbacks: Setting a busy lock involves a separate write. If you use this feature liberally, you'll double the number of write operations you do. Some backends will suffer from a race condition, a small window of time in which many processes may decide to recompute, before the first lock has been successfully set.
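
    The "push the expiration time forward" variant can be sketched with a toy in-memory cache. The cache_set and cache_get functions here are hypothetical and take the current time as an argument to keep the example deterministic; a real backend would also have to deal with the race condition mentioned above.

```perl
use strict;
use warnings;

# Toy in-memory cache; each entry is { value => ..., expires_at => ... }
my %store;

sub cache_set {
    my ( $key, $value, $expires_in, $now ) = @_;
    $store{$key} = { value => $value, expires_at => $now + $expires_in };
    return;
}

# Returns the cached value, or undef if the caller should recompute.
# $busy_lock is the number of seconds to hold off other callers.
sub cache_get {
    my ( $key, $busy_lock, $now ) = @_;
    my $entry = $store{$key} or return;
    return $entry->{value} if $now < $entry->{expires_at};

    # Expired: push the expiration forward so that later callers keep
    # getting the old value, and tell this caller to recompute.
    $entry->{expires_at} = $now + $busy_lock;
    return;
}

cache_set( 'key', 'old value', 60, 0 );    # expires at t=60

my $first  = cache_get( 'key', 10, 61 );   # miss: this caller recomputes
my $second = cache_get( 'key', 10, 62 );   # sees the lock, gets the old value

print defined $first ? "first: hit\n" : "first: miss, recomputing\n";
print "second: $second\n";
```

    Note the extra write on the miss path: the entry is updated once to set the lock and again when the recomputed value is stored.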

  • Background recomputation (RL)

    When a cache item expires, return the old value immediately, then kick off a recomputation in the background. This spares the client from the cost of the recompute.

    This requires a non-traditional usage pattern, since the get and set are effectively happening as part of one operation. In CHI it will look like this:

        my $result =
          $cache->compute( 'key', sub { do_something_to_compute_result() },
            $expiration_time );

    CHI already has a working compute API, but doesn't yet know how to run things in the background. Coming soon.

    Drawbacks: Requires a non-traditional and somewhat ugly code pattern; background processes are harder to track and debug.
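
    To make the control flow concrete, here is a sketch of a compute-style call that returns the stale value and defers the refresh. It is an illustration under assumptions, not CHI's implementation: the compute and run_pending functions are hypothetical, and an in-process queue stands in for a real background worker.

```perl
use strict;
use warnings;

my %store;      # key => { value => ..., expires_at => ... }
my @pending;    # refresh jobs deferred until after the response is sent

# Hypothetical compute(): the current time is passed in for determinism
sub compute {
    my ( $key, $code, $expires_in, $now ) = @_;
    my $entry = $store{$key};

    # Nothing cached at all: no stale value to fall back on, so compute inline
    if ( !$entry ) {
        my $value = $code->();
        $store{$key} = { value => $value, expires_at => $now + $expires_in };
        return $value;
    }

    # Expired: schedule a refresh, but answer immediately with the old value
    if ( $now >= $entry->{expires_at} && !$entry->{refreshing} ) {
        $entry->{refreshing} = 1;    # avoid queueing the same key twice
        push @pending, sub {
            $store{$key} =
              { value => $code->(), expires_at => $now + $expires_in };
        };
    }
    return $entry->{value};
}

# Stand-in for a background worker: run whatever has been queued
sub run_pending {
    while ( my $job = shift @pending ) { $job->() }
    return;
}

my $version = 0;
my $builder = sub { return 'result v' . ++$version };

print compute( 'key', $builder, 60, 0 ),  "\n";   # computed inline: result v1
print compute( 'key', $builder, 60, 61 ), "\n";   # expired, stale: result v1
run_pending();
print compute( 'key', $builder, 60, 62 ), "\n";   # refreshed: result v2
```

    The very first request for a key still pays the full computation cost, since there is no old value to return.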

  • External recomputation (MS + RL)

    Recompute cache items entirely from an external process, either when change events occur or when items approach their expiration time. Items never actually expire as the result of a client request. This is the most efficient and client-friendly solution, if you can manage it.

    Drawbacks: Requires extra external processes (more moving parts). Code to recompute caches must be available from the external process, which can result in some unwanted code separation, API contortions, or repetition. It is also difficult to know which items to keep repopulating, and when exactly to recompute them.

  • Externally initiated recomputation (MS + RL)

    Use a periodic external process to trigger events that will naturally utilize your caches (e.g. write a cron job that hits common pages on your website), but pass a special flag making items more likely to expire. This makes it less likely that expiration will occur during a real client request.

    This is not yet supported in CHI, but the idea would be to add some kind of easily-accessible lever to temporarily view all expiration times as reduced. e.g.

        # Reduction ends when $lex goes out of scope
        my $lex = CHI->reduced_expirations(0.5);

    Drawbacks: Requires extra external processes (more moving parts). Triggers and their run frequencies must be carefully chosen.
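
    One way such a lexically scoped lever could work is a guard object that restores the previous setting when it is destroyed. The sketch below is purely illustrative: My::Cache::Sketch, its $Expiration_Factor variable, and effective_expires_at are hypothetical stand-ins, not part of CHI.

```perl
use strict;
use warnings;

package My::Cache::Sketch;    # hypothetical stand-in for CHI

our $Expiration_Factor = 1;   # 1 means expiration times are used as-is

# Returns a guard object; the previous factor is restored when it is destroyed
sub reduced_expirations {
    my ( $class, $factor ) = @_;
    my $previous = $Expiration_Factor;
    $Expiration_Factor = $factor;
    return bless { previous => $previous }, 'My::Cache::Sketch::Guard';
}

# An item's effective expiration time under the current factor
sub effective_expires_at {
    my ( $class, $created_at, $expires_at ) = @_;
    return $created_at + ( $expires_at - $created_at ) * $Expiration_Factor;
}

package My::Cache::Sketch::Guard;

sub DESTROY {
    my ($self) = @_;
    $My::Cache::Sketch::Expiration_Factor = $self->{previous};
    return;
}

package main;

{
    # Reduction ends when $lex goes out of scope
    my $lex = My::Cache::Sketch->reduced_expirations(0.5);
    print My::Cache::Sketch->effective_expires_at( 0, 100 ), "\n";    # 50
}
print My::Cache::Sketch->effective_expires_at( 0, 100 ), "\n";        # 100
```

    Because the factor lives in a package variable, the reduction applies to every cache in the process for as long as the guard is alive, which is what a cron-triggered request would want.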

What other techniques have you used, and what success/failures have you had with them?

Wednesday January 23, 2008
07:57 PM

CHI: Cache Interface for Perl

CHI, a module I've been working on for a few months, has made it to CPAN:

   file: $CPAN/authors/id/J/JS/JSWARTZ/CHI-0.03.tar.gz
   size: 62313 bytes
    md5: ec828f2466ba266e11cd6d1dd5ca2913

CHI provides a unified caching API, designed to assist a developer in persisting data for a specified period of time. It is intended as an evolution of DeWitt Clinton's Cache::Cache package, adhering to the basic Cache API but adding new features and addressing limitations in the Cache::Cache implementation.

You might think of it as a fledgling "DBI for caching".

Driver classes already exist for in-process memory, plain files, memory mapped files and memcached. Other drivers such as BerkeleyDB and DBI will be coming soon. Fortunately, implementing drivers is fairly easy, on the order of creating a tie interface to your data store.

Special thanks to the Hearst Digital Media group, where CHI was first designed and developed, for blessing the open source release of this code.

There's lots more in store for this module, so stay tuned! Feedback welcome here or on the Perl cache mailing list.

Thursday September 06, 2007
05:52 PM

A standard logging API

It seems as if every CPAN module has its own way of logging debug information and error conditions. For example:
  • LWP - activate by use'ing LWP::Debug; outputs to STDERR
  • DBI - activate by calling DBI->trace(); outputs to STDERR or a file
  • Rose::DB - activate by setting various $Debug package variables; outputs to STDERR
  • Encode::* - activate by modifying various DEBUG subroutines to return 1; outputs using warn()
  • Apache::* - activate by setting the Apache log level and restarting; outputs to the Apache logs

In addition, there must be CPAN modules that have interesting things to say but choose not to log at all, because they don't want to invent another logging mechanism or become dependent on an existing one.

This situation is pretty much the opposite of what I want when developing a large application. I want a single way to turn logging on and off, and to control where logs get sent, for all of the modules I'm using.

This being Perl, there are many fine logging frameworks available: Log::Log4perl, Log::Dispatch, Log::Handler, Log::Agent, Log::Trivial, etc. So why do CPAN modules eschew the use of these and invent their own mechanisms that are almost guaranteed to be less powerful?

  • The very existence of so many logging modules means that there is no one standard that a CPAN author would feel comfortable binding their users to. As usual, TMTOWTDI is a double-edged sword.
  • A logging framework can be a significant dependency for a module to have, easily dwarfing the size of the module itself. For small modules that want to minimize dependencies, depending on Log4perl (for example) is a non-starter.

A Common Log API

One thing to notice is that while the logging frameworks all differ in their configuration and activation API, and the set of features they support, the API to log messages is generally quite simple. At its core it consists of

  • A set of valid log levels, e.g. debug, info, warn, error, fatal
  • Methods to log a message at a particular level, e.g. $log->debug()
  • Methods to determine if a particular level is activated, e.g. $log->is_debug()

I expect most CPAN modules would happily stick to this API, and let the application worry about configuring what's getting logged and where it's going. Therefore...

Proposed Module: Log::Any

I propose a small module called Log::Any that provides this API, with no dependencies and no logging implementation of its own. Log::Any would be designed to be linked by the main application to an existing logging framework.

A CPAN module would use it like this:

    package Foo;
    use Log::Any;
    use Data::Dumper;
    my $log = Log::Any->get_logger(category => __PACKAGE__);

    $log->error("an error occurred");

    $log->debug("arguments are: " . Dumper(\@_))
        if $log->is_debug();

By default, methods like $log->debug would be no-ops, and methods like $log->is_debug() would return false.
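
The default behavior could be provided by a do-nothing logger along these lines. This is only a sketch: the My::Log::Null package name is made up for illustration, and the level list is the example set given above.

```perl
use strict;
use warnings;

package My::Log::Null;    # hypothetical name for the default do-nothing logger

my @levels = qw(debug info warn error fatal);

# Generate debug(), is_debug(), info(), is_info(), ... for every level
for my $level (@levels) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$level" }    = sub { return };      # log nothing
    *{ __PACKAGE__ . "::is_$level" } = sub { return 0 };    # never active
}

sub new { return bless {}, shift }

package main;

my $log  = My::Log::Null->new;
my @args = ( 1, 2, 3 );

$log->error("an error occurred");    # silently discarded

# The guard means the message string is never even built
$log->debug( "arguments are: " . join( ', ', @args ) ) if $log->is_debug();

print "is_debug: ", $log->is_debug(), "\n";    # is_debug: 0
```

A module that logs through such an object pays only the cost of a method call when the application has not configured logging.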

As a convenient shorthand, you can use

    package Foo;
    use Log::Any qw($log);

to create the logger, which is equivalent to the first example except that $log is (necessarily) a package-scoped rather than lexical variable.

How does an application activate logging? The low-level way is to call Log::Any->set_logger_factory (better name pending) with a single argument: a subroutine that takes a log category and returns a logger object implementing the standard logging API above. The log category is typically the class doing the logging, and it may be ignored.

For example, to link with Log::Log4perl:

    use Log::Any;
    use Log::Log4perl;

    Log::Log4perl->init("log.conf");
    Log::Any->set_logger_factory
       (sub { Log::Log4perl->get_logger(@_) });

To link with Log::Dispatch, with all categories going to the screen:

    use Log::Any;
    use Log::Dispatch;
    use Log::Dispatch::Screen;

    my $dispatcher = Log::Dispatch::Screen->new(...);
    Log::Any->set_logger_factory(sub { $dispatcher });

To link with Log::Dispatch, with different categories going to different dispatchers:

    use Log::Any;
    use Log::Dispatch;
    use Log::Dispatch::Screen;
    use Log::Dispatch::File;

    my $dispatcher_screen = Log::Dispatch::Screen->new(...);
    my $dispatcher_file   = Log::Dispatch::File->new(...);

    sub choose_dispatcher {
        my $category = shift;
        $category =~ /DBI|LWP/ ? $dispatcher_file : $dispatcher_screen;
    }
    Log::Any->set_logger_factory(\&choose_dispatcher);

This API is a little awkward for the average user. One solution is for logging frameworks themselves to provide more convenient mixins, e.g.:

   use Log::Dispatch;   # this also defines Log::Any::use_log_dispatch
   my $d = Log::Dispatch::File->new(...);
   Log::Any->use_log_dispatch($d);  # calls set_logger_factory for you

   use Log::Log4perl;   # this also defines Log::Any::use_log4perl
   Log::Any->use_log4perl();        # calls set_logger_factory for you

set_logger_factory would be implemented so as to take effect on all existing as well as future loggers. Any $log objects already created inside modules will automatically be switched when set_logger_factory is called. (i.e. $log will probably be a thin proxy object.) This means that Log::Any need not be initialized by the time it is used in CPAN modules, and it allows set_logger_factory to be called more than once per application.
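
The thin proxy idea can be sketched as follows. All package names here (My::LogAny::Sketch, its Proxy class, and My::Demo::Logger) are hypothetical stand-ins for the proposed module, and looking up the real logger on every call is just the simplest approach, not necessarily the fastest.

```perl
use strict;
use warnings;

package My::LogAny::Sketch;    # hypothetical stand-in for the proposed Log::Any

my @levels  = qw(debug info warn error fatal);
my $factory = sub { return undef };    # default: no real logger for any category

sub set_logger_factory { my ( $class, $code ) = @_; $factory = $code; return }

# The proxy only remembers its category; the real logger is looked up per call
sub get_logger {
    my ( $class, %args ) = @_;
    return bless { category => $args{category} }, 'My::LogAny::Sketch::Proxy';
}

package My::LogAny::Sketch::Proxy;

# Generate debug(), is_debug(), ... that forward to whatever the current
# factory returns, or do nothing if it returns no logger
for my $method ( map { ( $_, "is_$_" ) } @levels ) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$method" } = sub {
        my ( $self, @args ) = @_;
        my $real = $factory->( $self->{category} ) or return 0;
        return $real->$method(@args);
    };
}

package My::Demo::Logger;    # toy logger; implements only the debug level

sub new      { return bless { messages => [] }, shift }
sub is_debug { return 1 }
sub debug { my ( $self, $msg ) = @_; push @{ $self->{messages} }, $msg; return 1 }

package main;

my $log = My::LogAny::Sketch->get_logger( category => 'Foo' );  # created first
$log->debug("dropped");    # no factory yet, so nothing happens

my $demo = My::Demo::Logger->new;
My::LogAny::Sketch->set_logger_factory( sub { return $demo } );
$log->debug("kept");       # the same $log object now reaches $demo

print scalar @{ $demo->{messages} }, " message(s): @{ $demo->{messages} }\n";
```

The last line prints "1 message(s): kept": the logger created before set_logger_factory was called picks up the new factory without being recreated.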

Promoting Use

For Log::Any to be useful, a substantial number of modules - especially major modules - would have to adopt its use. Fortunately, with its minimal footprint and standalone nature, authors should not find Log::Any a difficult dependency to add. Existing logging mechanisms, such as LWP::Debug and $DBI::tfh, could easily be converted to write *both* to their existing output streams and to Log::Any. This would preserve backward compatibility for existing applications, but allow new applications to benefit from more powerful logging. I would be willing to submit such patches to major module authors to get things going.

Feedback welcome. Thanks!