
All the Perl that's Practical to Extract and Report


agent
  agentzh@yahoo.cn
http://agentzh.spaces.live.com/

Agent Zhang (章亦春) is a happy Yahoo! China guy who loves Perl more than anything else.

Journal of agent (5836)

Friday January 20, 2006
09:03 AM

use HTTP::Proxy to log my web accessing history

[ #28415 ]

Yeah, I visit many websites every day. What I've always wanted is a facility that automagically keeps a record of the URLs and page titles I've just accessed, so that I can analyze the history some time later, for example to find out the focus of my interest in a particular period of time. And it's quite likely that I could derive even more interesting statistics from it.

The Mozilla browser certainly has built-in support for browsing history, but unfortunately exporting that history info is not trivial. What I want is not only the URLs, but also the corresponding page titles (if any!) and the visiting timestamps.

Several weeks ago, I happily found that the CPAN module HTTP::Proxy could come to the rescue. All I need to do is write several lines of Perl using that module, run the script in the background as a local HTTP proxy server, and point my web browser at it. That way, my local proxy gets a chance to monitor all the HTTP traffic between my browser and the Internet.

It's fun to see that my local proxy server can itself use a remote proxy, so my local one becomes a secondary proxy, no? ;D
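In case it isn't obvious how that chaining works, here is a little sketch of my own (the port number and the upstream proxy URL below are made up): the proxy forwards requests through an LWP::UserAgent, and that agent can be told to honour the usual http_proxy environment variable, just like the customized agent later in this post.

```perl
# sketch: a local proxy on port 8080 that forwards everything
# through an upstream proxy named in $ENV{http_proxy}
# (upstream.example.com is a hypothetical host)
use HTTP::Proxy;
use LWP::UserAgent;

$ENV{http_proxy} = 'http://upstream.example.com:3128/';

my $ua    = LWP::UserAgent->new( env_proxy => 1 );  # picks up http_proxy
my $proxy = HTTP::Proxy->new( port => 8080 );
$proxy->agent( $ua );
$proxy->start;
```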

The HTTP::Proxy module also supports logging internally, thus my code is even simpler:

use HTTP::Proxy qw( :log );

my $home    = $ENV{HOME};    # the log file lives in my home directory
my $logfile = "$home/myproxy.log";
open my $log, '>>', $logfile
        or die "Can't open $logfile for appending: $!";

my $proxy = HTTP::Proxy->new(
        logmask => STATUS,
        logfh   => $log,
);

The logmask parameter here controls what kind of things the proxy should record. The STATUS constant indicates that only the basic URL and response code will be logged. What I get in the log file is something like this:

[Fri Jan 13 17:25:42 2006] (1888) REQUEST: GET http://www.perl.com/
[Fri Jan 13 17:25:53 2006] (1888) RESPONSE: 200 OK
[Fri Jan 13 17:25:53 2006] (1888) REQUEST: HEAD http://www.google.com/mozilla/google.src
[Fri Jan 13 17:25:54 2006] (1888) RESPONSE: 200 OK
[Fri Jan 13 17:25:54 2006] (1888) REQUEST: GET http://www.perl.com/styles/main.css
[Fri Jan 13 17:25:55 2006] (1888) RESPONSE: 304 Not Modified ...
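Since my whole point is to analyze this history later, a few lines of plain Perl (a sketch of my own, independent of HTTP::Proxy) can split such log lines back into their fields:

```perl
use strict;
use warnings;

# split a log line like
#   [Fri Jan 13 17:25:42 2006] (1888) REQUEST: GET http://www.perl.com/
# into timestamp, pid, event type, and detail
sub parse_log_line {
    my ($line) = @_;
    my ($stamp, $pid, $event, $detail) =
        $line =~ /^\[ ([^\]]+) \] \s+ \( (\d+) \) \s+ (\w+): \s+ (.*)/x
            or return;
    return { stamp => $stamp, pid => $pid, event => $event, detail => $detail };
}

my $rec = parse_log_line(
    '[Fri Jan 13 17:25:42 2006] (1888) REQUEST: GET http://www.perl.com/'
);
print "$rec->{event}: $rec->{detail}\n";    # REQUEST: GET http://www.perl.com/
```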

Hmm... very cute! However, HTTP::Proxy's built-in logging mechanism knows nothing about HTML titles, so I need to provide a user agent of my own:

package MyUA;

use HTTP::Proxy qw( :log );
use HTTP::Response;
use base 'LWP::UserAgent';

# set by the main program once the HTTP::Proxy object exists
our $proxy;

sub send_request {
        my ($self, $request) = @_;
        my $response;
        eval {
                $response = $self->SUPER::send_request( $request );
        };
        if ($@ and not $response) {
                return HTTP::Response->new(500, $@);
        }
        if ($response->is_success) {
                my $type = $response->header('Content-Type');
                if ($type and $type =~ m{text/html}i) {
                        if ($response->content =~ m{<title>\s*(.*?\S)\s*</title>}si) {
                                $proxy->log( STATUS, 'TITLE', $1 );
                        }
                }
        }
        return $response;
}

Now we have HTML titles recorded as well, as witnessed in my log file:

[Tue Jan 17 20:33:47 2006] (2484) REQUEST: GET http://perladvent.org/2004/20th/
[Tue Jan 17 20:33:50 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual
[Tue Jan 17 20:33:50 2006] (2484) RESPONSE: 200 OK
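And that makes the kind of statistics I mentioned at the beginning easy. For instance, this little sketch of mine tallies how often each title shows up (in real use the lines would come straight from myproxy.log; here I inline two samples):

```perl
use strict;
use warnings;

# tally how many times each page title appears in a list of log lines
sub tally_titles {
    my %count;
    for my $line (@_) {
        $count{$1}++ if $line =~ /\bTITLE:\s+(.*\S)/;
    }
    return \%count;
}

# in real use these would be read from the proxy log file
my @lines = (
    '[Tue Jan 17 20:33:50 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual',
    '[Tue Jan 17 20:41:12 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual',
);
my $count = tally_titles(@lines);
print "$count->{$_}\t$_\n" for sort keys %$count;
# prints: 2	Perl 2004 Advent Calendar: Filesys::Virtual
```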

Then I feed the customized user agent to the HTTP::Proxy instance I created earlier:

my $agent = MyUA->new(
        env_proxy => 1,
        timeout   => 100,
);
$MyUA::proxy = $proxy;    # give MyUA access to the proxy's logger
$proxy->agent( $agent );

At last, we enter an infinite loop, as every HTTP proxy server does; the eval restarts the proxy if it ever dies:

while (1) {
        eval { $proxy->start(); };
        warn $@ if $@;
}

That's it!

It already works for me, but there are still several pitfalls in this solution:

  • Images won't display in MS Internet Explorer (Mozilla works fine, however).
  • It seems that HTTP::Proxy doesn't fork by default, which leads to poor performance when I request multiple URLs simultaneously. (BTW, is there a way to switch to a forking engine? I can't find a word about it in the POD docs.)
  • SSL connections don't work on my box.

Have fun!

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Hi! I'm the author of HTTP::Proxy. :-) Glad you like it.

    I don't think you need to define your own agent to log that information. In fact, I think I should never have opened up the possibility of setting your own agent. You could simply use a response filter that catches the title tag and prints it to your log file.
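    A filter-based version might look something like this (only a sketch, using HTTP::Proxy::BodyFilter::simple; one caveat is that a title tag split across two data chunks would slip through unnoticed):

    ```perl
    use HTTP::Proxy qw( :log );
    use HTTP::Proxy::BodyFilter::simple;

    my $proxy = HTTP::Proxy->new( logmask => STATUS );

    # log the <title> of every HTML response passing through the proxy
    $proxy->push_filter(
        mime     => 'text/html',
        response => HTTP::Proxy::BodyFilter::simple->new(
            sub {
                my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
                $proxy->log( STATUS, 'TITLE', $1 )
                    if $$dataref =~ m{<title>\s*(.*?\S)\s*</title>}si;
            }
        ),
    );

    $proxy->start;
    ```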

    I also don't understand your while(1) loop. The $proxy->start() is already a while(1) loop.

    And you say that the proxy doesn't fork? That probably means you're running it under Win32, doesn't it? Alas, th

    • Thank you very much for your comments! Yeah, it's odd not to use the filter mechanism. Filters make things simpler.

      I'm so glad to receive feedback from you, the very author of HTTP::Proxy. :=)
    • Heeeeeeeello, I'm new to HTTP::Proxy and I was wondering if anyone could debug the following code for me... The script is written to display all actions performed while I'm browsing in the Cmd Prompt of Windows (since logfh defaults to *STDERR).

          use HTTP::Proxy;
          use HTTP::Recorder;
          my $proxy = HTTP::Proxy->new(logmask => ALL);
          $proxy->start();

      For some reason, no messages are displayed even though I'm browsing like crazy. Is there anything I missed? Thanks