Yeah, I visit many websites every day. What I want, and what I'm always looking for, is a facility that automagically keeps a record of the URLs and page titles I've just accessed, so that I can analyse the history later to find out, for example, where my interest was focused in a particular period of time. And it's very likely that I can come up with even more interesting statistics.
The Mozilla browser certainly has built-in support for browsing history, but unfortunately exporting that history is not trivial. What I want is not only the URLs, but also the corresponding page titles (if any!) and the time stamp of each visit.
Several weeks ago, I happily found that the CPAN module HTTP::Proxy can come to the rescue. All I need to do is write a few lines of Perl code using that module, run the script in the background as a local HTTP proxy server, and set my web browser to use it. By doing this, my local proxy gets a chance to monitor all the HTTP traffic between my browser and the Internet.
It's fun to see that my local proxy server can also use a remote proxy, so my local one then becomes a secondary proxy, no?
The HTTP::Proxy module also supports logging internally, so my code is even simpler:
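use HTTP::Proxy ':log';
# $home is assumed to hold the path to the home directory (set elsewhere)
my $logfile = "$home/myproxy.log";
# Open in append mode with the three-argument form of open
open my $log, '>>', $logfile
    or die "Can't open $logfile for appending: $!";
my $proxy = HTTP::Proxy->new(
    logmask => STATUS,
    logfh   => $log,
);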
The logmask parameter here controls what kinds of things the proxy should record. The STATUS constant indicates that only the basic URL and response code will be logged. What I get in the log file is something like this:
[Fri Jan 13 17:25:42 2006] (1888) REQUEST: GET http://www.perl.com/
[Fri Jan 13 17:25:53 2006] (1888) RESPONSE: 200 OK
[Fri Jan 13 17:25:53 2006] (1888) REQUEST: HEAD http://www.google.com/mozilla/google.src
[Fri Jan 13 17:25:54 2006] (1888) RESPONSE: 200 OK
[Fri Jan 13 17:25:54 2006] (1888) REQUEST: GET http://www.perl.com/styles/main.css
[Fri Jan 13 17:25:55 2006] (1888) RESPONSE: 304 Not Modified
Hmm... very cute! However, HTTP::Proxy's built-in logging mechanism doesn't record HTML titles. Thus I need to provide a user agent of my own:
package MyUA;
use HTTP::Proxy ':log';
use base 'LWP::UserAgent';

sub send_request {
    my ($self, $request) = @_;
    my $response;
    eval {
        $response = $self->SUPER::send_request( $request );
    };
    # Turn an exception from the underlying agent into a 500 response
    if ($@ and not $response) {
        return HTTP::Response->new(500, $@);
    }
    if ($response->is_success) {
        my $type = $response->header('content-type');
        if ($type and $type =~ m[text/html]i) {
            # Log the contents of the <title> element, trimmed of whitespace
            if ($response->content =~ m[<title>\s*(.*?\S)\s*</title>]si) {
                # $proxy is the file-scoped HTTP::Proxy object created earlier
                $proxy->log( STATUS, 'TITLE', $1 );
            }
        }
    }
    return $response;
}
Now the HTML titles are recorded as well, as witnessed in my log file:
[Tue Jan 17 20:33:47 2006] (2484) REQUEST: GET http://perladvent.org/2004/20th/
[Tue Jan 17 20:33:50 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual
[Tue Jan 17 20:33:50 2006] (2484) RESPONSE: 200 OK
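With titles in the log, the analysis I mentioned at the beginning becomes a small parsing job. Here is a minimal sketch, not a finished tool: the names parse_log and count_hosts are made up for illustration, and it assumes that the number in parentheses identifies the process that handled the request, so that a TITLE line belongs to the most recent REQUEST line carrying the same number.

```perl
use strict;
use warnings;

# Pair each TITLE line with the latest REQUEST line that has the same
# number in parentheses (assumed to be the id of the handling process).
sub parse_log {
    my @lines = @_;
    my (%last_request, @visits);
    for my $line (@lines) {
        # e.g. [Tue Jan 17 20:33:47 2006] (2484) REQUEST: GET http://...
        next unless $line =~ /^\[([^\]]+)\]\s+\((\d+)\)\s+(\w+):\s+(.*?)\s*$/;
        my ($time, $id, $kind, $rest) = ($1, $2, $3, $4);
        if ($kind eq 'REQUEST') {
            # $rest looks like "GET http://host/path"; keep only the URL
            my (undef, $url) = split ' ', $rest, 2;
            $last_request{$id} = { time => $time, url => $url };
        }
        elsif ($kind eq 'TITLE') {
            my $req = $last_request{$id} or next;
            push @visits, { %$req, title => $rest };
        }
    }
    return @visits;
}

# Count visits per host name, a first step towards "focus of interest".
sub count_hosts {
    my %count;
    for my $visit (@_) {
        my ($host) = $visit->{url} =~ m{^\w+://([^/:]+)};
        $count{$host}++ if defined $host;
    }
    return %count;
}

# Sample lines taken from the log excerpt above
my @sample = (
    '[Tue Jan 17 20:33:47 2006] (2484) REQUEST: GET http://perladvent.org/2004/20th/',
    '[Tue Jan 17 20:33:50 2006] (2484) TITLE: Perl 2004 Advent Calendar: Filesys::Virtual',
    '[Tue Jan 17 20:33:50 2006] (2484) RESPONSE: 200 OK',
);
for my $visit (parse_log(@sample)) {
    print join("\t", @$visit{qw(time url title)}), "\n";
}
```

Requests for images and style sheets never produce a TITLE line, so they drop out of the result on their own, which is what I want for this kind of statistics.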
Then I feed the customized user agent to the HTTP::Proxy instance I created earlier:
my $agent = MyUA->new(
env_proxy => 1,
timeout => 100,
);
$proxy->agent( $agent );
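The env_proxy option is what lets the local proxy chain to a remote one: with it, LWP::UserAgent picks up its proxy settings from environment variables such as http_proxy. A small sketch of that behaviour, using a plain LWP::UserAgent and a made-up upstream address (upstream.example.com is only a placeholder):

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical remote proxy; normally this is set in the shell
# before the script is started.
$ENV{http_proxy} = 'http://upstream.example.com:3128/';

# env_proxy => 1 tells the agent to read *_proxy environment variables
my $agent = LWP::UserAgent->new(
    env_proxy => 1,
    timeout   => 100,
);

# With a single argument, proxy() reports the proxy configured for a scheme
print "http requests go through: ", $agent->proxy('http'), "\n";
```

Since MyUA inherits from LWP::UserAgent, the same option should work there without any extra code.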
Finally, we enter an infinite loop, as every HTTP proxy server does:
while (1) {
eval { $proxy->start(); };
warn $@ if $@;
}
That's it!
It already works for me, but there are still several pitfalls in this solution:
Have fun!
Using your own agent... (Score:2)
Hi! I'm the author of HTTP::Proxy. :-) Glad you like it.
I don't think you need to define your own agent to log information. In fact, I think I should never have opened up the possibility of setting your own agent. You could simply use a response filter that catches the title tag and prints it to your log file.
I also don't understand your while(1) loop. The $proxy->start() is already a while(1) loop.
And you say that the proxy doesn't fork? That probably means you're running it under Win32, aren't you? Alas, th
Re:Using your own agent... (Score:1)
I'm so glad to receive feedback from you, the very author of HTTP::Proxy.