
Journal of scot (6695)

Tuesday January 02, 2007
04:35 PM

Simple extraction of links from web page


It took me far longer than I thought it would to come up with this code that grabs a web page and stuffs all the page's hyperlinks into a text file.
 
Updated...


use strict;
use warnings;
use WWW::Mechanize;

# usage: perl linkextractor.pl http://www.example.com/ > output.txt
my $url  = shift;
my $mech = WWW::Mechanize->new();

$mech->get($url);
my $status = $mech->status();
print "$status OK - URL request succeeded.\n";

# Print each link's URL on its own line.
print $_->url, "\n" foreach $mech->links;
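One caveat worth noting: unless the WWW::Mechanize object is built with autocheck enabled, get() doesn't die on a failed request, so the script above would print its success line even for a 404. A minimal sketch with an explicit check (using only documented WWW::Mechanize methods) might look like:

use strict;
use warnings;
use WWW::Mechanize;

# usage: perl linkextractor.pl http://www.example.com/ > output.txt
my $url  = shift;
my $mech = WWW::Mechanize->new( autocheck => 0 );  # check the response ourselves

$mech->get($url);
$mech->success()
    or die "Request failed with status " . $mech->status() . "\n";

print $_->url, "\n" foreach $mech->links;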

Comments:
  • I've now made this much easier in HTML::SimpleLinkExtor 1.14:

    linktractor -f=http://www.example.com > output.txt
    No need to work too hard, after all. :)
    • Thank you. Can you see any obvious snafus in the following code?

      use strict;
      use warnings;
      use HTML::SimpleLinkExtor;
      use WWW::Mechanize qw( );

      #usage linkextractor -f http://www.example.com/ [example.com] > output.txt

      my ($url) = @ARGV;

      my $mech = WWW::Mechanize->new();
      my $response = $mech->get($url);
      $response->is_success()
            or die($response->status_line() . "\n");

      my $extor = HTML::SimpleLinkExtor->new();
      $extor->parse( $response->decoded_content() );  # parse() wants HTML text, not the response object
      my @all_links = $extor->links;
      foreach my $elem (@all_links) {
            print $elem, "\n";
      }
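
      For comparison, the Mechanize step can be dropped entirely: a minimal sketch that fetches the page with LWP::Simple and hands the HTML straight to HTML::SimpleLinkExtor (both calls as documented by those modules) might look like:

      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use HTML::SimpleLinkExtor;

      my $url  = shift;
      my $html = get($url)
            or die "Could not fetch $url\n";

      my $extor = HTML::SimpleLinkExtor->new();
      $extor->parse($html);

      print $_, "\n" foreach $extor->links;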