
cog (4665)

Journal of cog (4665)

Saturday December 25, 2004
05:09 PM

samefile

[ #22455 ]
I'm cleaning up my hard disk.

Here's one of the tasks I came across: a directory with almost 200 files, some of which I know are identical; I just don't know which ones.

Here's samefile, the script I created to find those files:

#!/usr/bin/perl
use strict;
use warnings;
use File::Compare;
use Getopt::Std;

our %opts = get_options();

show_help()             if $opts{h};
show_version()          if $opts{V};

get_sizes();
find_copies();

# subroutines

sub find_copies {
  our %sizes;
  for (values %sizes) {
    my @files = @{$_};

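    # defined() so that a file literally named "0" does not end the loop early
    while (defined(my $f = shift @files)) {
      @files || next;
      # compare() returns 0 when the two files have identical contents
      my @copies = grep {! compare($f, $_)} @files or next;
      print "\"$f\"", (map {" \"$_\""} @copies), "\n";
      # drop the copies already reported so they are not compared again
      for my $s (@copies) {@files = grep {$_ ne $s} @files}
    }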
  }
}

sub get_sizes {
  our %sizes;
  for (@ARGV) {
    if (-f) {
      push @{$sizes{(stat)[7]}}, $_;
    }
    elsif ($opts{r} && -d) {
      push @ARGV, <"$_/*">;
    }
  }
}

sub get_options {
  my %opts;
  getopts('rhV', \%opts );

  for my $key ( keys %opts ) {
    $opts{$key} = 1 unless defined $opts{$key}
  }

  %opts;
}

sub show_help {
  die "Usage: samefile file1 file2
 or:   samefile -r *
samefile: identifies equal files

Options:
  -h         display this message and exit
  -r         recursive mode
  -V         show version and exit
"
}

sub show_version {
  die "samefile version 0.01\n";
}

It currently prints something like this:


$ samefile *
"file1" "file3" "file6"
"file4" "file5"

This means that file1, file3 and file6 are identical to one another, as are file4 and file5.
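Since each output line is one group of identical files, the output is easy to post-process. Here is a minimal sketch that keeps the first name in each group and lists the rest as candidates for removal. The helper name redundant_copies is made up for this example and is not part of samefile, and the sketch assumes that no file name contains a double quote. It only prints names; it deletes nothing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of samefile): given lines of samefile
# output, return every name except the first one in each group.
# Assumption: file names never contain a double quote.
sub redundant_copies {
    my @lines = @_;
    my @extra;
    for my $line (@lines) {
        # pull out everything between pairs of double quotes
        my ($keep, @rest) = $line =~ /"([^"]*)"/g;
        push @extra, @rest;
    }
    return @extra;
}

# The sample output shown above, fed in as literal strings
my @sample = ('"file1" "file3" "file6"', '"file4" "file5"');
print "$_\n" for redundant_copies(@sample);
```

With the sample above, this lists file3, file6 and file5, leaving file1 and file4 as the copies to keep.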

Comments on the output or anything else are welcome...

  • dupmerge [freshmeat.net]?
  • Thoughts on your code:

    File::Compare already does the size comparison thing for you so there's no need for you to collect filesizes.

    You never check for symbolic links and you are missing the opportunity to compare inodes. If two filenames are links (hard or symbolic) to the same file, there's no need to compare the file to itself.

    Another reason checking for symlinks is important is that if your code encounters a symlink to .. while recursing, it will just sit there twiddling thumbs. It's probably easi

    • Everything you say makes sense :-)

      I had no links, so that was not a problem. I managed to downsize 3.7G of a Portuguese TV show down to 2.8G with this... awesome :-)

      I'll take a look at dupmerge, as you suggest (it might end up in my ~/bin)

      Thanks :-)
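
The symlink and inode suggestions from the comment above could be sketched roughly like this. It is only an illustration, not the script's actual behaviour: unique_files is a made-up helper name, and the sketch assumes a POSIX-like filesystem where lstat reports meaningful device and inode numbers. Symbolic links are skipped, and names that share a device and inode (hard links) are collapsed to the first name seen, so no file would be compared with itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of samefile): filter a list of names
# before any content comparison takes place.
# Assumption: device and inode numbers from lstat are meaningful here.
sub unique_files {
    my @names = @_;
    my (%seen, @unique);
    for my $name (@names) {
        my @st = lstat $name or next;   # skip names that cannot be examined
        next if -l _;                   # skip symbolic links
        next unless -f _;               # keep plain files only
        my $key = "$st[0]:$st[1]";      # device number and inode number
        next if $seen{$key}++;          # hard link to a file already listed
        push @unique, $name;
    }
    return @unique;
}

# Print the names that would still need comparing
print "$_\n" for unique_files(@ARGV);
```

The list that comes back could then be grouped by size and compared as before. Skipping symlinks during recursion would also avoid following a link that points back up to a parent directory.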