Wednesday, February 10, 2010

Find Duplicate Files in the Terminal

I posted an Automator Service last week for finding duplicate photos in an iPhoto Library.  Here is a slightly modified version of the internal script it uses.  You can save this script and run it in a terminal to find duplicate files of any kind in any directory tree of your choice.  It can also be included in Automator workflows through the Shell Script action.

findDuplicates.pl
#!/usr/bin/perl

# ##################################### #
# Filename:      findDuplicates.pl
# Author:        Jeremy Pyne
# Licence:       CC:BY/NC/SA  http://creativecommons.org/licenses/by-nc-sa/3.0/
# Last Update:   02/10/2010
# Version:       1.5
# Requires:      perl
# Description:
#   This script will look through a directory of files and find any duplicates.  It will then
#   return a list of any such duplicates it finds.  This is done by calculating the md5 checksum
#   of each file and recording it along with the filename.  Then the list is sorted by the checksum
#   and read in line by line.  Any time multiple records in a row share a checksum the file names
#   are written out to stdout.  As a result all empty files will be flagged as duplicates as well.
# ##################################### #

# Get the path from the command line.  This could be expanded to provide more granular control.
$dir = shift;

# Set up the location of the temp files.
$file = "/tmp/pictures.txt";
$sort = "/tmp/sorted.txt";

# Find all files in the selected directory and calculate their md5 checksums.  This is by far the longest step.
`find "$dir" -type f -print0 | xargs -0 md5 -r > $file`;
# Sort the resulting file by checksum.
`sort $file > $sort`;

open FILE, "<$sort" or die $!;

my $newmd5;
my $newfile;
my $lastmd5;
my $lastfile;
my $lastprint = 0;

# Read each line from the file.
while(<FILE>) {
        # Extract the md5sum and the filename.
        $_ =~ /([^ ]+) (.+)/;

        $newmd5 = $1;
        $newfile = $2;

        # If this is the same checksum as the last file then flag it.
        if(defined($lastmd5) && $newmd5 eq $lastmd5)
        {
                # If this is the first duplicate for this checksum then print the first file's name.
                if(!$lastprint)
                {
                        print("$lastfile\n");
                        $lastprint = 1;
                }
                # Print the conflicting file's name.
                print("$newfile\n");
        }
        else
        {
                $lastprint = 0;
        }

        # Record the last filename and checksum for future testing.
        $lastmd5 = $newmd5;
        $lastfile = $newfile;
}

close(FILE);

# Remove the temp files.
unlink($file);
unlink($sort);
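
The same idea can also be sketched without the temp files or the external find, md5 and sort commands.  The version below is only an illustration, not the script the Automator Service uses: it assumes the core File::Find and Digest::MD5 modules are available, groups file names by checksum in a hash instead of sorting a list, and the find_duplicates name is made up for this example.  Like the script above, it will report all empty files as duplicates of each other.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Hypothetical helper: return a list of array refs, each holding the
# paths of the files under $dir that share an md5 checksum.
sub find_duplicates {
    my ($dir) = @_;
    my %by_md5;

    find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            # Only regular files; skip directories and symlinks.
            return unless -f $path && !-l $path;
            # Skip files that cannot be read.
            open(my $fh, '<', $path) or return;
            binmode($fh);
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
            push @{ $by_md5{$md5} }, $path;
        },
    }, $dir);

    # Keep only the checksums that were seen more than once.
    return map  { [ sort @{ $by_md5{$_} } ] }
           grep { @{ $by_md5{$_} } > 1 }
           sort keys %by_md5;
}

# Run only when a directory is given on the command line.
if (@ARGV) {
    for my $group (find_duplicates($ARGV[0])) {
        print "$_\n" for @$group;
    }
}
```

Run it the same way as the original, passing the directory to scan as the only argument.  Holding every file name in memory is fine for a photo library, but the sort-based approach above may be the better fit for very large trees.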

1 comment:

lauren said...

I tried the Automator Service that you posted for finding duplicate photos in an iPhoto Library. It's good to see that you provided a modified version, and for free. Thanks.