findDuplicates.pl

#!/usr/bin/perl
#
# #####################################
#
# Filename: findDuplicates.pl
# Author: Jeremy Pyne
# Licence: CC:BY/NC/SA http://creativecommons.org/licenses/by-nc-sa/3.0/
# Last Update: 02/10/2010
# Version: 1.5
# Requires: perl
# Description:
#   This script will look through a directory of files and find any duplicates. It will then
#   return a list of any such duplicates it finds. This is done by calculating the md5 checksum
#   of each file and recording it along with the filename. Then the list is sorted by the checksum
#   and read in line by line. Any time multiple records in a row share a checksum the file names
#   are written out to stdout. As a result all empty files will be flagged as duplicates as well.
#
# #####################################
#

use strict;
use warnings;

# Get the path from the command line. This could be expanded to provide more granular control.
my $dir = shift;
defined $dir or die "Usage: findDuplicates.pl <directory>\n";

# Set up the location of the temp files.
my $file = "/tmp/pictures.txt";
my $sort = "/tmp/sorted.txt";

# Find all files in the selected directory and calculate their md5sum. This is by far the longest step.
# "md5 -r" prints the checksum first, then the filename.
`find "$dir" -type f -print0 | xargs -0 md5 -r > $file`;

# Sort the resulting file by the md5sums.
`sort $file > $sort`;

open(my $fh, '<', $sort) or die $!;

my $newmd5;
my $newfile;
my $lastmd5;
my $lastfile;
my $lastprint = 0;

# Read each line from the file.
while (my $line = <$fh>) {
    # Extract the md5sum and the filename; skip lines that do not match.
    next unless $line =~ /([^ ]+) (.+)/;
    $newmd5 = $1;
    $newfile = $2;

    # If this is the same checksum as the last file then flag it.
    if (defined $lastmd5 && $newmd5 eq $lastmd5) {
        # If this is the first duplicate for this checksum then print the first file's name.
        if (!$lastprint) {
            print("$lastfile\n");
            $lastprint = 1;
        }

        # Print the conflicting file's name.
        print("$newfile\n");
    } else {
        $lastprint = 0;
    }

    # Record the last filename and checksum for future testing.
    $lastmd5 = $newmd5;
    $lastfile = $newfile;
}

close($fh);

# Remove the temp files.
unlink($file);
unlink($sort);
This blog is a collection of my various projects and computer related endeavors. Most of the posts deal with very specific issues/problems and solutions along with custom scripts, extensions, and various other subjects.
Wednesday, February 10, 2010
Find Duplicate Files in the Terminal
I posted an Automator Service last week for finding duplicate photos in an iPhoto Library. Here is a slightly modified version of the internal script it uses. You can save this script and run it in a terminal to find duplicate files of any kind in any directory tree of your choice. It can also be included in Automator workflows with the Run Shell Script action.
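For comparison, here is a rough sketch of the same idea done entirely inside Perl, without the temp files or the external find, md5, and sort commands. It is not the script the Automator Service uses. It assumes the core modules File::Find and Digest::MD5 are available, and the function name find_duplicate_groups is made up for this example. Instead of sorting a checksum list and comparing neighbouring lines, it groups file paths in a hash keyed by checksum.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Hypothetical helper (not part of the original script): returns a list of
# array refs, each holding the paths of files that share an MD5 checksum.
sub find_duplicate_groups {
    my ($dir) = @_;
    my %paths_by_md5;

    find({
        no_chdir => 1,    # keep the working directory, so full paths stay valid
        wanted   => sub {
            my $path = $File::Find::name;
            return if -l $path;        # skip symlinks, like "find -type f"
            return unless -f $path;    # regular files only
            open(my $fh, '<', $path) or return;    # skip unreadable files
            binmode($fh);
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
            push @{ $paths_by_md5{$md5} }, $path;
        },
    }, $dir);

    # Keep only the checksums that were seen for more than one file.
    return map  { [ sort @{ $paths_by_md5{$_} } ] }
           grep { @{ $paths_by_md5{$_} } > 1 }
           sort keys %paths_by_md5;
}

# When a directory is given on the command line, print each group of
# duplicates, separated by a blank line.
if (@ARGV) {
    for my $group (find_duplicate_groups($ARGV[0])) {
        print "$_\n" for @$group;
        print "\n";
    }
}
```

Saved under a name of your choosing and run with a directory as its only argument, this prints one block of file names per set of duplicates. As with the script above, all empty files share a checksum and so will be reported as duplicates of each other.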
1 comment:
I tried the Automator Service that you posted for finding duplicate photos in an iPhoto Library. It's good to see that you provided a modified version, and for free. Thanks.