Managing Duplicate Files
I use a headless Linux server with a RAID5 disk array for backing up files and as a general filestore for non-critical data. Over time, especially when changing computers or running out of disk space, I'm sure I've made some mistakes in migrating files across. In one case, a disk on an old machine became corrupt and, to this day, I'm not certain I retrieved everything. Fortunately, that was non-critical data and I don't miss it. It's the equivalent of putting boxes in the garage without sorting through them.
Spring Cleaning
And like boxes in the garage, it's worth clearing them out on a regular basis. Here's my problem: I'm sure I've got duplicate files all over the place. I'm also sure each copy sits in a logical directory; some files simply have good reason to be in several.
A quick search for a way to find duplicates (other than building a recursive file list and sorting it by filesize and filename) turned up fdupes. It looks to be a great little tool.
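For comparison, here is a rough sketch of the do-it-yourself route. It's a variant of the manual approach: rather than sorting by filesize and filename, it groups files by MD5 checksum, which catches copies that have been renamed. The function name list_dupes_by_checksum is my own invention, and the sketch assumes GNU find, md5sum and uniq (the -w and --all-repeated options are GNU extensions). Filenames containing newlines or backslashes will confuse it.

```shell
# Print groups of files under directory $1 that share an MD5 checksum.
# Each output line is "<checksum>  <path>"; groups are separated by a
# blank line. Files with a unique checksum are not printed.
list_dupes_by_checksum() {
    find "$1" -type f -exec md5sum {} + |
        sort |
        uniq -w32 --all-repeated=separate
}
```

Run it as `list_dupes_by_checksum .` from the top directory. It reads every file in full, so on a large filestore it will be slow.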
Fdupes
Install it as normal. I use Ubuntu, so I write from that perspective:
sudo apt-get install fdupes
Navigate to the top directory from which you wish to search. I had a bunch of mp3 mixes that I wanted to work through. These were the files I used to use to pass ideas between collaborators. At the time, they were works-in-progress, now they're records of progress.
fdupes -r -S . > files.txt
The -r recurses into subdirectories and the -S displays the size of the duplicate files.
Then work through files.txt and decide which files to delete. Since it's a headless server, I opened two sessions: one to display the file and another to delete files.
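Before deleting anything, it can help to know how much space is at stake. The sketch below totals the bytes that would be freed by keeping one copy from each group in files.txt. The function name reclaimable_bytes is hypothetical. It also assumes the report layout I believe -S produces: a "<size> bytes each:" header, then one path per line, with a blank line between groups. Check your own files.txt against that before trusting the number.

```shell
# Sum the space freed if one copy per duplicate group is kept.
# Assumes the "fdupes -S" layout: a "<size> bytes each:" header line,
# one path per line, and a blank line between groups.
reclaimable_bytes() {
    awk '
        /^[0-9]+ bytes? each:$/ { size = $1; n = 0; next }  # new group
        /^$/                    { size = 0; next }          # group ended
        size > 0                { n++; if (n > 1) total += size }
        END                     { print total + 0 }
    ' "$1"
}
```

Calling `reclaimable_bytes files.txt` prints a single number of bytes. The first file in each group is counted as the copy you keep.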
There's a good discussion in the comments on the Ubuntu Blog about the pros and cons of fdupes and a few other options.