Dariusz on Software

Linux: How To Locate Duplicated Files Quickly
Sat, 25 Feb 2012 13:34:11 +0000

To locate likely duplicates quickly, compare only file size and file name instead of computing an expensive MD5 sum for every file:

find . -type f -printf "%f:%s:%p\n" | \
    awk -F: '
        {
            key=$1 " " $2;
            occur[key]++;
            loc[key]=loc[key] $3 " "
        }
        END {
            for(key in occur) {
                if(occur[key]>1) {
                    print key ": " loc[key]
                }
            }
        }
    ' | sort

A bit of explanation of the magic above:

  • -printf: tells find to output file metadata (name, size, full path) instead of only the file path (the default); the name and size are used later
  • -F: : sets ":" as the awk field separator, so that paths with spaces are handled properly (names or paths that themselves contain ":" will be split incorrectly)
  • key=$1 " " $2: the file name (without directory) and the file size together form the ID of a file
  • occur: table (key -> number of file occurrences)
  • loc: maps a file ID to the list of locations where it was found
  • occur[key]>1: we want to show only files that have duplicates
  • sort: results are sorted alphabetically for easier navigation
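
To see what the output looks like, here is a minimal sketch that wraps the same pipeline in a function and runs it on a throwaway directory tree. The function name find_dup_candidates and the sample file names are made up for this illustration, and GNU find is assumed (for -printf).

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example) wrapping the
# name+size pipeline. Assumes GNU find, because -printf is a GNU extension.
find_dup_candidates() {
    # $1: directory to scan (defaults to the current directory)
    find "${1:-.}" -type f -printf "%f:%s:%p\n" |
        awk -F: '
            {
                key = $1 " " $2             # file name + size in bytes
                occur[key]++                # how many times this ID was seen
                loc[key] = loc[key] $3 " "  # collect the locations
            }
            END {
                for (key in occur)
                    if (occur[key] > 1)
                        print key ": " loc[key]
            }
        ' | sort
}

# Sample tree: two files with the same name and size, plus one unique file.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b"
printf 'hello\n' > "$demo/a/note.txt"
printf 'hello\n' > "$demo/b/note.txt"
printf 'other\n' > "$demo/b/unique.txt"

# Prints one line starting with "note.txt 6:" followed by both paths.
find_dup_candidates "$demo"

rm -rf "$demo"
```

Keep in mind that a match on name and size only marks a candidate. If certainty is needed, running md5sum on just the listed paths is still much cheaper than checksumming the whole tree.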
Tags: linux.
