# Finding Duplicate Files Fast

Ever wanted to find files with identical content? With file hashing and a bit of cleverness, PowerShell identifies duplicate files in no time.

Over time, tons of duplicate files may have accumulated on your hard drives. To find duplicate files and free space, there are two safe assumptions:

- Identical files always have the same size.
- Identical files always have the same content hash.

With these assumptions, PowerShell can quickly identify all duplicate files in any given folder.

## Ready-to-Use Function Find-PSOneDuplicateFile (for the Impatient Reader)

Here is the ready-to-use function Find-PSOneDuplicateFile (significantly optimized for speed) in case you are impatient. To get and use it, either install the module PSOneTools:

```powershell
Install-Module -Name PSOneTools -Scope CurrentUser -MinimumVersion 1.7 -Force
```


Or copy and paste the source code below:

```powershell
function Find-PSOneDuplicateFile
{
  <#
      .SYNOPSIS
      Identifies files with duplicate content

      .DESCRIPTION
      Returns a hashtable with the hashes that have at least two files (duplicates)

      .EXAMPLE
      $Path = [Environment]::GetFolderPath('MyDocuments')
      Find-PSOneDuplicateFile -Path $Path
      Find duplicate files in the user documents folder

      .EXAMPLE
      Find-PSOneDuplicateFile -Path c:\windows -Filter *.log
      find log files in the Windows folder with duplicate content

      .LINK
      https://powershell.one
  #>

  param
  (
    # Path of folder to recursively search
    [String]
    [Parameter(Mandatory)]
    $Path,

    # Filter to apply. Default is '*' (all files)
    [String]
    $Filter = '*'
  )

  # get a hashtable of all files of size greater 0
  # grouped by their length

  # ENUMERATE ALL FILES RECURSIVELY
  # call scriptblocks directly and pipe them together
  # this is by far the fastest way and much faster than
  # using Foreach-Object:
  & {
    try
    {
      # try and use the fast API way of enumerating files recursively:
      Write-Progress -Activity 'Acquiring Files' -Status 'Fast Method'
      [IO.DirectoryInfo]::new($Path).GetFiles('*', 'AllDirectories')
    }
    catch
    {
      # use PowerShell's own (slow) way of enumerating files if any error occurs:
      Write-Progress -Activity 'Acquiring Files' -Status 'Falling Back to Slow Method'
      Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore
    }
  } |
  # EXCLUDE EMPTY FILES:
  # use direct process blocks with IF (which is much faster than Where-Object):
  & {
    process
    {
      # if the file has content...
      if ($_.Length -gt 0)
      {
        # ...let it pass through:
        $_
      }
    }
  } |
  # GROUP FILES BY LENGTH, AND RETURN ONLY FILES WHERE THERE IS AT LEAST ONE
  # OTHER FILE WITH THE SAME SIZE
  # use direct scriptblocks with an own hashtable (much faster than Group-Object):
  & {
    begin
    {
      $hash = @{}
    }

    process
    {
      # group files by their length
      # (use "Length" as the hashtable key):
      $file = $_
      $key = $file.Length.ToString()

      # if we see this key for the first time, create a generic
      # list to hold group items, and store FileInfo objects in this list
      # (specialized generic lists are faster than ArrayList):
      if ($hash.ContainsKey($key) -eq $false)
      {
        $hash[$key] = [Collections.Generic.List[System.IO.FileInfo]]::new()
      }

      # add the file to the appropriate hashtable key:
      $hash[$key].Add($file)
    }

    end
    {
      # return only the files from groups with at least two files
      # (if there is only one file with a given length, then it
      # cannot have any duplicates for sure):
      foreach($pile in $hash.Values)
      {
        # are there at least 2 files in this pile?
        if ($pile.Count -gt 1)
        {
          # yes, add it to the candidates:
          $pile
        }
      }
    }
  } |
  # CALCULATE THE NUMBER OF FILES TO HASH
  # collect all files and hand them over en-bloc:
  & {
    end { ,@($input) }
  } |
  # GROUP FILES BY HASH, AND RETURN ONLY HASHES THAT HAVE AT LEAST TWO FILES:
  # use a direct scriptblock call with a hashtable (much faster than Group-Object):
  & {
    begin
    {
      $hash = @{}

      # since this is a lengthy procedure, a progress bar is in order
      # keep a counter of processed files:
      $c = 0
    }

    process
    {
      $totalNumber = $_.Count
      foreach($file in $_)
      {
        # update the progress bar every 20 files:
        $c++
        if ($c % 20 -eq 0)
        {
          $percentComplete = $c * 100 / $totalNumber
          Write-Progress -Activity 'Hashing File Content' -Status $file.Name -PercentComplete $percentComplete
        }

        # use the file hash of this file PLUS the file length as the hashtable key
        # use the fastest algorithm, SHA1:
        $result = Get-FileHash -Path $file.FullName -Algorithm SHA1
        $key = '{0}:{1}' -f $result.Hash, $file.Length

        # if we see this key for the first time, add a generic list to this key:
        if ($hash.ContainsKey($key) -eq $false)
        {
          $hash.Add($key, [Collections.Generic.List[System.IO.FileInfo]]::new())
        }

        # add the file to the appropriate group:
        $hash[$key].Add($file)
      }
    }

    end
    {
      # remove all hashtable keys with only one file in them

      # first, CLONE the list of hashtable keys
      # (we cannot remove hashtable keys while enumerating the live
      # keys list):
      $keys = @($hash.Keys).Clone()

      # enumerate all keys...
      foreach($key in $keys)
      {
        # ...if a key has only one file, remove it:
        if ($hash[$key].Count -eq 1)
        {
          $hash.Remove($key)
        }
      }

      # return the hashtable with only duplicate files left:
      $hash
    }
  }
}
```

Once you have Find-PSOneDuplicateFile, here is sample code that illustrates how you can use it to identify all files with identical content:

```powershell
# get path to personal documents folder. That's the place we want
# to check for duplicate files.
# You can of course assign any path to $Path:
# $Path = 'c:\some\folder\tocheck'
$Path = [Environment]::GetFolderPath('MyDocuments')

# check for duplicate files:
$result = Find-PSOneDuplicateFile -Path $Path

# output duplicates:
& {
  foreach($key in $result.Keys)
  {
    foreach($file in $result[$key])
    {
      $file |
        Add-Member -MemberType NoteProperty -Name Hash -Value $key -PassThru |
        Select-Object Hash, Length, FullName
    }
  }
} | Format-Table -GroupBy Hash -Property FullName
```

The result looks similar to this:

```
   Hash: 1ADF61AF293321E46B90E4895247047DF0C576BF:1525

FullName
--------
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....

   Hash: 9068F3F2CC2884BF4497D800F4F2DA486A9DEA11:4019

FullName
--------
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....

   Hash: 4EAD535C11BAE32DC28F1F8091D8480048E077A6:11598

FullName
--------
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....

   Hash: 153B4366A6A59923432AA01BD8B9552DF453B3DA:16357

FullName
--------
C:\Users\tobia\OneDrive\Dokumente\psconfeu\2020\Automation\Backups\psconfeu2...
C:\Users\tobia\OneDrive\Dokumente\psconfeu\2020\Automation\Backups\psconfeu2...

   Hash: 015A91B39DD7A6BB26411377A60BB58952EDEE63:527

FullName
--------
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....

...
```

To actually understand the code, you might want to read the rest of the article.

## About File Hashes

A hash is a short string that uniquely represents a piece of information: a clever mathematical algorithm reads the content and composes a short string from it. The same content always produces the same hash, and in practice no other content produces that hash.
### Calculating File Hashes

File hashes are calculated by Get-FileHash, and you can pick the algorithm used for the calculation as well:

```powershell
# find out the path to the current powershell
# replace it with any file path you would like to hash:
$path = Get-Command -Name powershell.exe |
  Select-Object -ExpandProperty Source

# calculate the file hash:
$result = Get-FileHash -Path $path -Algorithm MD5

# output the hash which is located in a property called "Hash":
$result.Hash
```

### Secure Hash Algorithms

In reality, depending on the hash algorithm used, there is a slight chance that two different contents produce the same hash. That is of course bad, especially when hashes are used for security-related tasks like these:

- Passwords are internally saved as hashes: when someone enters a password, a hash is generated and compared to the hash on file.
- Digital signatures encrypt the hash of a file with the private key of a certificate; to validate a digital signature, the file is hashed again and compared to the decrypted hash of the signature.

Obviously, if a hash algorithm has a chance of producing the same hash for different input, it is considered insecure.

For our purpose of identifying duplicate files, things are much more relaxed. Even with insecure hash algorithms, the chance is minute that different content produces the same hash. By combining the hash with the file size, it is almost impossible for any hash algorithm to come up with false positives. Which is good, because less secure hash algorithms are typically also less complex and thus much faster to calculate.
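To see the core property in action, here is a minimal sketch (the temp-folder file names are arbitrary choices for this demo): two files with identical content always produce the identical hash, regardless of their names.

```powershell
# minimal demo (file names are arbitrary): identical content
# always yields the identical hash, regardless of file name:
$fileA = Join-Path -Path $env:TEMP -ChildPath 'hashdemo-a.txt'
$fileB = Join-Path -Path $env:TEMP -ChildPath 'hashdemo-b.txt'
'Hello World' | Set-Content -Path $fileA
'Hello World' | Set-Content -Path $fileB

# both hashes are identical, so the comparison yields $true:
$hashA = (Get-FileHash -Path $fileA -Algorithm SHA1).Hash
$hashB = (Get-FileHash -Path $fileB -Algorithm SHA1).Hash
$hashA -eq $hashB

# clean up the demo files:
Remove-Item -Path $fileA, $fileB
```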
### Fast Hash Algorithms

There are many different mathematical approaches to calculating hashes, and they differ considerably in complexity (and thus speed). Here is the hash calculation speed I measured (MiB/s, higher is better):

| Algorithm | Speed (MiB/s) |
|-----------|--------------:|
| SHA3-512  | 198 |
| SHA3-256  | 367 |
| SHA-256   | 413 |
| SHA512    | 623 |
| MD5       | 632 |
| SHA-1     | 909 |

Since I am going to compute the hash for potentially thousands of files with potentially huge content, I picked the fastest available algorithm, SHA-1.

## Strategy to Identify Duplicates

Finding duplicate files involves processing thousands of files. Calculating file hashes can be a very expensive operation: the entire file needs to be read in order to calculate its hash. That can take a long time; just think of all the large files like videos or logs. That's why it is crucial to minimize the number of files that actually require a hash calculation, and to call Get-FileHash on as few files as possible.

### Identify Potential Duplicates

To minimize the number of files that need hash calculation, I use the first safe assumption for file duplicates as a pre-filter:

- Identical files have the same size.
- In other words: if there is just one file with a given size, it cannot have duplicates.

The piece of PowerShell code below creates a list of files that may have duplicates and require further testing. Essentially, it limits the number of files that require hash calculation as much as possible:

```powershell
# start scanning here:
# (default to personal documents folder)
# use any other path if you like:
# i.e.: $Path = 'c:\windows'
$Path = [Environment]::GetFolderPath('MyDocuments')

# get a hashtable of all files of size greater 0
# grouped by their length:
$group = Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore |
  # EXCLUDE empty files...
  Where-Object Length -gt 0 |
  # group them by their LENGTH...
  Group-Object -Property Length -AsHashTable

# take each pile in the hashtable (grouped by their length)
# and return all files from piles with more than one element:
$candidates = foreach($pile in $group.Values)
{
  # are there at least 2 files in this pile?
  if ($pile.Count -gt 1)
  {
    # yes, add it to the candidates:
    $pile
  }
}

# these are files that CAN have duplicates and require more
# testing:
$candidates
```

#### Learning Points

Before we move on, here are some learning points and annotations:

- The sample code scans your personal Documents folder. To reliably find the folder, I use the static method GetFolderPath() provided by [Environment]. This method works even if your Documents folder has been redirected to OneDrive.
- The code groups all files by the property Length to find out whether there are at least two files with the same length. Grouping is done by Group-Object. If you use Windows PowerShell, there is a bug in Group-Object that can slow down this cmdlet extremely. As I have analyzed previously, the bug does not affect you as long as you use the parameter -AsHashTable. So we are safe here.
- Group-Object returns a hashtable with Name-Value pairs. Name is the file length, and Value holds all files with that length:

```
PS> $group

Name                           Value
----                           -----
9040                           {testing.xlsx}
5943808                        {tutorial.doc}
25249                          {GetExcelSheetInfo.png}
133120                         {ILSpy.BamlDecompiler.Plugin.dll}
4158976                        {planning.doc}
112240                         {001.jpg}
...
```

To find only files that have at least one other file of the same length, I need to go through the Values and include only arrays with at least two elements. To do this, I must use a classic foreach loop over the hashtable Values. I cannot use Foreach-Object and the pipeline, and here is why:

- A pipeline always unwraps array elements and processes one item at a time. So if I had piped $group.Values to Foreach-Object, I would have ended up with one file at a time and could no longer tell how many files there are per length.
- foreach returns all arrays with at least two files ($pile.Count -gt 1), and the result is one new array with all the files that have at least one other file of the same size somewhere.

To better understand how a loop can unwrap a series of arrays, here is a simplified example:

```powershell
# array with three arrays
$valuesBefore = (1..3),(10..12),(100..103)

# array has 3 elements:
$valuesBefore.Count

# loop returns each array in the array:
$valuesAfter = foreach($value in $valuesBefore)
{
  $value
}

# new array is "unpacked" with 10 elements now:
$valuesAfter.Count
```

### Calculating File Hashes

In $candidates, I now have a list of files that each have at least one other file of the same size somewhere. These may or may not be duplicates. To tell for sure, I now have to calculate the file hashes.

I’d like to group all files by their file hashes, and then check if there are groups with more than one file in them. These would be duplicates.

Fortunately, Group-Object can group objects based on Calculated Properties, so I can submit a scriptblock that dynamically calculates the hash of each file:

```powershell
$duplicates = $candidates |
  # group all files by their hash, placing files with equal content
  # in the same group:
  Group-Object -Property {
    (Get-FileHash -Path $_.FullName -Algorithm SHA1).Hash
  } -AsHashTable -AsString
```

In the example above, -AsString wouldn't really be necessary because the file hash is already a string. However, Windows PowerShell has a bug with calculated properties: they only work correctly when you use -AsString. In PowerShell 7, this bug is fixed.

$duplicates again holds a hashtable with Name-Value pairs. Name is the unique hash of the file content, and Value holds the files with that hash.
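The same calculated-property pattern works with any scriptblock. Here is a small generic sketch (grouping running processes by the first letter of their name is just an arbitrary illustration, not part of the duplicate finder):

```powershell
# arbitrary illustration of grouping by a calculated property:
# group running processes by the first letter of their name.
# -AsString stringifies the scriptblock result so hashtable
# lookups work reliably in Windows PowerShell, too:
$byLetter = Get-Process |
  Group-Object -Property { $_.Name.Substring(0, 1).ToUpper() } -AsHashTable -AsString

# all processes whose name starts with "S":
$byLetter['S']
```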

So any group with more than one file is a duplicate, and any group with just one file in it is ok:

```
PS> $duplicates

Name                           Value
----                           -----
861B8A8BB5BFB8845A12F7599F2... {documentation.doc}
8F6FD57BB42EC1770025F7AB9EC... {recipe chocolate fudge.doc}
CC7BA02BF71E7166E6E826FF178... {psconfeu2019.doc}
A51755794151667FEDC675008C1... {Header32.png}
FA7B939E162E93B86AE0BB784B4... {sources.cpp}
264F14AE80061C88C0F8792F7BD... {dbserver.log}
...
```

### Identifying Real Duplicates

The last part is identifying the real duplicates: I remove all Name-Value pairs from the hashtable that have only one file in them:

```powershell
# test number of unique files:
$duplicates.Count

# take all keys of the hashtable
$keys = $duplicates.Keys
# IMPORTANT: clone the list!
$keys = $keys.Clone()

# take all keys...
$keys |
  # ...look at all keys that have just one element...
  Where-Object { $duplicates[$_].Count -eq 1 } |
  # ...and remove these from the hashtable:
  ForEach-Object { $duplicates.Remove($_) }

# the hashtable size is now reduced to REAL duplicates:
$duplicates.Count
```


Removing entries from a hashtable can be a hassle: you must make sure that you are not enumerating any part of the hashtable while changing it, or else you end up with an ugly exception:

```
An error occurred while enumerating through a collection:
Collection was modified; enumeration operation may not execute.
```
In the script above, I am using the keys to enumerate through all elements of the hashtable. To avoid the exception, I first Clone the list of keys, then use the cloned list in my loop. This way, I can change the hashtable any way I like, and remove all keys that contain just one file.
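The clone-then-remove pattern works for any hashtable. Here is a minimal sketch with made-up data:

```powershell
# minimal sketch: safely remove keys while "enumerating" the hashtable
# by looping over a CLONED snapshot of the keys instead of the live list:
$ht = @{ a = 1; b = 2; c = 3 }

# snapshot of the keys (@() materializes the live key collection):
$keys = @($ht.Keys).Clone()

foreach ($key in $keys)
{
  # remove every entry with a value greater than 1:
  if ($ht[$key] -gt 1)
  {
    $ht.Remove($key)
  }
}

# only the entry 'a' is left:
$ht.Count
```

Looping over the live $ht.Keys collection instead would raise the "Collection was modified" exception on the first Remove() call.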

## Putting It All Together

Identifying duplicate files is useful, so let's put all the parts together and bake them into a useful function: Find-PSOneDuplicateFile.

I have added a couple of speed optimizations plus some progress bars, so Find-PSOneDuplicateFile is really astonishingly fast.

```powershell
function Find-PSOneDuplicateFile
{
  <#
      .SYNOPSIS
      Identifies files with duplicate content

      .DESCRIPTION
      Returns a hashtable with the hashes that have at least two files (duplicates)

      .EXAMPLE
      $Path = [Environment]::GetFolderPath('MyDocuments')
      Find-PSOneDuplicateFile -Path $Path
      Find duplicate files in the user documents folder

      .EXAMPLE
      Find-PSOneDuplicateFile -Path c:\windows -Filter *.log
      find log files in the Windows folder with duplicate content

      .LINK
      https://powershell.one
  #>

  param
  (
    # Path of folder to recursively search
    [String]
    [Parameter(Mandatory)]
    $Path,

    # Filter to apply. Default is '*' (all files)
    [String]
    $Filter = '*'
  )

  # get a hashtable of all files of size greater 0
  # grouped by their length

  # ENUMERATE ALL FILES RECURSIVELY
  # call scriptblocks directly and pipe them together
  # this is by far the fastest way and much faster than
  # using Foreach-Object:
  & {
    try
    {
      # try and use the fast API way of enumerating files recursively:
      Write-Progress -Activity 'Acquiring Files' -Status 'Fast Method'
      [IO.DirectoryInfo]::new($Path).GetFiles('*', 'AllDirectories')
    }
    catch
    {
      # use PowerShell's own (slow) way of enumerating files if any error occurs:
      Write-Progress -Activity 'Acquiring Files' -Status 'Falling Back to Slow Method'
      Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore
    }
  } |
  # EXCLUDE EMPTY FILES:
  # use direct process blocks with IF (which is much faster than Where-Object):
  & {
    process
    {
      # if the file has content...
      if ($_.Length -gt 0)
      {
        # ...let it pass through:
        $_
      }
    }
  } |
  # GROUP FILES BY LENGTH, AND RETURN ONLY FILES WHERE THERE IS AT LEAST ONE
  # OTHER FILE WITH THE SAME SIZE
  # use direct scriptblocks with an own hashtable (much faster than Group-Object):
  & {
    begin
    {
      $hash = @{}
    }

    process
    {
      # group files by their length
      # (use "Length" as the hashtable key):
      $file = $_
      $key = $file.Length.ToString()

      # if we see this key for the first time, create a generic
      # list to hold group items, and store FileInfo objects in this list
      # (specialized generic lists are faster than ArrayList):
      if ($hash.ContainsKey($key) -eq $false)
      {
        $hash[$key] = [Collections.Generic.List[System.IO.FileInfo]]::new()
      }

      # add the file to the appropriate hashtable key:
      $hash[$key].Add($file)
    }

    end
    {
      # return only the files from groups with at least two files
      # (if there is only one file with a given length, then it
      # cannot have any duplicates for sure):
      foreach($pile in $hash.Values)
      {
        # are there at least 2 files in this pile?
        if ($pile.Count -gt 1)
        {
          # yes, add it to the candidates:
          $pile
        }
      }
    }
  } |
  # CALCULATE THE NUMBER OF FILES TO HASH
  # collect all files and hand them over en-bloc:
  & {
    end { ,@($input) }
  } |
  # GROUP FILES BY HASH, AND RETURN ONLY HASHES THAT HAVE AT LEAST TWO FILES:
  # use a direct scriptblock call with a hashtable (much faster than Group-Object):
  & {
    begin
    {
      $hash = @{}

      # since this is a lengthy procedure, a progress bar is in order
      # keep a counter of processed files:
      $c = 0
    }

    process
    {
      $totalNumber = $_.Count
      foreach($file in $_)
      {
        # update the progress bar every 20 files:
        $c++
        if ($c % 20 -eq 0)
        {
          $percentComplete = $c * 100 / $totalNumber
          Write-Progress -Activity 'Hashing File Content' -Status $file.Name -PercentComplete $percentComplete
        }

        # use the file hash of this file PLUS the file length as the hashtable key
        # use the fastest algorithm, SHA1:
        $result = Get-FileHash -Path $file.FullName -Algorithm SHA1
        $key = '{0}:{1}' -f $result.Hash, $file.Length

        # if we see this key for the first time, add a generic list to this key:
        if ($hash.ContainsKey($key) -eq $false)
        {
          $hash.Add($key, [Collections.Generic.List[System.IO.FileInfo]]::new())
        }

        # add the file to the appropriate group:
        $hash[$key].Add($file)
      }
    }

    end
    {
      # remove all hashtable keys with only one file in them

      # first, CLONE the list of hashtable keys
      # (we cannot remove hashtable keys while enumerating the live
      # keys list):
      $keys = @($hash.Keys).Clone()

      # enumerate all keys...
      foreach($key in $keys)
      {
        # ...if a key has only one file, remove it:
        if ($hash[$key].Count -eq 1)
        {
          $hash.Remove($key)
        }
      }

      # return the hashtable with only duplicate files left:
      $hash
    }
  }
}

# get path to personal documents folder:
$Path = [Environment]::GetFolderPath('MyDocuments')

# check for duplicate files:
$result = Find-PSOneDuplicateFile -Path $Path

# output duplicates:
& {
  foreach($key in $result.Keys)
  {
    foreach($file in $result[$key])
    {
      $file |
        Add-Member -MemberType NoteProperty -Name Hash -Value $key -PassThru |
        Select-Object Hash, Length, FullName
    }
  }
} | Format-Table -GroupBy Hash -Property FullName
```

Here are some of the tricks I used in the final function to speed it up considerably:

### Fast File Enumeration

Get-ChildItem can recursively search folders, but it is very slow, especially in Windows PowerShell. A much faster file enumeration uses [System.IO.DirectoryInfo] and its method GetFiles(). This approach has one issue, though: if there is just one file that can't be accessed, the entire method fails. So I used try…catch and a fallback mechanism: should GetFiles() fail, the function uses the slow Get-ChildItem instead.

### Faster Foreach-Object and Where-Object

As pointed out elsewhere, Foreach-Object and Where-Object are very slow when you send a lot of objects through the pipeline. A much faster approach uses direct scriptblock calls.

### Faster Group-Object

As pointed out elsewhere, Group-Object has a bug in Windows PowerShell and performs poorly. A much faster approach, which is faster even in PowerShell 7, uses a hashtable and groups objects manually.

### Faster ArrayList

To collect the elements of a group, I use generic lists. They are faster and more efficient than ArrayList objects.

### Faster Progress Bar

Progress bars can easily be implemented using Write-Progress, and they are a good idea for long-running scripts to provide feedback to the user. Updating the progress bar adds a time penalty, though, so it is not wise to update it for every single file processed. Instead, I use a counter and the modulo operator (%) to update the progress bar only every 20 files.

### Efficiently Calculating Number of Files

To show a real progress bar, it is necessary to know the total number of files to be processed. Only then can you calculate what percentage has been processed at any given time.
A pipeline is a streaming construct, so the total number of files to be processed is not known in advance. That's why I use a simple trick to collect all files at one point in the pipeline and pass them on en-bloc:

```powershell
& { end { ,@($input) } }
```

I am calling a scriptblock that uses only an end block and no attributes. This creates a simple function, which automatically collects all piped input in $input. $input is not an array but an enumerator. To pass all collected files en-bloc to the next pipeline command, I wrap $input in @() (effectively turning the enumerator into an array) and prepend a comma. This wraps the array of files inside another, outer array.

When this nested array is passed to the next pipeline command, the pipeline unwraps only the outer array, and the inner array is handed to the next pipeline command in one piece.

This way, the next pipeline command can determine the total size of the array and then use a loop to process the content of the array.
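Here is a minimal sketch of the trick in isolation, using plain numbers instead of files:

```powershell
# the middle scriptblock collects all piped items and hands them over
# as ONE array, because the leading comma wraps @($input) in an outer
# array that the pipeline unwraps again:
1..3 | & { end { ,@($input) } } | & {
  process
  {
    # $_ is now the complete inner array with all three items:
    "Received $($_.Count) items at once"
  }
}
```

The receiving process block runs only once, because it receives a single inner array instead of three individual items, and it can read the total count from that array before looping over it.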