Finding Duplicate Files Fast

Ever wanted to find files with identical content? With file hashing and a bit of cleverness, PowerShell identifies duplicate files in no time.

Over time, tons of duplicate files may have accumulated on your hard drives. To find duplicate files and free up space, you can rely on two safe assumptions:

  • Identical files always have the same size
  • Identical files always have the same file content hash

With these assumptions, PowerShell can quickly identify all duplicate files in any given folder. And if you have to deal with large files, calculating partial hashes can speed things up even more.

Ready-to-use function Find-PSOneDuplicateFile (for the impatient reader)

Here is the ready-to-use function Find-PsOneDuplicateFile (significantly optimized for speed) in case you are impatient. To get it, either install the module PSOneTools:

Install-Module -Name PSOneTools -Scope CurrentUser -MinimumVersion 2.2 -Force

Or start with Find-PsOneDuplicateFile and copy and paste the source code below:

function Find-PSOneDuplicateFile
{
  <#
      .SYNOPSIS
      Identifies files with duplicate content

      .DESCRIPTION
      Returns a hashtable with the hashes that have at least two files (duplicates)

      .EXAMPLE
      $Path = [Environment]::GetFolderPath('MyDocuments')
      Find-PSOneDuplicateFile -Path $Path
      Find duplicate files in the user documents folder

      .EXAMPLE
      Find-PSOneDuplicateFile -Path c:\windows -Filter *.log 
      find log files in the Windows folder with duplicate content

      .LINK
      https://powershell.one
  #>


  param
  (
    # Path of folder to recursively search
    [String]
    [Parameter(Mandatory)]
    $Path,
  
    # Filter to apply. Default is '*' (all Files) 
    [String]
    $Filter = '*'
  )

  # get a hashtable of all files of size greater than 0
  # grouped by their length
  
  
  # ENUMERATE ALL FILES RECURSIVELY
  # call scriptblocks directly and pipe them together
  # this is by far the fastest way and much faster than
  # using Foreach-Object:
  & { 
    try
    {
      # try and use the fast API way of enumerating files recursively
      # this FAILS whenever there are any "Access Denied" errors
      Write-Progress -Activity 'Acquiring Files' -Status 'Fast Method'
      [IO.DirectoryInfo]::new($Path).GetFiles('*', 'AllDirectories')
    }
    catch
    {
      # use PowerShell's own (slow) way of enumerating files if any error occurs:
      Write-Progress -Activity 'Acquiring Files' -Status 'Falling Back to Slow Method'
      Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore
    }
  } | 
  # EXCLUDE EMPTY FILES:
  # use direct process blocks with IF (which is much faster than Where-Object):
  & {
    process
    {
      # if the file has content...
      if ($_.Length -gt 0)
      {
        # let it pass through:
        $_
      }
    }
  } | 
  # GROUP FILES BY LENGTH, AND RETURN ONLY FILES WHERE THERE IS AT LEAST ONE
  # OTHER FILE WITH SAME SIZE
  # use direct scriptblocks with own hashtable (which is much faster than Group-Object)
  & { 
    begin 
    # start with an empty hashtable
    { $hash = @{} } 

    process 
    { 
      # group files by their length
      # (use "length" as hashtable key)
      $file = $_
      $key = $file.Length.toString()
      
      # if we see this key for the first time, create a generic
      # list to hold group items, and store FileInfo objects in this list
      # (specialized generic lists are faster than ArrayList):
      if ($hash.ContainsKey($key) -eq $false) 
      {
        $hash[$key] = [Collections.Generic.List[System.IO.FileInfo]]::new()
      }
      # add file to appropriate hashtable key:
      $hash[$key].Add($file)
    } 
  
    end 
    { 
      # return only the files from groups with at least two files
      # (if there is only one file with a given length, then it 
      # cannot have any duplicates for sure):
      foreach($pile in $hash.Values)
      {
        # are there at least 2 files in this pile?
        if ($pile.Count -gt 1)
        {
          # yes, add it to the candidates
          $pile
        }
      }
    } 
  } | 
  # CALCULATE THE NUMBER OF FILES TO HASH
  # collect all files and hand over en-bloc
  & {
    end { ,@($input) }
  } |
  # GROUP FILES BY HASH, AND RETURN ONLY HASHES THAT HAVE AT LEAST TWO FILES:
  # use a direct scriptblock call with a hashtable (much faster than Group-Object):
  & {
    begin 
    {
      # start with an empty hashtable
      $hash = @{}
      
      # since this is a lengthy procedure, a progress bar is in order
      # keep a counter of processed files:
      $c = 0
    }
      
    process
    {
      $totalNumber = $_.Count
      foreach($file in $_)
      {
      
        # update progress bar
        $c++
      
        # update progress bar every 20 files:
        if ($c % 20 -eq 0)
        {
          $percentComplete = $c * 100 / $totalNumber
          Write-Progress -Activity 'Hashing File Content' -Status $file.Name -PercentComplete $percentComplete
        }
      
        # use the file hash of this file PLUS file length as a key to the hashtable
        # use the fastest algorithm SHA1
        $result = Get-FileHash -Path $file.FullName -Algorithm SHA1
        $key = '{0}:{1}' -f $result.Hash, $file.Length
      
        # if we see this key the first time, add a generic list to this key:
        if ($hash.ContainsKey($key) -eq $false)
        {
          $hash.Add($key, [Collections.Generic.List[System.IO.FileInfo]]::new())
        }
      
        # add the file to the appropriate group:
        $hash[$key].Add($file)
      }
    }
      
    end
    {
      # remove all hashtable keys with only one file in them
      
      # first, CLONE the list of hashtable keys
      # (we cannot remove hashtable keys while enumerating the live
      # keys list):
      # remove keys
      $keys = @($hash.Keys).Clone()
      
      # enumerate all keys...
      foreach($key in $keys)
      {
        # ...if key has only one file, remove it:
        if ($hash[$key].Count -eq 1)
        {
          $hash.Remove($key)
        }
      }
       
      # return the hashtable with only duplicate files left:
      $hash
    }
  }
}

Once you have Find-PsOneDuplicateFile, here is sample code that illustrates how you can use it to identify all files with identical content:

# get path to personal documents folder. That's the place we want
# to check for duplicate files.
# You can of course assign any path to $Path:
# $Path = 'c:\some\folder\tocheck'
$Path = [Environment]::GetFolderPath('MyDocuments')

# check for duplicate files:
$result = Find-PSOneDuplicateFile -Path $Path 

# output duplicates
& { foreach($key in $result.Keys)
{
    foreach($file in $result[$key])
    {
        $file |
            Add-Member -MemberType NoteProperty -Name Hash -Value $key -PassThru | 
            Select-Object Hash, Length, FullName 
    }
}
} | Format-Table -GroupBy Hash -Property FullName

The result looks similar to this:

   Hash: 1ADF61AF293321E46B90E4895247047DF0C576BF:1525

FullName                                                                       
--------                                                                       
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....


   Hash: 9068F3F2CC2884BF4497D800F4F2DA486A9DEA11:4019

FullName                                                                       
--------                                                                       
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....


   Hash: 4EAD535C11BAE32DC28F1F8091D8480048E077A6:11598

FullName                                                                       
--------                                                                       
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ImportExcel\6.5....


   Hash: 153B4366A6A59923432AA01BD8B9552DF453B3DA:16357

FullName                                                                       
--------                                                                       
C:\Users\tobia\OneDrive\Dokumente\psconfeu\2020\Automation\Backups\psconfeu2...
C:\Users\tobia\OneDrive\Dokumente\psconfeu\2020\Automation\Backups\psconfeu2...


   Hash: 015A91B39DD7A6BB26411377A60BB58952EDEE63:527

FullName                                                                       
--------                                                                       
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ISESteroids\2.7....
...

To actually understand the code, you might want to read the rest of the article.

Ready-to-use function Find-PSOneDuplicateFileFast (for the impatient reader)

Meanwhile, I added another function called Find-PsOneDuplicateFileFast to the module PSOneTools:

Install-Module -Name PSOneTools -Scope CurrentUser -MinimumVersion 2.3 -Force

This function is dramatically faster when you need to check large files and/or slow network connections. The idea behind Find-PsOneDuplicateFileFast is that it doesn’t make sense to always calculate the full hash for large files. Instead, when large files have the same size, and when a significant part of their content is identical, then in most cases it is safe to assume that the entire files are identical.

Here is some code to test-drive Find-PsOneDuplicateFileFast:

$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# use partial hashes for files larger than 100KB:
$result = Find-PSOneDuplicateFileFast -Path ([Environment]::GetFolderPath('MyDocuments'))  -MaxFileSize 100KB
$stopwatch.Stop()
[PSCustomObject]@{
  Seconds = $stopwatch.Elapsed.TotalSeconds
  Count = $result.Count
  }

More details can be found further below in this article.

About File Hashes

A hash is a short string that uniquely represents a piece of information: a clever mathematical algorithm reads the content and composes a short string from it. The same content always produces the same hash, and (ideally) no other content ever produces this hash.
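
You can verify this behavior yourself with a quick throw-away experiment (a minimal sketch; the file names in the temp folder are made up for this demo, and it uses Get-FileHash, which is covered in more detail below):

# create two files with identical content, and one with different content:
$fileA = Join-Path -Path $env:TEMP -ChildPath 'hashdemo_a.txt'
$fileB = Join-Path -Path $env:TEMP -ChildPath 'hashdemo_b.txt'
$fileC = Join-Path -Path $env:TEMP -ChildPath 'hashdemo_c.txt'

'Hello World!' | Set-Content -Path $fileA
'Hello World!' | Set-Content -Path $fileB
'Something else entirely' | Set-Content -Path $fileC

# identical content produces identical hashes, different content a different hash:
Get-FileHash -Path $fileA, $fileB, $fileC -Algorithm SHA1 |
    Select-Object -Property Hash, Path

# clean up the demo files:
Remove-Item -Path $fileA, $fileB, $fileC

The first two files report the same hash, the third one a different hash.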

Calculating File Hashes

Creating file hashes is done by Get-FileHash, and you can pick the algorithm used for calculating the hash as well:

# find out the path to the current powershell 
# replace it with any file path you would like to hash
$path = Get-Command -Name powershell.exe | 
           Select-Object -ExpandProperty Source
           
# calculate the file hash:
$result = Get-FileHash -Path $path -Algorithm MD5

# output the hash whichis located in a property called "Hash"
$result.Hash

Secure Hash Algorithms

In reality, based on the hash algorithm used, there is a slight chance that two different contents produce the same hash. That is of course bad, especially when hashes are used for security-related tasks like these:

  • Passwords are internally saved as hashes, and when someone enters a password, a hash is generated and compared to the hash on file.
  • Digital signatures encrypt the hash of a file with the private key of a certificate, and to validate a digital signature, the file is hashed again and compared to the decrypted hash of the signature.

Obviously, if the hash algorithm has a chance of producing the same hash with different input, it is considered insecure.

For our purposes of identifying duplicate files, things are way more relaxed. Even with insecure hash algorithms, the chance is minute that different content produces the same hash. By combining the hash with the file size, it is almost impossible for any hash algorithm to come up with false positives.

Which is good, because less secure hash algorithms are typically also less complex and thus much faster to calculate.
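
To illustrate the combination of hash and file size mentioned above, here is a minimal sketch (notepad.exe is just an arbitrary example file) that builds the same kind of composite key the functions below use:

# take any file (notepad.exe is just an example):
$file = Get-Item -Path "$env:windir\notepad.exe"

# calculate its hash:
$hash = (Get-FileHash -Path $file.FullName -Algorithm SHA1).Hash

# combine hash and length into one composite key:
# two files only count as duplicates when BOTH parts match
$key = '{0}:{1}' -f $hash, $file.Length
$key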

Fast Hash Algorithms

There are many different mathematical approaches to calculate unique hashes, and they differ considerably in complexity (and thus speed):

Hash Calculation Speed (MiBps, higher is better)

Algorithm    Speed (MiBps)
---------    -------------
SHA3-512               198
SHA3-256               367
SHA-256                413
SHA512                 623
MD5                    632
SHA-1                  909

(Data Source)

Since I am going to compute hashes for potentially thousands of files with potentially huge content, I picked the fastest available algorithm: SHA-1.
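
If you want to get a rough feeling for the differences on your own machine, here is a simple benchmark sketch. It only covers the algorithms that Get-FileHash supports, the sample path is just a placeholder, and the absolute numbers will of course differ from the table above:

# pick a reasonably large file for meaningful numbers
# (the path below is just a placeholder - use a big file of your own):
$path = "$env:windir\System32\shell32.dll"

foreach($algorithm in 'SHA1','MD5','SHA256','SHA512')
{
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    $null = Get-FileHash -Path $path -Algorithm $algorithm
    $stopwatch.Stop()

    [PSCustomObject]@{
        Algorithm    = $algorithm
        Milliseconds = [Math]::Round($stopwatch.Elapsed.TotalMilliseconds, 1)
    }
}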

Strategy to Identify Duplicates

Finding duplicate files involves processing thousands of files. Calculating file hashes can be a very expensive operation: the entire file needs to be read in order to calculate its hash. That can take a long time; just think of all the large files like video files or logs.

That’s why it is crucial to minimize the number of files that actually require a hash calculation, and call Get-FileHash on as few files as possible.

Identify Potential Duplicates

To minimize the number of files that need hash calculation, I use the first safe assumption for file duplicates to do a pre-filtering:

  • Identical files have the same size.
  • In other words: if there is just one file with a given size, it cannot have duplicates.

The following piece of PowerShell code creates a list of files that may have duplicates and require further testing. Essentially, I am limiting the number of files that require hash calculation as much as possible:

# start scanning here:
# (default to personal documents folder)
# use any other path if you like:
# i.e.: $Path = 'c:\windows'
$Path = [Environment]::GetFolderPath('MyDocuments')

# get a hashtable of all files of size greater than 0
# grouped by their length:
$group = Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore |
    # EXCLUDE empty files...
    Where-Object Length -gt 0 |
    # group them by their LENGTH...
    Group-Object -Property Length -AsHashTable 
    
# take each pile in the hashtable (grouped by their length)
# and return all files from piles greater than one element:
$candidates = foreach($pile in $group.Values)
{
    # are there at least 2 files in this pile?
    if ($pile.Count -gt 1)
    {
    	# yes, add it to the candidates
        $pile
    }
}
    
# these are files that CAN have duplicates and require more
# testing:
$candidates

Learning Points

Before we move on, here are some learning points and annotations:

  • The sample code scans your personal Documents folder. To reliably find the folder, I use the static method GetFolderPath() provided by [Environment]. This method works even if your Documents folder has been redirected to OneDrive.

  • The code groups all files by property Length to find out whether there are at least two files with the same length. Grouping is done by Group-Object. If you use Windows PowerShell, there is a bug in Group-Object that can slow this cmdlet down dramatically. As I have analyzed previously, the bug does not affect you as long as you use the parameter -AsHashTable. So we are safe here.

  • Group-Object returns a HashTable with Name-Value pairs. Name are the file lengths, and Value are all files with that length:

    PS> $group
      
    Name                           Value        
    ----                           ----- 
    9040                           {testing.xlsx} 
    5943808                        {tutorial.doc}  
    25249                          {GetExcelSheetInfo.png}  
    133120                         {ILSpy.BamlDecompiler.Plugin.dll} 
    4158976                        {planning.doc}  
    112240                         {001.jpg}    
    ...
    

    To find only files that have at least one other file of same length, I need to go through the Values and include only arrays with at least 2 elements. To do this, I must use a classic foreach loop to loop through the Hashtable Values. I cannot use Foreach-Object and the pipeline, and here is why:

    A pipeline always unwraps array elements and processes one item at a time. So if I had piped $group.Values to Foreach-Object, I would end up with one file at a time. I’d no longer be able to know how many files there are per length.

  • foreach returns all arrays with at least two files ($pile.Count -gt 1), and the result is one new array with all files that have at least one other file of the same size somewhere.

    To better understand how a loop can unwrap a series of arrays, here is a simplified example:

    # array with three arrays
    $valuesBefore = (1..3),(10..12),(100..103)
    # array has 3 elements:
    $valuesBefore.Count
      
    # loop returns each array in array:
    $valuesAfter = foreach($value in $valuesBefore)
    {
        $value
    }
    # new array is "unpacked" with 10 elements now:
    $valuesAfter.Count
    

Calculating File Hashes

In $candidates, I now have a list of files with at least one other file somewhere of same size. These can or cannot be duplicates. To tell for sure, I now have to calculate the File Hashes.

I’d like to group all files by their file hashes, and then check if there are groups with more than one file in them. These would be duplicates.

Fortunately, Group-Object can group objects based on Calculated Properties, so I can submit a scriptblock that dynamically calculates the hash of each file:

$duplicates = $candidates |
  # group all files by their hash, placing files with equal content
  # in the same group
  Group-Object -Property {
        (Get-FileHash -Path $_.FullName -Algorithm SHA1).Hash
    } -AsHashTable -AsString

In the example above, -AsString wouldn’t really be necessary because the file hash is already a string.

However, on Windows PowerShell a bug exists with calculated properties. They only work right when you use -AsString. On PowerShell 7, this bug is fixed.

$duplicates returns a hashtable with Name-Value pairs again. Name is the unique file hash of the file content, and Value are the files with that file hash.

So any group with more than one file is a duplicate, and any group with just one file in it is ok:

PS> $duplicates
Name                           Value             
----                           -----           
861B8A8BB5BFB8845A12F7599F2... {documentation.doc}     
8F6FD57BB42EC1770025F7AB9EC... {recipe chocolate fudge.doc}  
CC7BA02BF71E7166E6E826FF178... {psconfeu2019.doc}   
A51755794151667FEDC675008C1... {Header32.png}      
FA7B939E162E93B86AE0BB784B4... {sources.cpp}   
264F14AE80061C88C0F8792F7BD... {dbserver.log}
...

Identifying Real Duplicates

The last part is identifying the real duplicates: I am removing all Name-Value pairs from the hashtable that have only one file in them:

# test number of unique files:
$duplicates.Count

# take all keys of the hashtable
$keys = $duplicates.Keys
# IMPORTANT: clone the list!
$keys = $keys.Clone()

# take all keys...
$keys | 
    # look at all keys that have just one element...
    Where-Object { $duplicates[$_].Count -eq 1 } | 
    # remove these from the hashtable:
    ForEach-Object { $duplicates.Remove($_) }

#reduced hashtable size to REAL duplicates:
$duplicates.Count

Removing entries from a hashtable can be a hassle: you must make sure that you are not enumerating any part of the hashtable while changing it, or else you end up with an ugly exception:

An error occurred while enumerating through a collection: Collection was modified; enumeration operation may not execute..

In the script above, I am using the keys to enumerate through all elements of the hashtable. To avoid the exception, I first Clone the list of keys, then use the cloned list in my loop. This way, I can change the hashtable any way I like, and remove all keys that contain just one file.
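
Here is the same pattern in isolation, with a small made-up hashtable, in case you want to experiment with it:

# sample hashtable: "B" and "C" hold just one element each:
$test = @{ A = 1,2; B = 3; C = 4 }

# removing keys while enumerating the LIVE key collection throws
# "Collection was modified; enumeration operation may not execute":
# foreach($key in $test.Keys) { if ($test[$key].Count -eq 1) { $test.Remove($key) } }

# cloning the key list first is safe:
$keys = @($test.Keys).Clone()
foreach($key in $keys)
{
    if ($test[$key].Count -eq 1)
    {
        $test.Remove($key)
    }
}

# only "A" is left:
$test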

Putting It All Together

Identifying duplicate files is useful, so let’s put all the parts together and bake them into a reusable function: Find-PSOneDuplicateFile.

I have added a couple of speed optimizations plus some progress bars, so Find-PSOneDuplicateFile is really astonishingly fast.

function Find-PSOneDuplicateFile
{
  <#
      .SYNOPSIS
      Identifies files with duplicate content

      .DESCRIPTION
      Returns a hashtable with the hashes that have at least two files (duplicates)

      .EXAMPLE
      $Path = [Environment]::GetFolderPath('MyDocuments')
      Find-PSOneDuplicateFile -Path $Path
      Find duplicate files in the user documents folder

      .EXAMPLE
      Find-PSOneDuplicateFile -Path c:\windows -Filter *.log 
      find log files in the Windows folder with duplicate content

      .LINK
      https://powershell.one
  #>


  param
  (
    # Path of folder to recursively search
    [String]
    [Parameter(Mandatory)]
    $Path,
  
    # Filter to apply. Default is '*' (all Files) 
    [String]
    $Filter = '*'
  )

  # get a hashtable of all files of size greater than 0
  # grouped by their length
  
  
  # ENUMERATE ALL FILES RECURSIVELY
  # call scriptblocks directly and pipe them together
  # this is by far the fastest way and much faster than
  # using Foreach-Object:
  & { 
    try
    {
      # try and use the fast API way of enumerating files recursively
      # this FAILS whenever there are any "Access Denied" errors
      Write-Progress -Activity 'Acquiring Files' -Status 'Fast Method'
      [IO.DirectoryInfo]::new($Path).GetFiles('*', 'AllDirectories')
    }
    catch
    {
      # use PowerShell's own (slow) way of enumerating files if any error occurs:
      Write-Progress -Activity 'Acquiring Files' -Status 'Falling Back to Slow Method'
      Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore
    }
  } | 
  # EXCLUDE EMPTY FILES:
  # use direct process blocks with IF (which is much faster than Where-Object):
  & {
    process
    {
      # if the file has content...
      if ($_.Length -gt 0)
      {
        # let it pass through:
        $_
      }
    }
  } | 
  # GROUP FILES BY LENGTH, AND RETURN ONLY FILES WHERE THERE IS AT LEAST ONE
  # OTHER FILE WITH SAME SIZE
  # use direct scriptblocks with own hashtable (which is much faster than Group-Object)
  & { 
    begin 
    # start with an empty hashtable
    { $hash = @{} } 

    process 
    { 
      # group files by their length
      # (use "length" as hashtable key)
      $file = $_
      $key = $file.Length.toString()
      
      # if we see this key for the first time, create a generic
      # list to hold group items, and store FileInfo objects in this list
      # (specialized generic lists are faster than ArrayList):
      if ($hash.ContainsKey($key) -eq $false) 
      {
        $hash[$key] = [Collections.Generic.List[System.IO.FileInfo]]::new()
      }
      # add file to appropriate hashtable key:
      $hash[$key].Add($file)
    } 
  
    end 
    { 
      # return only the files from groups with at least two files
      # (if there is only one file with a given length, then it 
      # cannot have any duplicates for sure):
      foreach($pile in $hash.Values)
      {
        # are there at least 2 files in this pile?
        if ($pile.Count -gt 1)
        {
          # yes, add it to the candidates
          $pile
        }
      }
    } 
  } | 
  # CALCULATE THE NUMBER OF FILES TO HASH
  # collect all files and hand over en-bloc
  & {
    end { ,@($input) }
  } |
  # GROUP FILES BY HASH, AND RETURN ONLY HASHES THAT HAVE AT LEAST TWO FILES:
  # use a direct scriptblock call with a hashtable (much faster than Group-Object):
  & {
    begin 
    {
      # start with an empty hashtable
      $hash = @{}
      
      # since this is a lengthy procedure, a progress bar is in order
      # keep a counter of processed files:
      $c = 0
    }
      
    process
    {
      $totalNumber = $_.Count
      foreach($file in $_)
      {
      
        # update progress bar
        $c++
      
        # update progress bar every 20 files:
        if ($c % 20 -eq 0)
        {
          $percentComplete = $c * 100 / $totalNumber
          Write-Progress -Activity 'Hashing File Content' -Status $file.Name -PercentComplete $percentComplete
        }
      
        # use the file hash of this file PLUS file length as a key to the hashtable
        # use the fastest algorithm SHA1
        $result = Get-FileHash -Path $file.FullName -Algorithm SHA1
        $key = '{0}:{1}' -f $result.Hash, $file.Length
      
        # if we see this key the first time, add a generic list to this key:
        if ($hash.ContainsKey($key) -eq $false)
        {
          $hash.Add($key, [Collections.Generic.List[System.IO.FileInfo]]::new())
        }
      
        # add the file to the appropriate group:
        $hash[$key].Add($file)
      }
    }
      
    end
    {
      # remove all hashtable keys with only one file in them
      
      # first, CLONE the list of hashtable keys
      # (we cannot remove hashtable keys while enumerating the live
      # keys list):
      # remove keys
      $keys = @($hash.Keys).Clone()
      
      # enumerate all keys...
      foreach($key in $keys)
      {
        # ...if key has only one file, remove it:
        if ($hash[$key].Count -eq 1)
        {
          $hash.Remove($key)
        }
      }
       
      # return the hashtable with only duplicate files left:
      $hash
    }
  }
}


# get path to personal documents folder
$Path = [Environment]::GetFolderPath('MyDocuments')

# check for duplicate files:
$result = Find-PSOneDuplicateFile -Path $Path 

# output duplicates
& { foreach($key in $result.Keys)
{
    foreach($file in $result[$key])
    {
        $file |
            Add-Member -MemberType NoteProperty -Name Hash -Value $key -PassThru | 
            Select-Object Hash, Length, FullName 
    }
}
} | Format-Table -GroupBy Hash -Property FullName

Here are some of the tricks I have used in the final function to speed it up considerably:

Fast File Enumeration

Get-ChildItem can recursively search folders but it is very slow, especially in Windows PowerShell. A much faster file enumeration uses [System.IO.DirectoryInfo] and its method GetFiles().

This approach has one issue, though: if there is just one file that can’t be accessed, the entire method fails. So I used try…catch and a fallback mechanism: should GetFiles() fail, the function uses the slow Get-ChildItem instead.
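
Here is that enumeration pattern on its own (a minimal sketch; c:\windows is just an example path that conveniently tends to trigger the fallback):

# folder to enumerate (c:\windows is just an example):
$Path = 'c:\windows'

$files = & {
  try
  {
    # fast .NET enumeration - fails on the first "Access Denied":
    [IO.DirectoryInfo]::new($Path).GetFiles('*', 'AllDirectories')
  }
  catch
  {
    # fall back to the slower but more forgiving cmdlet:
    Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore
  }
}

$files.Count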

Faster Foreach-Object and Where-Object

As pointed out elsewhere, Foreach-Object and Where-Object are very slow when you are sending a lot of objects through the pipeline. A much faster approach uses direct scriptblock calls.
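
If you want to see the difference yourself, here is a small made-up comparison; the absolute numbers depend on your machine and PowerShell version:

# classic cmdlets:
(Measure-Command {
    1..100000 | ForEach-Object { $_ } | Where-Object { $_ -gt 99990 }
}).TotalMilliseconds

# direct scriptblock calls with process blocks:
(Measure-Command {
    1..100000 | & { process { $_ } } | & { process { if ($_ -gt 99990) { $_ } } }
}).TotalMilliseconds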

Faster Group-Object

As pointed out elsewhere, Group-Object has a bug in Windows PowerShell and performs poorly. A much faster approach - that’s faster even in PowerShell 7 - uses a hashtable and groups objects manually.
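
Here is the manual grouping pattern in isolation, this time grouping files by extension instead of length just to keep the example short:

# group all files in the Windows folder by extension - without Group-Object:
$hash = @{}
foreach($file in Get-ChildItem -Path $env:windir -File -ErrorAction Ignore)
{
    $key = $file.Extension

    # create a new group (generic list) when the key is seen for the first time:
    if ($hash.ContainsKey($key) -eq $false)
    {
        $hash[$key] = [Collections.Generic.List[System.IO.FileInfo]]::new()
    }
    $hash[$key].Add($file)
}

# number of groups, and the members of one group:
$hash.Count
$hash['.exe'].Name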

Faster ArrayList

To collect the elements of a group, I use Generic Lists. They are faster and more efficient than ArrayList objects.
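
A minimal sketch of both approaches side by side:

# strongly typed generic list - Add() returns nothing:
$list = [Collections.Generic.List[System.IO.FileInfo]]::new()
Get-ChildItem -Path $env:windir -File -ErrorAction Ignore |
    ForEach-Object { $list.Add($_) }
$list.Count

# ArrayList works, too, but stores items as [object] and its Add()
# returns the new index, which needs to be discarded:
$arrayList = [Collections.ArrayList]::new()
Get-ChildItem -Path $env:windir -File -ErrorAction Ignore |
    ForEach-Object { $null = $arrayList.Add($_) }
$arrayList.Count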

Faster Progress Bar

Progress bars can be easily implemented using Write-Progress, and they are a good idea for long-running scripts to provide feedback to the user.

Updating the progress bar adds a time penalty, though, so it is not wise to update a progress bar for every file processed. Instead, I am using a counter and the Modulo Operator (%) to update the progress bar only every 20 files.
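
Here is the throttling pattern on its own, with some simulated work standing in for the real hashing:

$total = 1000
$c = 0
foreach($item in 1..$total)
{
    $c++

    # update the progress bar only every 20 items:
    if ($c % 20 -eq 0)
    {
        $percentComplete = $c * 100 / $total
        Write-Progress -Activity 'Processing' -Status "Item $c of $total" -PercentComplete $percentComplete
    }

    # simulate some work:
    Start-Sleep -Milliseconds 2
}
Write-Progress -Activity 'Processing' -Completed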

Efficiently Calculating Number of Files

To be able to show a real progress bar, it is necessary to know the total number of files to be processed. Only then can you calculate how many percent have been processed at any given time.

A pipeline is a streaming construct, so the total number of files to be processed is not known. That’s why I am using an easy trick to collect all files at some point, and pass them on en-bloc:

& {
    end { ,@($input) }
  }

I am calling a scriptblock with just an End Block and making sure I am not using any attributes. This creates a Simple Function which automatically collects all piped input in $input.

$input is not an array but an enumerator. To pass all collected files en-bloc to the next pipeline command, I am wrapping $input in @() (effectively turning the enumerator into an array) and prepending it with a comma. This wraps the array with the files inside another array.

When this nested array is passed to the next pipeline command, the pipeline unwraps the Outer Array only, and the Inner Array is handed over to the next pipeline command in one piece.

This way, the next pipeline command can determine the total size of the array and then use a loop to process the content of the array.
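
You can watch this trick in action with a tiny sketch: the downstream scriptblock receives the complete collection as a single object and can read its Count:

1..5 |
  & {
    # collect all piped input and emit it as ONE array:
    end { ,@($input) }
  } |
  & {
    process
    {
      # $_ is now the complete array, so its total size is known:
      "Received $($_.Count) items in one piece"
    }
  }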

Hashing Large Files Efficiently

Find-PsOneDuplicateFile employs a lot of tricks to limit hash calculation to only those files that are potential duplicates. That’s good because calculating hashes is expensive: the entire file needs to be read, and for large files, e.g. databases or video files, this can take minutes per file, especially across slow network connections.

But why do we need to calculate the hash for the entire file content? When two files have identical size, and a significant part of their content is equal, then it is a fairly safe assumption that both files are identical.

Let’s not be mistaken: if you absolutely positively need to ensure that both files are identical to the last bit, then you must calculate the full hash. There is no way to avoid this. However, for most practical purposes, a partial hash is completely sufficient.

Calculating Partial Hashes

To easily calculate partial hashes, I created Get-PsOneFileHash: it comes with a number of useful parameters:

  • -StartPosition: Offset in bytes where reading the file should start. This way you can skip file headers that tend to be identical.
  • -Length: When the size of a file (in bytes) is greater than this value, a partial hash is calculated. If the file is smaller than this, the full hash is calculated. This way, calculating hashes is always quick, regardless of the total file size. Just make sure the number of bytes you choose provides a meaningful chunk of data. Don’t make it too small. The perfect value depends on the type of data. 100KB seems to be a good value to start experimenting with.
  • -BufferSize: Number of bytes that are read in one chunk. The smaller this value, the more chunks are read. This saves memory. Larger values speed up the reading process. Just make sure this value is not larger than the value you specified in -Length.
  • -AlgorithmName: Choose the hash algorithm to use. SHA1 is the fastest algorithm for file hashing.
function Get-PsOneFileHash
{
    <#
        .SYNOPSIS
        Calculates a unique hash value for file content and strings, and is capable of calculating partial hashes to speed up calculation for large content

        .DESCRIPTION
        Calculates a cryptographic hash for file content and strings to identify identical content. 
        This can take a long time for large files since the entire file content needs to be read.
        In most cases, duplicate files can safely be identified by looking at only part of their content.
        By using parameters -StartPosition and -Length, you can define the partial content that should be used for hash calculation.
        Any file or string exceeding the size specified in -Length plus -StartPosition will be using a partial hash
        unless -Force is specified. This speeds up hash calculation tremendously, especially across the network.
        It is recommended that partial hashes are verified by calculating a full hash once it matters.
        So if indeed two large files share the same hash, you should use -Force to calculate their hash again.
        Even though you need to calculate the hash twice, calculating a partial hash is very fast and makes sure
        you calculate the expensive full hash only for files that have potential duplicates.

        .EXAMPLE
        Get-PsOneFileHash -String "Hello World!" -Algorithm MD5
        Calculates the hash for a string using the MD5 algorithm

        .EXAMPLE
        Get-PSOneFileHash -Path "$home\Documents\largefile.mp4" -StartPosition 1000 -Length 1MB -Algorithm SHA1
        Calculates the hash for the file content. If the file is larger than 1MB+1000, a partial hash is calculated,
        starting at byte position 1000, and using 1MB of data

        .EXAMPLE
        Get-ChildItem -Path $home -Recurse -File -ErrorAction SilentlyContinue | 
            Get-PsOnePartialFileHash -StartPosition 1KB -Length 1MB -BufferSize 1MB -AlgorithmName SHA1 |
            Group-Object -Property Hash, Length | 
            Where-Object Count -gt 1 |
            ForEach-Object {
                $_.Group | Select-Object -Property Length, Hash, Path
            } |
            Out-GridView -Title 'Potential Duplicate Files'
        Takes all files from the user profile and calculates a hash for each. Large files use a partial hash.
        Results are grouped by hash and length. Any group with more than one member contains potential
        duplicates. These are shown in a gridview.

        .LINK
        https://powershell.one
    #>


    [CmdletBinding(DefaultParameterSetName='File')]
    param
    (
        [Parameter(Mandatory,ValueFromPipeline,ValueFromPipelineByPropertyName,ParameterSetName='File',Position=0)]
        [string]
        [Alias('FullName')]
        # path to file with hashable content
        $Path,

        [Parameter(Mandatory,ValueFromPipeline,ParameterSetName='String',Position=0)]
        [string]
        # path to file with hashable content
        $String,

        [int]
        [ValidateRange(0,1TB)]
        # byte position to start hashing
        $StartPosition = 1000,

        [long]
        [ValidateRange(1KB,1TB)]
        # bytes to hash. Larger length increases accuracy of hash.
        # Smaller length increases hash calculation performance
        $Length = 1MB,

        [int]
        # internal buffer size to read chunks
        # a larger buffer increases raw reading speed but slows down
        # overall performance when too many bytes are read and increases
        # memory pressure
        # Ideally, length should be equally dividable by this
        $BufferSize = 32KB,

        [Security.Cryptography.HashAlgorithmName]
        [ValidateSet('MD5','SHA1','SHA256','SHA384','SHA512')]
        # hash algorithm to use. The fastest algorithm is SHA1. MD5 is second best
        # in terms of speed. Slower algorithms provide more secure hashes with a 
        # lesser chance of duplicates with different content
        $AlgorithmName = 'SHA1',

        [Switch]
        # overrides partial hashing and always calculates the full hash
        $Force
    )

    begin
    {
        # what's the minimum size required for partial hashing?
        # (as documented above: anything larger than -Length plus -StartPosition
        # gets a partial hash)
        $minDataLength = $Length + $StartPosition

        # provide a read buffer. This buffer reads the file content in chunks and feeds the
        # chunks to the hash algorithm:
        $buffer = [Byte[]]::new($BufferSize)

        # are we hashing a file or a string?
        $isFile = $PSCmdlet.ParameterSetName -eq 'File'
    }

    
    process
    {
        # prepare the return object:
        $result = [PSCustomObject]@{
            Path = $Path
            Length = 0
            Algorithm = $AlgorithmName
            Hash = ''
            IsPartialHash = $false
            StartPosition = $StartPosition
            HashedContentSize = $Length
        }
        if ($isFile)
        {
            try
            {
                # check whether the file size is greater than the limit we set:
                $file = [IO.FileInfo]$Path
                $result.Length = $file.Length

                # test whether partial hashes should be used:
                $result.IsPartialHash = ($result.Length -gt $minDataLength) -and (-not $Force.IsPresent)
            }
            catch
            {
                throw "Unable to access $Path"
            }
        }
        else
        {
            $result.Length = $String.Length
            $result.IsPartialHash = ($result.Length -gt $minDataLength) -and (-not $Force.IsPresent)
        }
        # initialize the hash algorithm to use
        # I decided to initialize the hash engine for every file to avoid collisions
        # when using transform blocks. I am not sure whether this is really necessary,
        # or whether initializing the hash engine in the begin() block is safe.
        try
        {
            $algorithm = [Security.Cryptography.HashAlgorithm]::Create($algorithmName)
        }
        catch
        {
            throw "Unable to initialize algorithm $AlgorithmName"
        }
        try
        {
            if ($isFile)
            {
                # read the file, and make sure the file isn't changed while we read it:
                $stream = [IO.File]::Open($Path, [IO.FileMode]::Open, [IO.FileAccess]::Read, [IO.FileShare]::Read)

                # is the file larger than the threshold so that a partial hash
                # should be calculated?
                if ($result.IsPartialHash)
                {
                    # keep a counter of the bytes that were read for this file:
                    $bytesToRead = $Length

                    # move to the requested start position inside the file content:
                    $stream.Position = $StartPosition

                    # read the file content in chunks until the requested data is fed into the
                    # hash algorithm
                    while($bytesToRead -gt 0)
                    {
                        # either read the full chunk size, or whatever is left to read the desired
                        # total length:
                        $bytesRead = $stream.Read($buffer, 0, [Math]::Min($bytesToRead, $bufferSize))

                        # we should ALWAYS read at least one byte:
                        if ($bytesRead -gt 0)
                        {
                            # subtract the bytes read from the total number of bytes to read
                            # in order to calculate how many bytes need to be read in the next
                            # iteration of this loop:
                            $bytesToRead -= $bytesRead

                            # if there won't be any more bytes to read, this is the last chunk of data,
                            # so we can finalize hash generation:
                            if ($bytesToRead -eq 0)
                            {
                                $null = $algorithm.TransformFinalBlock($buffer, 0, $bytesRead)
                            }
                            # else, if there are more bytes to follow, simply add them to the hash
                            # algorithm:
                            else
                            {
                                $null = $algorithm.TransformBlock($buffer, 0, $bytesRead, $buffer, 0)
                            }
                        }
                        else
                        {
                            throw 'This should never occur: no bytes read.'
                        }
                    }
                }
                else
                {
                    # either the file was smaller than the buffer size, or -Force was used:
                    # the entire file hash is calculated:
                    $null = $algorithm.ComputeHash($stream)
                }
            }
            else
            {
                if ($result.IsPartialHash)
                {
                    $bytes = [Text.Encoding]::UTF8.GetBytes($String.SubString($StartPosition, $Length))
                }
                else
                {
                    $bytes = [Text.Encoding]::UTF8.GetBytes($String)
                }

                $null = $algorithm.ComputeHash($bytes)
            }

            # the calculated hash is stored in the prepared return object:
            $result.Hash = [BitConverter]::ToString($algorithm.Hash).Replace('-','')

            if (!$result.IsPartialHash)
            {
                $result.StartPosition = 0
                $result.HashedContentSize = $result.Length
            }
        }
        catch
        {
            throw "Unable to calculate partial hash: $_"
        }
        finally
        {
            if ($isFile -and $stream)
            {
                # free stream
                $stream.Close()
                $stream.Dispose()
            }

            # free algorithm and its resources:
            $algorithm.Clear()
            $algorithm.Dispose()
        }
    
        # return result for the file
        return $result
    }
}

Finding Duplicate Files Fast

I adapted Find-PsOneDuplicateFile to use the new Get-PsOneFileHash and take advantage of partial hashes. This speeds up the search for duplicates tremendously when you use a lot of large files (i.e. audio or video files) and/or read files via slow network connections.

Here is a quick example that illustrates how to find duplicate files in record time. It reads a maximum of 100KB per file to calculate the hash, regardless of the actual file size:

$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# search entire Documents folder for duplicates:
$Path = [Environment]::GetFolderPath('MyDocuments')
# use partial hashes for files larger than 100KB
$result = Find-PSOneDuplicateFileFast -Path $Path -MaxFileSize 100KB 
$stopwatch.Stop()

[PSCustomObject]@{
  Seconds = $stopwatch.Elapsed.TotalSeconds
  Count = $result.Count
  }

$result | Out-GridView

IMPORTANT: make sure you ran the function definitions for Get-PsOneFileHash and Find-PsOneDuplicateFileFast before test-driving the example, or better yet: install the latest version of the module PSOneTools:

Install-Module -Name PSOneTools -Scope CurrentUser -MinimumVersion 2.3 -Force

Duplicate files are identified in just a few seconds:

  Seconds Count
  ------- -----
3,4419176   585

You may or may not see a huge time difference compared to Find-PsOneDuplicateFile. This entirely depends on the size of your files. With small files you may not notice any difference. When you check your multimedia archives, the speedup may be a factor of more than 1000x.

And when you look into $result, you find the piles of duplicate files:

$result
E34D9829D63269783502BF18DF71205F9A6C8102:1652224P {SQLite.Interop.dll, SQLite.Interop.dll, SQLite.Interop.dll}
8B6EB135A43397F21B4437ACED7A2F2CECF9FE52:59316    {PSGet.Resource.psd1, PSGet.Resource.psd1}                                  
EA81E6EF80E4719F68B301BA3C405BC9FD582405:235      {ReallySimpleDatabase.psm1, ReallySimpleDatabase.psm1}   

The key to the hashtable is the file hash, a colon, the file size, and optionally a “P” indicating a partial hash. To get to the actual duplicates, use the key:

$result['E34D9829D63269783502BF18DF71205F9A6C8102:1652224P']
Mode          LastWriteTime  Length Name
----          -------------  ------ ----
-a---- 08.06.2019     16:51 1652224 SQLite.Interop.dll
-a---- 08.06.2019     16:51 1652224 SQLite.Interop.dll
-a---- 08.06.2019     16:51 1652224 SQLite.Interop.dll

To get the full paths, use this:

$result['E34D9829D63269783502BF18DF71205F9A6C8102:1652224P'].FullName
C:\Users\tobia\OneDrive\Dokumente\Projekte\ReallySimpleDB\v2.0\ReallySimpleDatabase\Binaries\x64\SQLite.Interop.dll
C:\Users\tobia\OneDrive\Dokumente\Projekte\ReallySimpleDB\work\ReallySimpleDatabase\Binaries\x64\SQLite.Interop.dll
C:\Users\tobia\OneDrive\Dokumente\WindowsPowerShell\Modules\ReallySimpleDatabase\Binaries\x64\SQLite.Interop.dll

Any key ending with “P” is a partial hash. These files are very likely to be duplicates because they are of equal size and share 100KB of identical content; however, they could still differ in other parts.

While this is highly unlikely for audio or video files, it may be different for databases or office documents. If in doubt, you need to calculate a full hash to check the entire file content.
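
If you do need certainty for a particular group, you can verify it manually by calculating full hashes for just those few files. Here is a sketch that uses the partial-hash key from the example above; substitute one of your own keys:

# members of one suspicious group (use one of your own keys here):
$key = 'E34D9829D63269783502BF18DF71205F9A6C8102:1652224P'

# calculate the FULL hash for just these files:
$result[$key] |
    ForEach-Object { Get-FileHash -Path $_.FullName -Algorithm SHA1 } |
    Group-Object -Property Hash

# if all files end up in a single group, they really are duplicates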

The new function is called Find-PsOneDuplicateFileFast, and it exposes new parameters:

  • -MaxFileSize: The size (in bytes) at which partial hashing kicks in: if a file is larger than this value, a partial hash is calculated.
  • -TestPartialHash: When specified, and when there are potentially duplicate files that share the same partial hash, the function calculates the full hash to make sure the files are indeed identical. Use -TestPartialHash with care: you lose a lot of the time savings when you use it. For most use cases, identifying potentially duplicate files based on partial hashes is sufficient. You may want to look into more effective ways to double-check the reported files, e.g. compare their creation times.
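
Here is a hedged example of what using -TestPartialHash could look like; expect it to take considerably longer than the partial-hash-only run:

# let the function verify partial matches itself by calculating
# full hashes for them (takes considerably longer):
$Path = [Environment]::GetFolderPath('MyDocuments')
$result = Find-PSOneDuplicateFileFast -Path $Path -MaxFileSize 100KB -TestPartialHash

# verified groups are re-keyed without the trailing "P":
$result.Keys | Select-Object -First 5

Here is the source code of Find-PsOneDuplicateFileFast: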
function Find-PSOneDuplicateFileFast
{
  <#
      .SYNOPSIS
      Identifies files with duplicate content and uses a partial hash for large files to speed calculation up

      .DESCRIPTION
      Returns a hashtable with the hashes that have at least two files (duplicates). Large files with partial hashes are suffixed with a "P".
      Large files with a partial hash can be falsely positive: they may in fact be different even though the partial hash is the same
      You either need to calculate the full hash for these files to be absolutely sure, or add -TestPartialHash.
      Calculating a full hash for large files may take a very long time though. So you may be better off using other
      strategies to identify duplicate file content, i.e. look at identical creation times, etc.

      .EXAMPLE
      $Path = [Environment]::GetFolderPath('MyDocuments')
      Find-PSOneDuplicateFileFast -Path $Path 
      Find duplicate files in the user documents folder

      .EXAMPLE
      Find-PSOneDuplicateFileFast -Path c:\windows -Filter *.log 
      find log files in the Windows folder with duplicate content

      .LINK
      https://powershell.one
  #>


  param
  (
    # Path of folder to recursively search
    [String]
    [Parameter(Mandatory)]
    $Path,
  
    # Filter to apply. Default is '*' (all Files) 
    [String]
    $Filter = '*',
    
    # when there are multiple files with same partial hash
    # they may still be different. When setting this switch,
    # full hashes are calculated which may take a very long time
    # for large files and/or slow networks
    [switch]
    $TestPartialHash,
    
    # use partial hashes for files larger than this:
    [int64]
    $MaxFileSize = 100KB
  )

  # get a hashtable of all files of size greater than 0
  # grouped by their length
  
  
  # ENUMERATE ALL FILES RECURSIVELY
  # call scriptblocks directly and pipe them together
  # this is by far the fastest way and much faster than
  # using Foreach-Object:
  & { 
    try
    {
      # try and use the fast API way of enumerating files recursively
      # this FAILS whenever there are any "Access Denied" errors
      Write-Progress -Activity 'Acquiring Files' -Status 'Fast Method'
      [IO.DirectoryInfo]::new($Path).GetFiles('*', 'AllDirectories')
    }
    catch
    {
      # use PowerShell's own (slow) way of enumerating files if any error occurs:
      Write-Progress -Activity 'Acquiring Files' -Status 'Falling Back to Slow Method'
      Get-ChildItem -Path $Path -File -Recurse -ErrorAction Ignore
    }
  } | 
  # EXCLUDE EMPTY FILES:
  # use direct process blocks with IF (which is much faster than Where-Object):
  & {
    process
    {
      # if the file has content...
      if ($_.Length -gt 0)
      {
        # let it pass through:
        $_
      }
    }
  } | 
  # GROUP FILES BY LENGTH, AND RETURN ONLY FILES WHERE THERE IS AT LEAST ONE
  # OTHER FILE WITH SAME SIZE
  # use direct scriptblocks with own hashtable (which is much faster than Group-Object)
  & { 
    begin 
    # start with an empty hashtable
    { $hash = @{} } 

    process 
    { 
      # group files by their length
      # (use "length" as hashtable key)
      $file = $_
      $key = $file.Length.toString()
      
      # if we see this key for the first time, create a generic
      # list to hold group items, and store FileInfo objects in this list
      # (specialized generic lists are faster than ArrayList):
      if ($hash.ContainsKey($key) -eq $false) 
      {
        $hash[$key] = [Collections.Generic.List[System.IO.FileInfo]]::new()
      }
      # add file to appropriate hashtable key:
      $hash[$key].Add($file)
    } 
  
    end 
    { 
      # return only the files from groups with at least two files
      # (if there is only one file with a given length, then it 
      # cannot have any duplicates for sure):
      foreach($pile in $hash.Values)
      {
        # are there at least 2 files in this pile?
        if ($pile.Count -gt 1)
        {
          # yes, add it to the candidates
          $pile
        }
      }
    } 
  } | 
  # CALCULATE THE NUMBER OF FILES TO HASH
  # collect all files and hand over en-bloc
  & {
    end { ,@($input) }
  } |
  # GROUP FILES BY HASH, AND RETURN ONLY HASHES THAT HAVE AT LEAST TWO FILES:
  # use a direct scriptblock call with a hashtable (much faster than Group-Object):
  & {
    begin 
    {
      # start with an empty hashtable
      $hash = @{}
      
      # since this is a lengthy procedure, a progress bar is in order
      # keep a counter of processed files:
      $c = 0
    }
      
    process
    {
      $totalNumber = $_.Count
      foreach($file in $_)
      {
      
        # update progress bar
        $c++
      
        # update progress bar every 20 files (and for every very large file):
        if ($c % 20 -eq 0 -or $file.Length -gt 100MB)
        {
          $percentComplete = $c * 100 / $totalNumber
          Write-Progress -Activity 'Hashing File Content' -Status $file.Name -PercentComplete $percentComplete
        }
      
        # use the file hash of this file PLUS file length as a key to the hashtable
        # use the fastest algorithm SHA1, and use partial hashes for files larger than -MaxFileSize:
        $bufferSize = [Math]::Min(100KB, $MaxFileSize)
        $result = Get-PsOneFileHash -StartPosition 1KB -Length $MaxFileSize -BufferSize $bufferSize -AlgorithmName SHA1 -Path $file.FullName
        
        # add a "P" to partial hashes:
        if ($result.IsPartialHash) {
          $partialHash = 'P'
        }
        else
        {
          $partialHash = ''
        }
        
        
        $key = '{0}:{1}{2}' -f $result.Hash, $file.Length, $partialHash
      
        # if we see this key the first time, add a generic list to this key:
        if ($hash.ContainsKey($key) -eq $false)
        {
          $hash.Add($key, [Collections.Generic.List[System.IO.FileInfo]]::new())
        }
      
        # add the file to the appropriate group:
        $hash[$key].Add($file)
      }
    }
      
    end
    {
      # remove all hashtable keys with only one file in them
      
      
      
      # do a detail check on partial hashes
      if ($TestPartialHash)
      {
        # first, CLONE the list of hashtable keys
        # (we cannot remove hashtable keys while enumerating the live
        # keys list):
        $keys = @($hash.Keys).Clone()
        $i = 0
        Foreach($key in $keys)
        {
          $i++
          $percentComplete = $i * 100 / $keys.Count
          if ($hash[$key].Count -gt 1 -and $key.EndsWith('P'))
          {
            foreach($file in $hash[$key])
            {
              Write-Progress -Activity 'Hashing Full File Content' -Status $file.Name -PercentComplete $percentComplete
              $result = Get-FileHash -Path $file.FullName -Algorithm SHA1
              $newkey = '{0}:{1}' -f $result.Hash, $file.Length
              if ($hash.ContainsKey($newkey) -eq $false)
              {
                $hash.Add($newkey, [Collections.Generic.List[System.IO.FileInfo]]::new())
              }
              $hash[$newkey].Add($file)
            }
            $hash.Remove($key)
          }
        }
      }
      
      # enumerate all keys...
      $keys = @($hash.Keys).Clone()
      
      foreach($key in $keys)
      {
        # ...if key has only one file, remove it:
        if ($hash[$key].Count -eq 1)
        {
          $hash.Remove($key)
        }
      }
       
      
       
      # return the hashtable with only duplicate files left:
      $hash
    }
  }
}