Speeding Up Group-Object

There is a design flaw in Group-Object. With a workaround, your scripts can be up to 50x faster and still 2x faster on PowerShell Core.

Group-Object groups objects based on a property you choose. Because of a design flaw, this can be extremely slow in Windows PowerShell, and still slower than necessary in PowerShell 7.

In this article, I’ll look at what Group-Object is typically used for, and how you can code around it to improve performance. This is especially important on Windows PowerShell, where a design flaw can hit you hard: a script that took 43 sec. now takes under 0.8 sec.

Even on PowerShell 7, replacing Group-Object with a few lines of code can cut execution time in half.

The Problem

Let’s make one thing clear first: Group-Object is an awesome and versatile cmdlet that can do a lot of crazy stuff. It also performs much better on PowerShell 7. Whether you are OK with its performance depends on your use case. In my use case, I needed to do something about it.

I came across the Group-Object issue while trying to identify duplicate files. The only safe way to identify duplicate file content is to calculate its File Hash (which is hugely expensive). So to speed things up, I thought it would be clever to apply the only other safe assumption first: if two files are identical, then their size must be equal. Determining file size is easy and fast and a great way to limit the number of files that require a file hash calculation.
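The two-stage idea (cheap size check first, expensive hash only for the remaining candidates) can be sketched as follows. This is a minimal illustration, not the script I actually used: the function name Find-DuplicateFile is made up, SHA256 is an arbitrary choice, and the sketch uses plain Group-Object for both stages because that is the most straightforward way to write it.

```powershell
function Find-DuplicateFile   # hypothetical name, for illustration only
{
  param
  (
    [Parameter(Mandatory)]
    [string]
    $Path
  )

  Get-ChildItem -Path $Path -File -Recurse -ErrorAction SilentlyContinue |
    # stage 1 (cheap): a file with a unique size cannot have a duplicate
    Group-Object -Property Length |
    Where-Object Count -gt 1 |
    ForEach-Object {
      # stage 2 (expensive): hash only files that share their size
      $_.Group |
        ForEach-Object {
          Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256 -ErrorAction SilentlyContinue
        } |
        Group-Object -Property Hash |
        Where-Object Count -gt 1 |
        ForEach-Object {
          # one result per set of files with identical content
          [PSCustomObject]@{
            Hash  = $_.Name
            Files = $_.Group.Path
          }
        }
    }
}
```

Calling Find-DuplicateFile -Path $home would return one object per set of identical files, with the paths in the Files property.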

Group-Object is Paralyzing Scripts

To pre-filter the files, I grouped all files by length and excluded all groups with only one file in them - since there wouldn’t be any other file with the same length, these files could not have any duplicates for sure.

To my astonishment, the script took even longer with pre-filtering than without. It turned out that Group-Object was what slowed things down:

# reading all files into a variable
# this is only done to isolate the time "Group-Object" needs
# for the grouping. In production, "Get-ChildItem" would of course
# pipe directly into "Group-Object":
$AllFiles = Get-ChildItem -Path $home -File -Recurse -Force -ErrorAction SilentlyContinue

$stopwatch = [Diagnostics.StopWatch]::StartNew()

# get all files from the profile folder...
$suspicious = $AllFiles |
  #...group files by length
  Group-Object -Property Length |
  # exclude any group with just one file in them
  Where-Object Count -gt 1
$stopwatch.Stop()

$template = '{0:n0} files found out of {1:n0} ({2:n1} sec.)' 
$template -f $suspicious.Count, $AllFiles.Count, $stopwatch.Elapsed.TotalSeconds

I conducted the test multiple times, and on my Windows PowerShell test system, grouping consistently took more than 40 seconds:

6,893 files found out of 81,251 (43.4 sec.)
6,893 files found out of 81,251 (41.5 sec.)
6,893 files found out of 81,358 (51.6 sec.)

These days, it is wise to repeat any performance test in PowerShell 7 since any improvements made in the past three years surface there. As it turns out, the design flaw in Group-Object was indeed fixed in PowerShell 7:

6,894 files found out of 81,267 (1.3 sec.)
6,894 files found out of 81,267 (1.7 sec.)
6,894 files found out of 81,267 (1.4 sec.)

Well, sort of. 1.3 seconds is still a long time. At the end of this article, the code will run twice as fast on PowerShell 7.

What’s Wrong with Group-Object

The original Group-Object (still used in Windows PowerShell 5.1) suffers from a design flaw that makes execution time grow quadratically with the number of objects. It goes unnoticed when you group only a few hundred objects, but it can quickly cost you minutes or crash PowerShell altogether when more objects are grouped.

That’s the kind of problem where things seem perfect during design time, and crash once you deploy your script to production.
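If you want to check how grouping time scales on your own machine, a rough measurement like the one below can help. It is only a sketch: the helper name Measure-Grouping, the object counts, and the random test data are my assumptions, and absolute numbers will differ from system to system. On Windows PowerShell, the time should roughly quadruple each time the object count doubles; on PowerShell 7, it should grow far more gently.

```powershell
# hypothetical helper: time Group-Object for a given number of objects
function Measure-Grouping
{
  param
  (
    [int]
    $Count
  )

  # random integers, so most groups stay small (similar to file sizes)
  $random = [Random]::new(42)
  $data = foreach ($i in 1..$Count)
  {
    [PSCustomObject]@{ Length = $random.Next(0, $Count) }
  }

  # return the elapsed time of the grouping step only
  (Measure-Command { $data | Group-Object -Property Length }).TotalSeconds
}

# double the number of objects each round and compare the timings
foreach ($n in 5000, 10000, 20000)
{
  '{0,7:n0} objects: {1:n2} sec.' -f $n, (Measure-Grouping -Count $n)
}
```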

I’ll spare you the insane details of the design flaw here. At psconf.eu 2019, Staffan Gustafsson and I had a hot night of debate about it, and if you want to learn more about the details, watch his session.

I believe it was Staffan, too, who worked on the fix in PowerShell 7.

Data Analysis

Group-Object is typically used for two scenarios:

  • Data Analysis: you want to find out how many objects of a given kind you have
  • Data Separation: you want to find out which objects of a given kind you have

I’ll take a look at both scenarios, show you examples, and then how you can get the very same results without Group-Object in drastically less time.

Let’s first look at how Group-Object is used for Data Analysis, and how you can make this run so much faster.

Identifying Common System Errors

A simple example for Data Analysis would be to find out what your most frequent sources of errors are:

$today = Get-Date
$lastWeek = $today.AddDays(-7)

Get-EventLog -LogName System -EntryType Error -After $lastWeek |
  Group-Object -Property Source -NoElement |
  Sort-Object -Property Count -Descending

Analyzing Folder Content

Or what the types of files are that you have in a folder:

Get-ChildItem -Path $env:windir -File |
  Group-Object -Property Extension -NoElement |
  Sort-Object -Property Count -Descending

Common Principles

In essence, you take a flat series of data, and based on one or more properties, you pile them. For this, Group-Object uses the parameter -NoElement: just return the size of the pile, not the pile itself.

Faster Group-Object for Data Analysis

To speed up this type of processing, use a hashtable internally to build the groups, then emit the number of collected objects per pile. Group-ObjectCount provides this kind of result:

function Group-ObjectCount
{
  param
  (
    [string[]]
    $Property,

    [switch]
    $NoElement
  )

  begin
  {
    # create an empty hashtable
    $hashtable = @{}
  }


  process
  {
    # create a key based on the submitted properties, and turn
    # it into a string
    $key = $(foreach($prop in $Property) { $_.$prop }) -join ','
        
    # check to see if the key is present already
    if ($hashtable.ContainsKey($key) -eq $false)
    {
      # add an empty array list 
      $hashtable[$key] = [Collections.Arraylist]@()
    }

    # add element to appropriate array list:
    $null = $hashtable[$key].Add($_)
  }

  end
  {
    # for each key in the hashtable, 
    foreach($key in $hashtable.Keys)
    {
      if ($NoElement)
      {
        # return one object with the key name and the number
        # of elements collected by it:
        [PSCustomObject]@{
          Count = $hashtable[$key].Count
          Name = $key
        }
      }
      else
      {
        # include the content
        [PSCustomObject]@{
          Count = $hashtable[$key].Count
          Name = $key
          Group = $hashtable[$key]
        }
      }
    }
  }
}

Replace Group-Object with Group-ObjectCount in any of the examples above, and you should see the very same results. Since the examples only group a few items, there shouldn’t be much of a performance difference.
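The key line in the function deserves a closer look, because it is what allows grouping by more than one property. The sketch below isolates it in a small helper (Get-GroupKey is a made-up name, as are the sample objects). It also shows one limitation of joining values with a comma: if the property values themselves contain commas, different objects can end up with the same key.

```powershell
# hypothetical helper that builds the same kind of key as Group-ObjectCount
function Get-GroupKey
{
  param
  (
    $InputObject,

    [string[]]
    $Property
  )

  # read each requested property and join the values into one string
  $(foreach($prop in $Property) { $InputObject.$prop }) -join ','
}

$file = [PSCustomObject]@{ Extension = '.log'; IsReadOnly = $false }
Get-GroupKey -InputObject $file -Property Extension, IsReadOnly   # .log,False

# caveat: values containing commas can collide
$a = [PSCustomObject]@{ First = 'a,b'; Second = 'c' }
$b = [PSCustomObject]@{ First = 'a'; Second = 'b,c' }
Get-GroupKey -InputObject $a -Property First, Second   # a,b,c
Get-GroupKey -InputObject $b -Property First, Second   # a,b,c
```

For properties like Length or Extension this is not an issue. If your data can contain commas, picking a separator that cannot occur in the values is one way around it.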

Performance Benefits

However, when you replace Group-Object with Group-ObjectCount in my initial example where I was trying to prefilter files, there is an astonishing difference:

# with Group-Object:
6,914 files found out of 81,364 (41.3 sec.)
# with Group-ObjectCount:
6,914 files found out of 81,367 (0.8 sec.)
6,914 files found out of 81,367 (0.8 sec.)
6,914 files found out of 81,367 (0.8 sec.)

The script now takes 0.8 seconds instead of 41.3 seconds: that is more than 50x faster. When I tested in PowerShell 7 where Group-Object is fixed, Group-ObjectCount was still twice as fast.
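If all you ever need is the count per group, you can limit the problem even further and store only a number per key instead of collecting the objects. The following is a sketch under that assumption; Get-GroupCount is a made-up name, and I have not benchmarked it against the functions above. Note that it cannot replace Group-ObjectCount when you need the group content later.

```powershell
function Get-GroupCount   # hypothetical name, counts only
{
  param
  (
    [Parameter(Mandatory)]
    [string[]]
    $Property,

    [Parameter(ValueFromPipeline)]
    $InputObject
  )

  begin
  {
    # one counter per key
    $counts = @{}
  }

  process
  {
    $key = $(foreach($prop in $Property) { $InputObject.$prop }) -join ','

    # a missing key yields $null, and $null + 1 evaluates to 1
    $counts[$key] = $counts[$key] + 1
  }

  end
  {
    foreach($key in $counts.Keys)
    {
      [PSCustomObject]@{
        Count = $counts[$key]
        Name = $key
      }
    }
  }
}

# example: count files per extension
Get-ChildItem -Path $home -File -ErrorAction SilentlyContinue |
  Get-GroupCount -Property Extension |
  Sort-Object -Property Count -Descending
```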

Data Separation

A second use case for Group-Object is Data Separation: Group-Object can return a hashtable so you can access the grouped objects by key. This, too, can run much faster as you’ll see at the end.

Grouping Services

A simple example of data separation would be to separate running services from stopped services:

$allServices = Get-Service |
  Group-Object -Property Status -AsHashTable -AsString

# return all running services
$allServices.Running | Out-GridView -Title 'Running'
# alternate syntax, return all stopped services:
$allServices['Stopped'] | Out-GridView -Title 'Stopped'

Grouping Results from PowerShell Remoting

You can also separate the data returned from multiple servers by Invoke-Command:

$serverList = 'dc1','fs12','fs13','db4'

$hotfixes = Invoke-Command { Get-HotFix } -ComputerName $serverList |
  # group results by PSComputerName (which was automatically
  # added by Invoke-Command) and return a hashtable
  Group-Object -Property PSComputerName -AsHashTable -AsString

# access data per server 
$hotfixes['dc1']
$hotfixes['fs12']

# output hotfixes per server in a separate gridview:
$hotfixes.Keys | ForEach-Object {
    $hotfixes[$_] | Out-GridView -Title $_
}

Common Principles

In essence, for data separation you return the internal hashtable altogether that Group-Object used for grouping. For this, Group-Object uses the parameter -AsHashTable in conjunction with -AsString.

Always use -AsString when you use -AsHashTable to ensure that the hashtable key is in fact a string. Without -AsString, Group-Object uses typed keys, and the type of key depends on what was found in the property of the object that you are grouping.

If the grouping property contains a string anyway (as with PSComputerName in the example above where remoting results were separated), -AsString isn’t important.

If, however, the grouping property contains something other than a string (as with Status in the example above, where services were separated based on their Status), -AsString is absolutely crucial. Try it yourself, and remove -AsString from the example: it will no longer return any data.

Without -AsString, you would have to use the typed key on the returned hashtable:

$allServices[[System.ServiceProcess.ServiceControllerStatus]::Running]
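When you are unsure which kind of key a hashtable from Group-Object uses, you can simply inspect the keys. The sketch below uses made-up sample objects with a [DayOfWeek] property instead of services, so it runs anywhere. It prints the key types with and without -AsString; I am only stating the result for the -AsString case, where the keys are strings.

```powershell
# sample objects with a non-string (enum) property
$items = 'Monday', 'Monday', 'Friday' | ForEach-Object {
  [PSCustomObject]@{ Day = [DayOfWeek]$_ }
}

# without -AsString: check what type the keys actually have
$typed = $items | Group-Object -Property Day -AsHashTable
$typed.Keys | ForEach-Object { $_.GetType().FullName }

# with -AsString: keys are plain strings...
$byString = $items | Group-Object -Property Day -AsHashTable -AsString
$byString.Keys | ForEach-Object { $_.GetType().FullName }   # System.String

# ...so a simple string lookup works
$byString['Monday'].Count   # 2
```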

Faster Group-Object for Data Separation

Surprisingly, creating a faster Group-Object for Data Separation is even easier than for Data Analysis. Internally, Group-Object uses a mechanism like a hashtable to create the groups, and for Data Separation, this very hashtable is returned. No additional processing.

So Group-ObjectHashtable works just like Group-ObjectCount but skips the last part and returns its hashtable as-is:

function Group-ObjectHashtable
{
  param
  (
    [string[]]
    $Property
  )

  begin
  {
    # create an empty hashtable
    $hashtable = @{}
  }


  process
  {
    # create a key based on the submitted properties, and turn
    # it into a string
    $key = $(foreach($prop in $Property) { $_.$prop }) -join ','
        
    # check to see if the key is present already
    if ($hashtable.ContainsKey($key) -eq $false)
    {
      # add an empty array list 
      $hashtable[$key] = [Collections.Arraylist]@()
    }

    # add element to appropriate array list:
    $null = $hashtable[$key].Add($_)
  }

  end
  {
    # return the entire hashtable:
    $hashtable
  }
}

Note that Group-ObjectHashtable is not a universal replacement for Group-Object: I did not bother to implement -AsString and -AsHashtable. Instead, Group-ObjectHashtable always uses string keys and always returns a hashtable. If you really want a complete replacement for Group-Object, it is not too hard to blend both parts into one.

Rewriting Group-Object would miss the point of this article, though. Aside from the obvious bug that existed in Windows PowerShell, Group-Object is designed to be versatile and easy-to-use for a wide range of use cases.

If you implemented all of its features, even the edgy ones, your own Group-Object implementation would probably run just as fast (or slow) as the original.

Instead, this article encourages you to consider specific solutions that target exactly what you need, and by “limiting the problem space” gain considerable extra performance.
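For completeness, blending the two functions into one could look roughly like the sketch below. Group-ObjectFast is a made-up name, and the function is deliberately limited: it always uses string keys (as if -AsString were set) and supports only -NoElement and -AsHashTable, nothing else from the real Group-Object.

```powershell
function Group-ObjectFast   # hypothetical name, limited feature set
{
  param
  (
    [string[]]
    $Property,

    [switch]
    $NoElement,

    [switch]
    $AsHashTable
  )

  begin
  {
    $hashtable = @{}
  }

  process
  {
    # same string key as in the two functions above
    $key = $(foreach($prop in $Property) { $_.$prop }) -join ','

    if (-not $hashtable.ContainsKey($key))
    {
      $hashtable[$key] = [Collections.ArrayList]@()
    }

    $null = $hashtable[$key].Add($_)
  }

  end
  {
    # data separation: hand out the hashtable as-is
    if ($AsHashTable) { return $hashtable }

    # data analysis: one object per group
    foreach($key in $hashtable.Keys)
    {
      $group = [ordered]@{
        Count = $hashtable[$key].Count
        Name = $key
      }
      if (-not $NoElement) { $group.Group = $hashtable[$key] }
      [PSCustomObject]$group
    }
  }
}
```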

Performance Benefits

Let’s check performance again with a lot of objects to group: the code below takes all files from your user profile and produces a hashtable keyed by file extension:

$AllFiles = Get-ChildItem -Path $home -File -Recurse -Force -ErrorAction SilentlyContinue

$stopwatch = [Diagnostics.StopWatch]::StartNew()
$result = $AllFiles | Group-Object -Property Extension -AsHashTable -AsString
$stopwatch.Stop()

$template = '{0:n0} extensions ({1:n2} sec.)' 
$template -f $result.Count, $stopwatch.Elapsed.TotalSeconds

This script took 0.68 sec on my Windows PowerShell test machine. By replacing Group-Object with Group-ObjectHashtable (and removing the parameters -AsHashTable and -AsString), this was cut down to 0.17 sec.

Custom Implementation

If you plan to do something especially expensive - like my plan to find all files that have at least one other file of equal size - it can pay off to write a custom implementation. Since the script processes a lot of data, any speed improvement is very welcome.

So here is my final implementation:

$stopwatch = [Diagnostics.StopWatch]::StartNew()

# get all files from the profile folder...
$suspicious = $AllFiles |
  #...group files by length
  Group-ObjectHashtable -Property Length |
  # exclude any group with just one file in them
  ForEach-Object { 
        # workaround for: | Select-Object -expandProperty Keys
        # which seems to be broken in PS7-preview5

        # create a COPY of all keys. That's important because
        # I am going to CHANGE the hashtable in a loop next:
        $keys = $_ | & { $_.Keys }

        # examine all keys...
        foreach($key in $keys)
        {
            # ...and if the key contains just one file,
            # remove the key:
            if ($_[$key].Count -eq 1)
            {
                $null = $_.Remove($key)
            }
        }
        $_ 
    }
$stopwatch.Stop()

$template = 'Filter {0:n0} files found out of {1:n0} ({2:n1} sec.)' 
$template -f $suspicious.Count, $AllFiles.Count, $stopwatch.Elapsed.TotalSeconds

And these are the results:

Approach                                                           Windows PowerShell   PowerShell 7
Group-Object                                                       64.9 sec             1.6 sec
Group-ObjectCount                                                  1.2 sec              1.1 sec
Group-ObjectHashtable, then remove keys with filecount=1 manually  0.5 sec              0.5 sec

In the end, a script initially taking more than a minute on Windows PowerShell now runs in a half-second, 130x faster than before, and still more than 3x faster in PowerShell 7.
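The one subtle part of the final implementation is that keys are removed from the hashtable inside a loop over its keys. Changing a hashtable while enumerating it directly can fail with a "collection was modified" error, which is why the loop runs over a copy of the keys. The sketch below shows the same idea on a tiny made-up hashtable, using @(...) to create the copy; that is an alternative I am suggesting here, not what the script above does.

```powershell
# made-up sample data: key 'b' holds just one element
$table = @{
  a = 1, 2
  b = ,3
  c = 4, 5, 6
}

# @(...) copies the keys into a new array, so the hashtable
# can safely be changed inside the loop
foreach ($key in @($table.Keys))
{
  if ($table[$key].Count -eq 1)
  {
    $table.Remove($key)
  }
}

$table.Keys | Sort-Object   # a, c
```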

Conclusion

A bug in Group-Object in Windows PowerShell makes grouping time grow quadratically: when you group double the number of items, execution time quadruples. Grouping a large number of objects can therefore waste many minutes or even crash your script.

In PowerShell 7, this bug is fixed.

  • If you need to group only a few dozen objects, keep using Group-Object.
  • With larger numbers of objects, on Windows PowerShell you should always avoid Group-Object and instead use one of the methods described above.
  • On PowerShell 7, you can safely use Group-Object with any number of objects. There is a speed penalty but it is an acceptable price you pay for a versatile and easy-to-use command.
  • If your script is processing a very large number of objects and/or if you would like to enjoy maximum performance, use the methods above to tailor grouping exactly to your needs. It isn’t rocket science at all, and often - even on PowerShell 7 with the bug-free Group-Object - you can cut execution time in half.