Speeding Up Arrays - powershell.one

In code reviews, one common mistake stands out: using += on arrays. Learn how to make your PowerShell scripts 3.5x faster by avoiding +=.

Common Pattern: Filling Arrays With “+=”

It’s amazing just how often code reviews reveal the technique below: a loop processes something, produces some results, and collects them in a “bucket” array using the += operator:

# an empty array is used as a bucket to collect results:
$bucket = @()

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# a loop does something, e.g. scans computers, processes database records,
# examines files, etc.:
1..10000 | Foreach-Object {
	# results are added to the bucket using the "+=" operator
	$bucket += "I am adding $_"
}

# all results end up in the array:
$report = '{0} elements collected in {1:n1} seconds' 
$report -f $bucket.Count, $stopwatch.Elapsed.TotalSeconds

On my system, processing these 10,000 elements took 2.1 seconds.

If the example script above takes much longer on your system, in the range of 15 to 25 seconds, you may want to look at the mysterious Pipeline Problem analyzed last week.

Much Faster: Using ArrayLists

To fix the speed problem, smart-ass scripters (who generally seem to have a solid .NET developer background) replace the default [Object[]] arrays used by PowerShell with an ArrayList and maliciously smile at you while pointing to the script performance:

# an empty ArrayList is used as a bucket to collect results:
$bucket = [System.Collections.ArrayList]@()

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# a loop does something, e.g. scans computers, processes database records,
# examines files, etc.:
1..10000 | Foreach-Object {
	# results are added using the Add() method
	$null = $bucket.Add("I am adding $_")
}

# all results end up in the array:
$report = '{0} elements collected in {1:n1} seconds' 
$report -f $bucket.Count, $stopwatch.Elapsed.TotalSeconds

Instead of 2.1 seconds, this script now takes a mere 0.6 seconds while producing the very same results. That makes it 3.5 times faster with almost no effort.
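
A generic list is a close relative of the ArrayList and can be used in the same way. The sketch below was not part of the measurements above; it assumes that every result is a string and therefore uses a List[string]. Its Add() method returns nothing, so there is no return value to discard with $null:

```powershell
# a typed generic list is used as a bucket (assumption: all results are strings):
$bucket = New-Object 'System.Collections.Generic.List[string]'

1..10000 | ForEach-Object {
    # Add() on a generic list returns nothing, so no "$null =" is needed
    $bucket.Add("I am adding $_")
}

# all results end up in the list:
'{0} elements collected' -f $bucket.Count
```

Whether this is measurably faster than the ArrayList depends on your system, so time it yourself before relying on it.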

Faster and Easier: Leverage PowerShell

You don’t necessarily need to be a .NET developer and know about ArrayLists in the first place. A solid PowerShell background gets you the very same results. The next script performs equally well yet is much shorter and simpler than the previous one and doesn’t require any .NET wizardry:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# let PowerShell handle object creation
$bucket = 1..10000 | Foreach-Object {
	# simply return the result
    # PowerShell wraps all results in an array automatically
	"I am adding $_"
}

# all results end up in the array:
$report = '{0} elements collected in {1:n1} seconds' 
$report -f $bucket.Count, $stopwatch.Elapsed.TotalSeconds

All you need to remember is that PowerShell automatically produces arrays for you when you return more than one element. So instead of handling array creation yourself, leave it to PowerShell.
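
The same mechanism is not tied to Foreach-Object. As a minimal sketch, the output of a foreach statement can be assigned to a variable in the same way (the variable name $number is just an illustration):

```powershell
# assign the output of a foreach statement directly to a variable
$bucket = foreach ($number in 1..10000) {
    # every value the loop body emits becomes one array element
    "I am adding $number"
}

$bucket.Count             # 10000
$bucket.GetType().Name    # Object[]
```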

If you want to make sure that $bucket is always an array, enclose the loop in @(). Otherwise, $bucket would not be an array if the loop yields only one result (or none):

$bucket = @(1..10000 | Foreach-Object {
	# simply return the result
	# @() guarantees an array, even for one or zero results
	"I am adding $_"
})
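
To see the difference that @() makes, here is a small sketch with loops that return exactly one result or none at all. The variable names are made up for this illustration:

```powershell
# without @(): a single result is stored as a plain string
$single = 1..1 | ForEach-Object { "I am adding $_" }
$single -is [array]       # False

# with @(): the single result is wrapped in an array
$wrapped = @(1..1 | ForEach-Object { "I am adding $_" })
$wrapped -is [array]      # True
$wrapped.Count            # 1

# with @(): no results at all still yield an (empty) array
$empty = @(1..3 | Where-Object { $_ -gt 5 })
$empty -is [array]        # True
$empty.Count              # 0
```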

How to Deal With Multiple Arrays?

The tricks above work great if your loop needs to return just one item per iteration. But what if the loop needs to return multiple items? Here is the boil-down of what I often see in code reviews:

# empty arrays are used as buckets to collect results:
$bucket1 = @()
$bucket2 = @()

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# a loop does something, e.g. scans computers, processes database records,
# examines files, etc.:
1..10000 | Foreach-Object {
	# results are added to the buckets using the "+=" operator
	$bucket1 += "I am adding $_"
	$bucket2 += $_*2
}

# all results end up in the array:
$report = '{0} elements collected in {1:n1} seconds' 
$report -f ($bucket1.Count + $bucket2.Count), $stopwatch.Elapsed.TotalSeconds

Execution time roughly doubled to 3.8 seconds. One important insight from this: the more “buckets” you use per loop, the higher the time penalty. It rises linearly with the number of buckets. Now how would you optimize this?

Here, the .NET smart asses indeed have an edge, and using ArrayLists is the best bet:

Using ArrayLists instead

PowerShell automatically adds all return values to one array, so once you need to return more than one array, you can’t easily use this mechanism any longer. What you can always do, though, is use the ArrayList trick and replace the += operator with Add():

# empty ArrayLists are used as buckets to collect results:
$bucket1 = [System.Collections.ArrayList]@()
$bucket2 = [System.Collections.ArrayList]@()

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# a loop does something, e.g. scans computers, processes database records,
# examines files, etc.:
1..10000 | Foreach-Object {
	# use Add() instead of "+=" to add elements to the "buckets":
	$null = $bucket1.Add("I am adding $_")
	$null = $bucket2.Add($_*2)
}

# analyze results:
$report = '{0} elements collected in {1:n1} seconds' 
$report -f ($bucket1.Count + $bucket2.Count), $stopwatch.Elapsed.TotalSeconds

Surprisingly, execution time now is just 0.6 seconds - even though we add elements to two ArrayLists instead of one now per iteration!

Using two “buckets” costs you the same time as using one, or put differently: using += on arrays comes at a significant cost per use, whereas invoking Add() costs you almost nothing in terms of speed.

Using Pure PowerShell

Of course, you could also solve this with pure PowerShell, at least if you produce the same number of items per iteration:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# a loop does something, e.g. scans computers, processes database records,
# examines files, etc.:
$bucket = 1..10000 | Foreach-Object {
	# return custom objects with different properties:
	[PSCustomObject]@{
		InfoText = "I am adding $_"
		Number = $_*2
	}
}

# all results end up in the array:
$report = '{0} elements collected in {1:n1} seconds' 
$report -f $bucket.Count, $stopwatch.Elapsed.TotalSeconds

This script takes just 0.6 seconds, like the one before. There is now only one array, and it contains objects that combine all the information you want to return:

PS> $bucket[0..3]

InfoText      Number
--------      ------
I am adding 1      2
I am adding 2      4
I am adding 3      6
I am adding 4      8
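
If later code still expects two separate lists, they can be derived from the single array afterwards. The following sketch assumes PowerShell 3.0 or later, where reading a property on an array returns that property from every element; $texts and $numbers are illustrative names:

```powershell
$bucket = 1..10000 | ForEach-Object {
    [PSCustomObject]@{
        InfoText = "I am adding $_"
        Number   = $_ * 2
    }
}

# reading a property on the array returns that property from every element
$texts   = $bucket.InfoText
$numbers = $bucket.Number

$texts[0]         # I am adding 1
$numbers[0]       # 2
$numbers.Count    # 10000
```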

Why “+=” Is Evil

When you look at the scripts, it becomes evident that the slowness is created by the += operator. It makes something appear easy that really isn’t easy at all: extending an array.

Default PowerShell arrays are always of fixed size, so to add a new element, the += operator really needs to create a new array with one more element and copy all the old elements from the old array into the new one, over and over again, once per iteration.
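
You can watch this happen with a tiny array. In this sketch (variable names are made up), a second variable keeps a reference to the original array, and after += the two variables no longer point to the same object:

```powershell
$original = 1, 2, 3
$copy     = $original       # second reference to the very same array

$original.IsFixedSize       # True

$original += 4              # creates a new array and stores it in $original

$original.Count             # 4
$copy.Count                 # 3 (the old array is unchanged)
[object]::ReferenceEquals($original, $copy)    # False
```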

Of course that takes a lot of hard work and slows down your script. So with arrays, make sure you avoid += if you can.

Both the ArrayList and PowerShell’s built-in array creation can append elements without having to re-create the whole collection for every addition. That’s why they are both much faster, and why they perform equally well.
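
For the ArrayList, this can be made visible through its Capacity property: the list reserves spare room internally and only re-allocates its storage when that room is used up. The sketch below counts how often this happens; $resizes and $lastCapacity are names chosen for this illustration, and the exact count may differ between .NET versions:

```powershell
$bucket       = [System.Collections.ArrayList]@()
$resizes      = 0
$lastCapacity = $bucket.Capacity

foreach ($number in 1..10000) {
    $null = $bucket.Add("I am adding $number")

    # Capacity only changes when the internal storage had to be re-allocated
    if ($bucket.Capacity -ne $lastCapacity) {
        $resizes++
        $lastCapacity = $bucket.Capacity
    }
}

'{0} elements added, internal storage re-allocated {1} times' -f $bucket.Count, $resizes
```

The number of re-allocations should be a tiny fraction of the 10,000 additions, whereas += on a plain array creates a new array every single time.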

The += operator isn’t always evil: when you add numeric values, it is perfectly OK to use: $a = 1; $a += 100. The moment you apply += to arrayish objects, you are hit by a performance penalty. That applies to strings as well. Using $text = 'Hello'; $text += ' World!' performs badly because a string behaves like a fixed-size character array: strings are immutable in .NET, so every concatenation creates a new string and copies the old characters into it. That’s why we look at optimizing extensive string concatenation in our next part.
