Speeding Up String Manipulation

Appending text to strings using “+=” is convenient but slow. Learn how to do string manipulation without slowing down PowerShell.

There is nothing wrong with occasionally appending strings like this:

$text = "Some Text"
$text += " and some more text."

If you do this often, though, for example within a loop, you should reconsider. Strings in .NET are immutable, so every += has to build a brand-new string, and the same speed penalties apply that you experience when you append to arrays with +=.

By using a StringBuilder instead, you can considerably speed up your string manipulations, and when you apply all tricks, a script that took 68.5 seconds will run in just 0.13 seconds - more than 500x faster!

This article includes an unsolved mystery at the end. I don’t (yet) understand what is happening here. You are cordially invited to join the brainstorming and leave a comment at the end of this article if you have a theory.

Why Using “+=” With Strings Can Be Evil

Each time you use += on strings, PowerShell cannot extend the existing string in place. Behind the scenes, a new string is allocated with enough space to hold the old text plus the addition, the entire original string is copied into it, and then the new text is added. The old string is left behind for the garbage collector.
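To get a feeling for how much copying this adds up to, here is a rough back-of-the-envelope sketch. It assumes a simplified cost model in which every += copies the complete new string once; the real runtime behavior is more involved, and the variable names $length and $copied are only chosen for this illustration.

```powershell
# Simplified cost model (an assumption): each += copies the old text
# plus the newly appended piece into a freshly allocated string.
$length = 0          # current length of the growing string
[long]$copied = 0    # total number of characters copied so far

foreach ($i in 1..100000) {
    $piece = "working on $i`r`n"
    $length += $piece.Length
    $copied += $length   # the new string holds old text + new piece
}

'final length: {0:n0} characters' -f $length
'characters copied in total: {0:n0}' -f $copied
```

Under this model the total number of copied characters ends up in the tens of billions, although the final string is less than two million characters long. The work grows roughly with the square of the number of appends.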

What is perfectly ok for an occasional string concatenation becomes a major speed bump when done frequently. See for yourself: the code below concatenates a string in a loop that iterates 100,000 times:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

$text = ""

# appending a string often:
1..100000 | Foreach-Object {
	$text += "working on $_`r`n"
}

# check results:
$report = 'composed string of length {0} in {1:n1} seconds' 
$report -f $text.Length, $stopwatch.Elapsed.TotalSeconds

On my test system, a string of 1,788,895 characters was composed in 68.5 seconds.

Admittedly, you seldom manipulate a string 100,000 times in your scripts. Then again, many scripts use strings to build log file text, or write data from database records to strings. So in reality, whenever you use += on strings in a loop, chances are that you can speed up your script considerably.

Using a StringBuilder to Replace “+=”

A StringBuilder is a specialized object especially designed to compose and manipulate strings. It has explicit methods like AppendLine() to add new lines, and also sophisticated methods like Insert() and Remove() that help you insert and remove text at given offsets.
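Here is a minimal sketch of these methods in action. The sample text and the variable names $sb and $result are made up for illustration only.

```powershell
# create an empty StringBuilder (New-Object also works on older PowerShell versions)
$sb = New-Object System.Text.StringBuilder

$null = $sb.Append('Hello World')   # add text without a line break
$null = $sb.Insert(5, ',')          # insert at offset 5 -> "Hello, World"
$null = $sb.Remove(0, 7)            # remove 7 characters from offset 0 -> "World"
$null = $sb.AppendLine('!')         # add text followed by a line break

# convert to a regular string when done
$result = $sb.ToString()
$result
```

All changes happen inside the StringBuilder's internal buffer. A regular string is only created once, when ToString() is called.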

To speed up the above script, all you need to do is replace the string with a StringBuilder, then replace the operator += with its AppendLine() method.

Note the use of AppendLine(): this method automatically appends a new line at the end of the string, so we were also able to remove the control characters at the end of the appended text. If you want to append text without a line break, use Append() instead.

Both methods return the StringBuilder itself (which allows you to chain calls), so make sure you send the result to $null to keep it out of your output:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# create a StringBuilder
$text = [System.Text.StringBuilder]""

# appending a string often:
1..100000 | Foreach-Object {
    # replace operator += with AppendLine():
	$null = $text.AppendLine("working on $_")
}

# when done, convert StringBuilder back to a string:
$stringText = $text.ToString()

# check results:
$report = 'composed string of length {0} in {1:n1} seconds' 
$report -f $stringText.Length, $stopwatch.Elapsed.TotalSeconds

To get back the final text from the StringBuilder, call its ToString() method. This way, the exact same string was composed - in just 6.1 seconds instead of 68.5 seconds, almost 12 x faster.
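Why the assignment to $null in the script above? Append() and AppendLine() hand back the StringBuilder instance they were called on. The following sketch illustrates this; the variable names $sb, $returned and $sameObject are chosen just for this example.

```powershell
$sb = New-Object System.Text.StringBuilder

# Append() returns the very same StringBuilder instance it was called on
$returned = $sb.Append('a')
$sameObject = [object]::ReferenceEquals($returned, $sb)
$sameObject

# because of that, calls can be chained; $null keeps the returned
# object out of the output stream
$null = $sb.Append('b').Append('c').AppendLine()

# casting to [void] is another way to discard the return value
[void]$sb.Append('d')

$sb.ToString()
```

If you forget to discard the return value inside a loop, every iteration writes the StringBuilder to the output stream, which clutters your results and costs additional time.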

Pipeline Mystery: 500x Speed Increase

Recently, I introduced the Pipeline Trick which can speed up Foreach-Object tremendously. So to identify how much the pipeline overhead affects overall measurements, I applied the Pipeline Trick to the original script:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

$text = ""

# appending a string often:
1..100000 | . { process {
	$text += "working on $_`r`n"
}}

# check results:
$report = 'composed string of length {0} in {1:n1} seconds' 
$report -f $text.Length, $stopwatch.Elapsed.TotalSeconds

Note that you need to dot-source the scriptblock. If you use the call operator & instead, the scriptblock runs in its own child scope, and += would create a new local $text variable there, leaving the original $text untouched.
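The difference between the two operators is easy to verify with a tiny test. This is just an illustrative sketch; $afterCall and $afterDot are hypothetical helper variables.

```powershell
$text = ''

# call operator: the scriptblock runs in a child scope,
# so += assigns to a new local variable inside that scope
1..3 | & { process { $text += "$_ " } }
$afterCall = $text    # still an empty string

# dot-sourcing: the scriptblock runs in the current scope,
# so += changes the original variable
1..3 | . { process { $text += "$_ " } }
$afterDot = $text     # now contains '1 2 3 '

"after & : '$afterCall'"
"after . : '$afterDot'"
```

A StringBuilder is not affected by this: calling a method like AppendLine() changes the object itself and never assigns a new value to the variable, so it works with either operator.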

The script took almost as long as the original script: 64.6 seconds. My initial conclusion was that the pipeline overhead seems to be negligible. Boy, was I wrong!

Combining Tricks Fires the Booster

Just to make sure, I then applied both tricks to the original script: replacing += with a StringBuilder, and replacing Foreach-Object with a direct scriptblock call. And now I was in for a big surprise:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# create a StringBuilder
$text = [System.Text.StringBuilder]""

# appending a string often 
# (using direct scriptblock calls)
1..100000 | & { process
    {
        # replace operator += with AppendLine():
	    $null = $text.AppendLine("working on $_")
    }
}

# when done, convert StringBuilder back to a string:
$stringText = $text.ToString()

# check results:
$report = 'composed string of length {0} in {1:n1} seconds' 
$report -f $stringText.Length, $stopwatch.Elapsed.TotalSeconds

Bam, it felt like a rocket booster: the script produced the same results in an insane 0.13 seconds. Remember how we started with 68.5 seconds? By adding a bit of cleverness to the code, I managed to make the script run more than 500x faster.

Something Must Be Wrong

It is late while I write this, and maybe I am overlooking a typo or something obvious. I simply have no explanation why using += slows down the code, regardless of whether I use Foreach-Object, foreach or direct ScriptBlock invocation. If I am not overlooking something obvious and the numbers are correct, then I guess we are on the track of another issue with PowerShell that when addressed might lead to better performance in the future.

It is not the Pipeline!

The issue I encountered with script #3 is not related to the PowerShell Pipeline. I completely removed the pipeline and used a foreach loop instead:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# start with an empty string
$text = ""

# appending a string often 
# using foreach and avoiding the pipeline
foreach($_ in (1..100000))
{
    # append with operator += (no StringBuilder):
	$text += "working on $_`r`n"
}

# check results:
$report = 'composed string of length {0} in {1:n2} seconds' 
$report -f $text.Length, $stopwatch.Elapsed.TotalSeconds

This script takes 66.9 seconds to run, so the same time penalty hits. Only when I replace += with a StringBuilder does the script show maximum performance. It then takes just 0.13 seconds:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# create a StringBuilder
$text = [System.Text.StringBuilder]""

# appending a string often 
# (using foreach)
foreach($_ in (1..100000)) 
{
    # replace operator += with AppendLine():
	$null = $text.AppendLine("working on $_")
}

# when done, convert StringBuilder back to a string:
$stringText = $text.ToString()

# check results:
$report = 'composed string of length {0} in {1:n2} seconds' 
$report -f $stringText.Length, $stopwatch.Elapsed.TotalSeconds

Checking the Numbers…

Here are the results I got from the different tests:

#  Optimization                   Time    Factor
1  none                           68.5s
2  StringBuilder                  6.1s    12x
3  Pipeline Trick                 64.6s   1x
4  foreach loop                   66.9s   1x
5  StringBuilder, foreach loop    0.13s   >500x
6  StringBuilder, Pipeline Trick  0.13s   >500x

The Mystery

Here is the core of the mystery:

  • Using a StringBuilder instead of += improves performance by 12x (script 2)
  • Using foreach or direct ScriptBlock calls does not improve performance at all unless a StringBuilder is used instead of +=. Then however it improves performance by >500x

The expected result would have been that scripts #3 and #4 would show a somewhat better performance but they don’t. So the assumption is that using += is not just evil because of how the string is appended, but also evil because in addition it prevents some sort of other optimization.

Mystery Boiled Down

I know it’s complex, so check out the performance benefit you typically gain when moving away from Foreach-Object:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()


$x = 0

# using the pipeline with Foreach-Object
1..100000 | ForEach-Object {
    # increment a counter (no string operation involved)
    $x++
}

# check results:
$report = '$x={0} in {1:n2} seconds' 
$report -f $x, $stopwatch.Elapsed.TotalSeconds

It requires almost 10 seconds. Now take this one which uses the Pipeline Trick instead of Foreach-Object:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()


$x = 0

# using the pipeline with a dot-sourced scriptblock (Pipeline Trick)
1..100000 | . { process {
    # increment a counter (no string operation involved)
    $x++
}}

# check results:
$report = '$x={0} in {1:n2} seconds' 
$report -f $x, $stopwatch.Elapsed.TotalSeconds

It only takes 0.11 seconds, 90x faster. There is a huge (expected) performance benefit that you can also get from PowerShell Pipelines by using the Pipeline Trick.

Using “+=” Hits Loops

Now do the same with += on strings. The assumption is that things will be much slower but that foreach will continue to show a massive improvement. Let’s first use Foreach-Object:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

$x=""
# using the pipeline
1..100000 | ForEach-Object {
	$x+=$_
}

# check results:
$report = '$x={0} in {1:n2} seconds' 
$report -f $x.Length, $stopwatch.Elapsed.TotalSeconds

It takes 23 seconds. Let’s use foreach again instead:

# use a stopwatch to measure performance
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

$x=""
# using foreach instead of the pipeline
foreach($_ in (1..100000)) {
	$x+=$_
}

# check results:
$report = '$x={0} in {1:n2} seconds' 
$report -f $x.Length, $stopwatch.Elapsed.TotalSeconds

It takes 20 seconds to run. So the performance gain is just 1.15x.

The first two examples showed that PowerShell needs roughly 10 seconds to send 100,000 objects through Foreach-Object, while a loop without that overhead can iterate the same range in a fraction of a second.

So why do foreach and Foreach-Object perform almost the same once the operator += is used on strings?

Theories

There are a number of theories evolving. Fellow expert PowerSheller Mathias Jessen, better known as @IISResetMe, concluded: “I suspect that the memory pressure from allocating all of these long strings impacts the command runtime as well, so the quick .process{} iteration you expect slows down as well.”

He ran a version with 10,000 iterations through Measure-Script (from the awesome and free module MeasureScript) and plotted the runtime of each “+=” operation. Interestingly, around 2,600 iterations in (string size ~87KB) the runtime goes from ~0.15ms per append to spikes of 11-14ms.

[Image: plot of the runtime of each “+=” operation across the iterations]
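If you would like to look for this effect yourself without installing a profiler, here is a rough sketch. It is not the measurement described above: it simply times batches of 1,000 appends with a Stopwatch, assuming that batch timings are a good enough proxy for the per-operation cost. The batch size and the property names are arbitrary choices for this example.

```powershell
$text = ''
$batchSize = 1000
$stopwatch = New-Object System.Diagnostics.Stopwatch

# time 10 consecutive batches of += operations on the same growing string
$timings = foreach ($batch in 0..9) {
    $stopwatch.Restart()
    foreach ($i in 1..$batchSize) {
        $text += "working on $i`r`n"
    }
    $stopwatch.Stop()

    # emit one result object per batch
    [PSCustomObject]@{
        Iterations   = ($batch + 1) * $batchSize
        StringLength = $text.Length
        Milliseconds = $stopwatch.Elapsed.TotalMilliseconds
    }
}

$timings | Format-Table -AutoSize
```

If later batches take clearly longer than earlier ones, the cost of a single += grows with the size of the string. The exact numbers will differ from machine to machine and between PowerShell versions.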

Join the Discussion!

If you have an idea or clue, please leave a comment below so others can join the discussion. Of course you can also tweet, but it would really be nice to have all thoughts and comments right next to the article so we can all enjoy them.