Tokenizing PowerShell Scripts

By turning PowerShell code into tokens and structures, you can find errors, auto-document your code, and create powerful refactoring tools.

Colorful World of Tokens

Whenever you load PowerShell code into specialized editors, the code gets magically colored, and each color represents a given token type. The colors can help you understand how PowerShell interprets your code.

Generic editors without a built-in PowerShell engine, like Notepad++ or VS Code, use complex regular expressions to try and identify the correct tokens. A 100% precise tokenization, however, comes directly from the PowerShell Parser and is not the result of generic RegEx rules. In this article series, we’ll look at all the goodness the PowerShell Parser is willing to share with you.

At the end of today, you get a new command: Test-PSOneScript parses one - or thousands - of PowerShell files and always returns 100% accurate tokens in no time. It is part of our PSOneTools module, so just install the latest version to get your hands on the command, or use the source code presented later in this article.

Install-Module -Name PSOneTools -Scope CurrentUser -Force

With tokens you can do a whole bunch of interesting things, for example:

  • Auto-document code and create lists of variables, commands, or method calls found in a script
  • Identify syntax errors that make the parser choke
  • Perform a security analysis and identify scripts using risky commands

PSParser Overview

The PSParser is the original parser built into the early versions of PowerShell. Even though it is old, it is still part of all PowerShell versions and very useful because of its simplicity. It distinguishes 20 different token types:

PS> [Enum]::GetNames([System.Management.Automation.PSTokenType]).Count
20

PS> [Enum]::GetNames([System.Management.Automation.PSTokenType]) | Sort-Object
Attribute
Command
CommandArgument
CommandParameter
Comment
GroupEnd
GroupStart
Keyword
LineContinuation
LoopLabel
Member
NewLine
Number
Operator
Position
StatementSeparator
String
Type
Unknown
Variable

When you use PSParser to tokenize PowerShell code, it reads your code character by character and groups the characters into meaningful words: the tokens. If the PSParser encounters characters it isn’t expecting, it generates syntax errors, e.g. when a string starts with a double quote but ends with a single quote.

Tokenizing PowerShell Code

Use Tokenize() to tokenize PowerShell code. Here is a simple example:

# the code that you want tokenized:
$code = {
  # this is some test code
  $service = Get-Service |
    Where-Object Status -eq Running
}


# create a variable to receive syntax errors:
$errors = $null
# tokenize PowerShell code:
$tokens = [System.Management.Automation.PSParser]::Tokenize($code, [ref]$errors)

# analyze errors:
if ($errors.Count -gt 0)
{
  # move the nested token up one level so we see all properties:
  $syntaxError = $errors | Select-Object -ExpandProperty Token -Property Message
  $syntaxError
}
else
{
  $tokens
}

Tokenize() expects the code you want to tokenize, plus an empty variable that it can fill with any syntax errors. Because the variable $errors is empty when Tokenize() starts and gets filled while the method parses the code, it must be submitted by reference (as a memory pointer), which in PowerShell is done through [ref].
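The [ref] mechanism is not specific to the parser. Here is a minimal sketch illustrating how any function can fill a variable submitted by reference (the function name Set-Result is made up for illustration):

```powershell
# a function that writes a result into a variable passed by reference:
function Set-Result
{
  param([ref]$Target)

  # [ref] wraps the variable, so assign to its Value property:
  $Target.Value = 'filled by the function'
}

# the variable starts out empty...
$result = $null
# ...and gets filled through the [ref] pointer:
Set-Result -Target ([ref]$result)
$result
```

This is exactly what Tokenize() does with $errors: the method writes into the memory location your variable points to.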

When Tokenize() completes, you receive all tokens as return value in $tokens, plus any syntax errors in $errors.

Looking at Tokens

This is what the first three tokens returned in $tokens look like:

PS> $tokens[0..2]


Content     : 
              
Type        : NewLine
Start       : 0
Length      : 2
StartLine   : 1
StartColumn : 1
EndLine     : 2
EndColumn   : 1

Content     : # this is some test code
Type        : Comment
Start       : 4
Length      : 24
StartLine   : 2
StartColumn : 3
EndLine     : 2
EndColumn   : 27

Content     : 
              
Type        : NewLine
Start       : 28
Length      : 2
StartLine   : 2
StartColumn : 27
EndLine     : 3
EndColumn   : 1

Each token is represented by a PSToken object which returns the token content as string, the token type, and the exact position where the token was found.
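Because each token records Start and Length, you can cut the raw token text directly out of the original source code. A small self-contained sketch (using a one-liner instead of the scriptblock from above):

```powershell
# tokenize a simple piece of code:
$code = '$service = Get-Service'
$errors = $null
$tokens = [System.Management.Automation.PSParser]::Tokenize($code, [ref]$errors)

# use Start and Length to extract each token's raw text from the source:
$tokens | ForEach-Object {
  [PSCustomObject]@{
    Type = $_.Type
    Text = $code.Substring($_.Start, $_.Length)
  }
}
```

Note that the extracted text can differ from the Content property: a Variable token, for example, stores its Content without the leading "$", while the raw source text includes it.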

How Syntax Errors Work

If the parser encounters unexpected characters while parsing the code, it generates a syntax error. The parser continues parsing, so there can be multiple syntax errors returned.

Let’s create a syntax error and send a string to the parser that is missing its ending quote.

To send faulty code to the parser, you cannot use a scriptblock, though: scriptblocks are smart and only accept formally correct PowerShell code. That’s why you have to submit the faulty PowerShell code to the parser as a string instead of a scriptblock.

# the code that you want tokenized:
$code = "
  'Hello
"

When you run the script again, it now returns the syntax error(s):

PS> $syntaxError


Message     : The string is missing the terminator: '.
Content     : 'Hello
              
Type        : Position
Start       : 4
Length      : 8
StartLine   : 2
StartColumn : 3
EndLine     : 3
EndColumn   : 1

Improving Parser Error Objects

The parser emits a PSParseError object per syntax error which looks like this:

PS> $errors

Token                                Message                                 
-----                                -------                                 
System.Management.Automation.PSToken The string is missing the terminator: '.

Unfortunately, the token details are hidden inside the property Token. So I use a little-known trick to make all properties visible immediately:

Select-Object supports the use of -Property and -ExpandProperty at the same time. So I used -ExpandProperty to take the PSToken object out of Token, plus used -Property to attach the original property Message to the extracted token. As a result, all properties show up immediately:

PS> $errors | Select-Object -ExpandProperty Token -Property Message


Message     : The string is missing the terminator: '.
Content     : 'Hello
              
Type        : Position
Start       : 4
Length      : 8
StartLine   : 2
StartColumn : 3
EndLine     : 3
EndColumn   : 1

Examining Real Scripts

To examine real file-based scripts, simply embed the logic from above inside a pipeline-aware function. Test-PSOneScript does exactly this and makes parsing PowerShell files a snap:

function Test-PSOneScript
{
  <#
      .SYNOPSIS
      Parses a PowerShell Script (*.ps1, *.psm1, *.psd1)

      .DESCRIPTION
      Invokes the simple PSParser and returns tokens and syntax errors

      .EXAMPLE
      Test-PSOneScript -Path c:\test.ps1
      Parses the content of c:\test.ps1 and returns tokens and syntax errors

      .EXAMPLE
      Get-ChildItem -Path $home -Recurse -Include *.ps1,*.psm1,*.psd1 -File |
         Test-PSOneScript |
         Out-GridView

      parses all PowerShell files found anywhere in your user profile

      .EXAMPLE
      Get-ChildItem -Path $home -Recurse -Include *.ps1,*.psm1,*.psd1 -File |
         Test-PSOneScript |
         Where-Object Errors

      parses all PowerShell files found anywhere in your user profile
      and returns only those files that contain syntax errors

      .LINK
      https://powershell.one
  #>


  param
  (
    # Path to PowerShell script file
    # can be a string or any object that has a "Path" 
    # or "FullName" property:
    [String]
    [Parameter(Mandatory,ValueFromPipeline)]
    [Alias('FullName')]
    $Path
  )
  
  begin
  {
    $errors = $null
  }
  process
  {
    # create a variable to receive syntax errors:
    $errors = $null
    # tokenize PowerShell code:
    $code = Get-Content -Path $Path -Raw -Encoding Default
    
    # return the results as a custom object
    [PSCustomObject]@{
      Name = Split-Path -Path $Path -Leaf
      Path = $Path
      Tokens = [Management.Automation.PSParser]::Tokenize($code, [ref]$errors)
      Errors = $errors | Select-Object -ExpandProperty Token -Property Message
    }  
  }
}

Parsing Individual Files

To parse an individual file, simply submit its path to Test-PSOneScript. It immediately returns the tokens and any syntax errors (if present):

$Path = "C:\Users\tobia\test.ps1"
$result = Test-PSOneScript -Path $Path

Checking for Errors

Let’s start with checking whether the script file has syntax errors:

PS> $result.Errors.Count -gt 0
False

To get a list of all token types present in the script, try this (the output may vary depending on the actual code in your script file, of course):

PS> $result.Tokens.Type | Sort-Object -Unique
Command
CommandParameter
CommandArgument
Number
String
Variable
Member
Type
Operator
GroupStart
GroupEnd
Keyword
Comment
NewLine

Creating a List of Used Variables

To get a list of all variables used in the script, simply filter for token type Variable:

PS> $result.Tokens | 
  Where-Object Type -eq Variable | 
  Sort-Object -Property Content -Unique | 
  ForEach-Object { '${0}' -f $_.Content}

$_ldaptype
$_SortedReportProp
$AD_Capabilities
$AD_CreateDiagrams
$AD_CreateDiagramSourceFiles
$AD_DomainGPOs
...
$xlEqual
$zipPackage
$ZipReport
$ZipReportName

Creating a List of Used Commands

Likewise, if you’d like to get a list of commands used by the script, filter for the appropriate token type (Command):

PS> $result.Tokens | 
  Where-Object Type -eq Command | 
  Sort-Object -Property Content -Unique | 
  Select-Object -ExpandProperty Content

Add-Content
Add-Member
Add-Type
Add-Zip
Append-ADUserAccountControl
ConvertTo-HashArray
ConvertTo-Html
...
Start-sleep
Test-Path
Where
write-error
Write-Output
Write-Verbose
Write-Warning

You can even analyze the frequency of how often commands were used. This gets you the 10 most-often used commands:

PS> $result.Tokens | 
  Where-Object Type -eq Command | 
  Select-Object -ExpandProperty Content |
  Group-Object -NoElement |
  Sort-Object -Property Count -Descending |
  Select-Object -First 10

Count Name                     
----- ----                     
   51 Search-AD                
   49 New-Object               
   35 Write-Verbose            
   29 get-date                 
   25 %                        
   24 New-TimeSpan             
   24 Where                    
   21 select                   
   19 Sort-Object              
   17 Invoke-Method            

Analyzing Use of .NET Methods

Maybe you are interested in finding out which native .NET methods the script uses. Again, it is just a matter of token filtering:

PS> $result.Tokens | 
  Where-Object Type -eq Member | 
  Select-Object -ExpandProperty Content |
  Sort-Object -Unique
  
Accessible
ActiveSheet
Add
AdjacentSites
adminDisplayName
...
whenchanged
whencreated
Workbooks
Worksheets

At this point, you are reaching the limit of token analysis:

While it is nice to get a list of method names used by a script, it is not really useful. You’d need a bigger picture to know the object types the called methods belong to. All this is possible, too, but not with tokens alone. What’s required is a look at script structures that consist of multiple tokens - a case for the Abstract Syntax Tree (AST) which we shed light on in one of the next parts of this series.
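As a quick preview of what’s coming, the modern parser hands back tokens, syntax errors, and the AST in one call, and the AST can be searched for whole structures. This is just a minimal sketch, not the full treatment:

```powershell
# the modern parser returns tokens, errors, AND the Abstract Syntax Tree:
$tokens = $null
$errors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput(
  '$date = Get-Date', [ref]$tokens, [ref]$errors)

# the AST can be searched for structures, i.e. all command calls:
$commands = $ast.FindAll(
  { $args[0] -is [System.Management.Automation.Language.CommandAst] }, $true)

# a CommandAst knows its command name plus all of its arguments:
$commands | ForEach-Object { $_.GetCommandName() }
```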

Bulk-Analysis: Scanning Entire Folders

Test-PSOneScript can’t just examine one file at a time. It is fully pipeline-aware and knows how to deal with files returned by Get-ChildItem.

Finding Scripts With Errors

So if you want to identify scripts with syntax errors anywhere in your script library, simply run Get-ChildItem to gather the files to be tested, and pipe them to Test-PSOneScript like this:

# get all PowerShell files from your user profile...
Get-ChildItem -Path $home -Recurse -Include *.ps1, *.psd1, *.psm1 -File |
  # ...parse them...
  Test-PSOneScript |
  # ...filter those with syntax errors...
  Where-Object Errors |
  # ...expose the errors:
  ForEach-Object {
    [PSCustomObject]@{
      Name = $_.Name
      Error = $_.Errors[0].Message
      Type = $_.Errors[0].Type
      Line = $_.Errors[0].StartLine
      Column = $_.Errors[0].StartColumn
      Path = $_.Path
    }
  }

This will find any script with any syntax error. If you’d like to be more specific, you can filter on the error message.
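For example, to report only scripts that suffer from unterminated strings, add a filter on the error message (a sketch; the wildcard pattern is an assumption and can be any text found in the messages you care about):

```powershell
# find only scripts with unterminated strings:
Get-ChildItem -Path $home -Recurse -Include *.ps1, *.psd1, *.psm1 -File |
  Test-PSOneScript |
  Where-Object Errors |
  # keep only files where at least one error message mentions a missing terminator:
  Where-Object { $_.Errors.Message -like '*missing the terminator*' } |
  Select-Object -Property Name, Path
```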

Identifying Risky Commands

The sky is the limit, so if you’d like to identify scripts that use risky commands such as Invoke-Expression, just adjust the filter:

$blacklist = @('Invoke-Expression', 'Stop-Computer', 'Restart-Computer')


# get all PowerShell files from your user profile...
Get-ChildItem -Path $home -Recurse -Include *.ps1, *.psd1, *.psm1 -File |
  # ...parse them...
  Test-PSOneScript |
  # ...filter those using commands in our blacklist...
  ForEach-Object {
    # get the first token that is a command and that is in our blacklist:
    $badToken = $_.Tokens.Where{$_.Type -eq 'Command'}.Where{$_.Content -in $blacklist} |
      Select-Object -First 1

    if ($badToken)
    {
      $_ | Add-Member -MemberType NoteProperty -Name BadToken -Value $badToken -PassThru
    }
  } |
  # ...and expose the offending command:
  ForEach-Object {
    [PSCustomObject]@{
      Name = $_.Name
      Offender = $_.BadToken.Content
      Line = $_.BadToken.StartLine
      Column = $_.BadToken.StartColumn
      Path = $_.Path
    }
  }

What’s Next

Using the PSParser is just your first step into the wonderful world of tokens and script analysis. In the next part we’ll take a look at the more sophisticated Parser object which was introduced in PowerShell 3 and differentiates 150 different token kinds plus 26 token flags.
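You can already take a peek at this richer token vocabulary yourself (the exact counts depend on your PowerShell version):

```powershell
# count the token kinds the modern parser distinguishes:
[Enum]::GetNames([System.Management.Automation.Language.TokenKind]).Count

# count the token flags that can further qualify each token:
[Enum]::GetNames([System.Management.Automation.Language.TokenFlags]).Count
```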

And if that’s still not enough detail, we look into the Abstract Syntax Tree (AST) and how it forms meaningful structures from groups of tokens.

BTW, have you checked out PowerShell Conference EU yet? Both Call for Papers and Delegate Registration are open!