Parsing Text

The easiest way to extract valuable information from raw text is to use PowerShell's built-in parsers. This works great for tabular text.

Plain text is a common data format: many console commands emit plain text, and log files frequently organize their data in plain-text formats of their own.

Even though plain text - unlike structured data such as json or xml - has no intrinsic formatting, in the majority of cases the content is organized in a tabular way: one line per dataset, with the items of each dataset either separated by a delimiter or aligned in fixed-width columns.

In this article, I’ll look at a number of use cases and illustrate how you can quickly parse raw text that is organized in a tabular way.

Automatically Parsing Raw Tabular Text

Whenever your plain text data is structured like this, you can use PowerShell's powerful built-in parser ConvertFrom-Csv. This cmdlet was originally intended to parse pure csv data, but in reality it can be adjusted to parse just about any raw text that uses a tabular design:

  • -Delimiter: ConvertFrom-Csv uses a comma as its default delimiter, but you can override this and choose any delimiter you want, so you can also parse text that uses delimiters such as ; or tabs. The only restriction: a delimiter must be a single character (see the sketch after this list).
  • -Header: The first line defines the headers (property names). If your raw text comes with no headers, you can add your own. And to rename headers to your liking, just remove the first line and use your own headers instead.
  • Fixed-Width Columns: Even if your raw text does not have a delimiter and instead uses fixed-width columns, you can still use ConvertFrom-Csv. With a simple trick, you can convert fixed-width columns to csv.
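To illustrate the -Delimiter parameter, here is a minimal sketch parsing made-up semicolon-separated data:

# made-up raw text using ';' as delimiter:
@'
Name;Id;Code
Tobi;488;DE
Jeffrey;512;US
'@ | ConvertFrom-Csv -Delimiter ';'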

Reading Text Files: Encoding

Import-Csv is a close relative that you can use in place of ConvertFrom-Csv whenever you need to read data from text files.

In essence, Import-Csv is the combination of Get-Content | ConvertFrom-Csv. Whenever you read text information from files, keep an eye on the encoding, or else you may end up with mangled special characters. Import-Csv and Get-Content both support the parameter -Encoding.

The arguments UTF8 and Default are your best bets here. Despite its name, -Encoding Default is not the default: in Windows PowerShell it maps to your system's active (ANSI) code page, which often matches what legacy console tools emit. If that produces garbled characters, the most commonly used encoding UTF8 is the next-best choice.
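For example, reading csv data from a file could look like this (a minimal sketch; data.csv is a hypothetical path):

# read and convert csv data from a file:
Import-Csv -Path 'data.csv' -Encoding Default

# this is essentially the same as:
Get-Content -Path 'data.csv' -Encoding Default | ConvertFrom-Csv

# if special characters come out garbled, try UTF8 instead:
Import-Csv -Path 'data.csv' -Encoding UTF8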

Creating Objects From Text

To get familiar with the csv parser, first take a look at this simple example:

# raw tabular text data:
$data = @'
Name,Id,Code
Tobi,488,DE
Jeffrey,512,US
Rob,923,UK
Alex,726,SR
'@ 

# convert plain text:
$result = $data | ConvertFrom-Csv

# emit results:
$result

The raw text piped into ConvertFrom-Csv uses a header row and the default comma delimiter. You get back a series of objects:

Name    Id  Code
----    --  ----
Tobi    488 DE
Jeffrey 512 US
Rob     923 UK
Alex    726 SR

In fact, using ConvertFrom-Csv like this can be a useful trick to quickly generate new objects.
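For example, here is a quick sketch that turns two made-up strings into test objects:

# generate objects on the fly from csv-formatted strings:
'Server1,Online', 'Server2,Offline' | ConvertFrom-Csv -Header 'Name','Status'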

Since they are objects, you can now do with them whatever you like, for example sort or filter them:

$result | Sort-Object -Property Name | Where-Object Code -ne 'DE' | Out-GridView

Renaming Headers

If you don’t like the original header names, simply remove the first line of your input data and submit your own headers.

In real-world scenarios, console commands and cmdlets like Get-Content provide you with a stream of lines, so I converted $data into a string array of lines as well. Now it's simple to use Select-Object -Skip 1 to remove the header line and substitute your own headers:

# raw tabular text data:
$data = @'
Name,Id,Code
Tobi,488,DE
Jeffrey,512,US
Rob,923,UK
Alex,726,SR
'@ 

# pretend the input data is a series of lines (that's what you
# get from Get-Content or from command line tools):
$lines = $data.Split("`n")

# process each line, and run Trim() to remove carriage returns that
# were left over from splitting:
$result = $lines.Trim() | 
  # remove first line
  Select-Object -Skip 1 |
  # convert and use custom headers:
  ConvertFrom-Csv -Header 'PowerSheller', 'Code', 'Country'

# emit results:
$result

That’s what the result looks like now:

PowerSheller Code Country
------------ ---- -------
Tobi         488  DE     
Jeffrey      512  US     
Rob          923  UK     
Alex         726  SR 

Using Alternative Delimiters

If your raw text uses a delimiter different from a comma, simply specify it. That's really all.

Let’s assume the text uses tabs as delimiters:

# raw tabular text data using TAB as delimiter:
$data = @"
Name`tId`tCode
Tobi`t488`tDE
Jeffrey`t512`tUS
Rob`t923`tUK
Alex`t726`tSR
"@
 

# convert plain text:
$result = $data | ConvertFrom-Csv -Delimiter "`t"

# emit results:
$result

As long as you specify the correct delimiter, the raw text is parsed like before.

Using Fixed-Width Columns

One of the most challenging input formats is text with fixed column widths. Most of the time, that’s when custom parsers are required that take into account the column widths.

However, often you can get away with converting such input using csv format:

# raw tabular text data using fixed width columns:
$data = @"
Name     Id    Code
Tobi     488   DE
Jeffrey  512   US
Rob      923   UK
Alex     726   SR
"@
 
# replace 2 or more spaces with ONE comma:
$dataNew = $data -replace '[ ]{2,}',','

# convert plain text:
$result = $dataNew | ConvertFrom-Csv

Key here is a Regular Expression that identifies two or more spaces and replaces them with exactly one comma, essentially turning a fixed-width design into a csv design.

  • [ ]: represents a space. Using \s instead would also match tabs and line breaks.
  • {2,}: looks for 2 or more instances, effectively matching runs of two or more spaces.

This simple approach works extremely well in many scenarios:

# take output from qprocess.exe and turn into objects
(qprocess) -replace '[ ]{2,}',',' | ConvertFrom-Csv | Out-GridView

Limitations

This approach does have important limitations that you need to consider, because two or more spaces effectively become the new delimiter. Whenever columns are no longer separated by two or more spaces (maybe there is just one space, or none at all), or whenever the content of a column itself contains two or more spaces, the parser gets confused.

That is also why I chose two or more spaces over one or more: it at least eliminates the conflict with single spaces in column content.
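Here is a sketch of such a conflict, using made-up data where the column content itself contains two spaces:

# made-up data: the remark contains two consecutive spaces
$broken = @"
Name     Remark
Tobi     watch  out
"@

# the double space inside the remark is treated as a delimiter,
# so the remark is torn apart at that position:
$broken -replace '[ ]{2,}',',' | ConvertFrom-Csv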

Using RegEx-Based Parsing

This article is about using the built-in csv parser, not about Regular Expressions. Regular Expressions are a different story that I'll tell in a different article.

However, when your input data runs into conflicts like those described above, it's time to create your own parser logic. Regular Expressions can be very helpful here, so I'd like to illustrate how a Regular Expression pattern can parse fixed-width raw text input.

To make things harder, the raw text below fills an entire column without any delimiting spaces (see Jeffrey), contains column content with one or two spaces (see column Remark), and leaves out data for some datasets (see Rob and Alex) - a situation where the simple two-or-more-spaces delimiter approach no longer works:

# raw tabular text data using fixed width columns:
$data = @"
Name   Id    Remark
Tobi   488   contains spaces
Jeffrey512   contains  two  spaces
Rob    923
Alex   726
"@
 
# define header names and column widths as a regex pattern:
# Name:   7 chars
# Id:     6 chars
# Remark: 1 or more
$pattern = '^(?<Name>.{7}){0,1}(?<Id>.{6}){0,1}(?<Remark>.{1,}){0,1}$'

# split data into lines, and process:
$lines = $data.Split("`n") 
$lines.Trim() |
  # skip header line
  Select-Object -Skip 1 |
  # parse each line
  ForEach-Object {
    if ($_ -match $pattern)
    {
      # remove full match, leaving only groups:
      $matches.Remove(0)
      # convert to object
      [PSCustomObject]$matches
    }
  } |
  # put properties in desired order
  Select-Object -Property Name, Id, Remark

When you run the code, the raw text is correctly turned into objects:

Name    Id     Remark               
----    --     ------               
Tobi    488    contains spaces      
Jeffrey 512    contains  two  spaces
Rob            923                  
Alex           726 

Key is this Regular Expression: '^(?<Name>.{7}){0,1}(?<Id>.{6}){0,1}(?<Remark>.{1,}){0,1}$'.

Essentially, it defines the headers: choose the name in angle brackets, and specify the column width in curly braces.

The quantifier {0,1} makes sure the pattern still matches when datasets have missing content. For example, the datasets Rob and Alex do not provide information for Remark. Note that because Trim() removed the trailing padding from these short lines, the remaining '923' and '726' are too short for the six-character Id column; the optional Id group is skipped, and the values end up in Remark instead - exactly what the output above shows.

The remainder of the code removes the original header line and then matches each line against the pattern. The result is a hashtable which can easily be turned into objects. Since regular hashtables do not preserve key order, a final Select-Object arranges the properties in any way you like.
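As an alternative sketch, you could construct each object explicitly from the named capture groups. A [PSCustomObject] literal preserves property order, so the final Select-Object becomes unnecessary:

# replace the scriptblock in the pipeline above with this one:
ForEach-Object {
  if ($_ -match $pattern)
  {
    # build the object explicitly; property order is preserved:
    [PSCustomObject]@{
      Name   = $matches['Name']
      Id     = $matches['Id']
      Remark = $matches['Remark']
    }
  }
}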

Parsing Real World Csv Data

Let's apply these techniques to some real-world examples. I'll be using the tool driverquery.exe to produce csv output. This is just for illustration. You can of course use any command that produces csv output, or read csv data from files.

Like many command line tools, driverquery.exe supports the parameter /FO (format output) and can output driver information in csv format:

driverquery /FO:CSV

The result looks similar to this:

"Module Name","Display Name","Driver Type","Link Date"
"1394ohci","1394 OHCI Compliant Host Controller","Kernel ",""
"3ware","3ware","Kernel ","19.05.2015 00:28:03"
"ACPI","Microsoft ACPI Driver","Kernel ",""
"AcpiDev","ACPI Devices driver","Kernel ",""
"acpiex","Microsoft ACPIEx Driver","Kernel ",""
"acpipagr","ACPI Processor Aggregator Driver","Kernel ",""
"AcpiPmi","ACPI Power Meter Driver","Kernel ",""
"acpitime","ACPI Wake Alarm Driver","Kernel ",""
"Acx01000","Acx01000","Kernel ",""
...

Take a close look at the raw data and watch out for caveats!

The values in the column "Driver Type" apparently carry a trailing space. You may have to take this into account when you filter values later.

Converting Csv To Objects

To turn the raw csv output into useful objects, run it through ConvertFrom-Csv:

driverquery /FO:CSV | ConvertFrom-Csv 

To read csv data directly from a file, use Import-Csv instead.

The results are true objects that you can process like any other objects:

Module Name Display Name                        Driver Type Link Date
----------- ------------                        ----------- ---------
1394ohci    1394 OHCI Compliant Host Controller Kernel
3ware       3ware                               Kernel      19.05.2015 00:28:03
ACPI        Microsoft ACPI Driver               Kernel
AcpiDev     ACPI Devices driver                 Kernel
acpiex      Microsoft ACPIEx Driver             Kernel
acpipagr    ACPI Processor Aggregator Driver    Kernel
AcpiPmi     ACPI Power Meter Driver             Kernel
acpitime    ACPI Wake Alarm Driver              Kernel
Acx01000    Acx01000                            Kernel
ADP80XX     ADP80XX                             Kernel      09.04.2015 22:49:48
...

Polishing Raw Data

As you may have discovered earlier by looking at the raw data, driverquery.exe appends a space to the driver types. These trailing spaces can make it harder to filter values later.

If you like, you can easily clean up any issues you identified, though:

# remove all leading and trailing spaces from property 'Driver Type':
driverquery /FO:CSV | ConvertFrom-Csv | Foreach-Object { $_.'Driver Type' = $_.'Driver Type'.Trim(); $_}

When you use Foreach-Object to polish individual objects, always make sure you return the changed object. That’s why the code emits $_ at the end of the scriptblock.
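If you prefer not to modify objects in place, a calculated property can do the same cleanup; here is a sketch:

# alternative: rebuild 'Driver Type' via a calculated property:
driverquery /FO:CSV | ConvertFrom-Csv |
  Select-Object -Property 'Module Name', 'Link Date',
    @{ Name = 'Driver Type'; Expression = { $_.'Driver Type'.Trim() } }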

Filtering And Sorting Results

Once ConvertFrom-Csv has converted the csv text to objects, you can use all the common pipeline cmdlets to filter, sort, and select the information you need.

To get a sorted list of filesystem drivers, simply use the typical pipeline cmdlets:

# produce sorted list of file system drivers:
driverquery /FO:CSV | ConvertFrom-Csv |
  ForEach-Object { $_.'Driver Type' = $_.'Driver Type'.Trim(); $_ } |
  Where-Object 'Driver Type' -eq 'File System' |
  Sort-Object -Property 'Display Name' |
  Select-Object -Property 'Display Name', 'Module Name', 'Link Date' |
  Out-GridView

Renaming Properties

Since you are starting with well-formatted tabular raw text data, you don’t need to live with the pre-defined property names. After all, these are defined in the first row of the csv file. By replacing this first line before converting the text, you can rename properties to anything you like.

This example illustrates how you remove the first line and replace it with your own list of column headers:

# rename column headers to custom list
$newHeaders = 'Module','Name','Type','Date'

# replace column headers:
driverquery /FO:CSV | Select-Object -Skip 1 | ConvertFrom-Csv -Header $newHeaders

Select-Object -Skip 1 removes the first line. ConvertFrom-Csv (and Import-Csv) support the parameter -Header that lets you define your own headers.

If you come across csv data with no headers, you can use -Header to easily add headers.
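A quick sketch with made-up header-less data:

# csv data without a header row receives custom headers:
@'
Tobi,488,DE
Jeffrey,512,US
'@ | ConvertFrom-Csv -Header 'Name','Id','Code'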

Creating System Reports

The strategies outlined above can be applied to any kind of csv information. For example, the tool systeminfo.exe collects useful information about your computer system and by default outputs it as plain text.

Simply ask systeminfo.exe to output the data in csv format instead, and apply the tricks above to get real objects:

systeminfo.exe /FO:CSV | ConvertFrom-Csv | Out-GridView

When you run this line from inside the ISE editor, you may run into encoding problems because the ISE editor uses a hidden console to run command line tools. This issue is not present when you run the command in a real PowerShell console.

Standardizing Column Names

The default column names produced by systeminfo.exe are problematic for two reasons:

  • Spaces: The column names frequently contain spaces, which makes it harder to access these properties later because you would have to quote the property names.
  • Localization: The output of command line tools typically is localized, so when you run a tool on a US system, both column names and (some) content are English, whereas on German systems they would be German. While this may be tolerable for the actual content, column (property) names should always be identical. Otherwise, when you collect information from systems with different locales, PowerShell wouldn't know how to combine the results into tables.

That’s why you should use the simple trick from above and standardize the column names. First, extract the original column names like this:

systeminfo.exe /FO:CSV | Select-Object -First 1 | Set-Clipboard

Now, the first line of the csv output is available in your clipboard. Next, paste this line into a script, edit the column names in any way you like, and use the edited list of column names in place of the original ones.

You can, for example, translate property names and remove any special characters like spaces or parentheses:

# list of edited properties:
$newHeader = "HostName","OSName","OSVersion","OSManufacturer","OSConfiguration","OSBuildType","RegisteredOwner","RegisteredOrganization","ProductID","OriginalInstallDate","SystemBootTime","SystemManufacturer","SystemModel","SystemType","Processor","BIOSVersion","WindowsDirectory","SystemDirectory","BootDevice","SystemLocale","InputLocale","TimeZone","TotalPhysicalMemory","AvailablePhysicalMemory","VirtualMemoryMaxSize","VirtualMemoryAvailable","VirtualMemoryInUse","PageFileLocation","Domain","LogonServer","Hotfix","NetworkCard","HyperVRequirements"

# get system info and convert to objects with standardized properties:
$info = systeminfo.exe /FO:CSV | Select-Object -Skip 1 | ConvertFrom-Csv -Header $newHeader

# output selected information:
$info.TimeZone
$info.TotalPhysicalMemory
$info.OSVersion
$info | Out-GridView

Now the information produced by the code can be accessed via the standardized property names, regardless of the locale the computer uses.

Get-MySystemInfo

While we're at it, let's bake all of this into a brand-new PowerShell function called Get-MySystemInfo as a last step, and add support for remoting so you can retrieve the information locally and remotely, and even in parallel:

function Get-MySystemInfo
{
  param
  (
    # one or more computer names. When omitted, the local computer is used
    # Remote machines MUST have remoting enabled, and you need appropriate permissions
    [string[]]
    $ComputerName,
    
    [PSCredential]
    $Credential
  )
  
  # code to execute locally or remotely:
  $code = {
    # list of edited properties:
    $newHeader = "HostName","OSName","OSVersion","OSManufacturer","OSConfiguration","OSBuildType","RegisteredOwner","RegisteredOrganization","ProductID","OriginalInstallDate","SystemBootTime","SystemManufacturer","SystemModel","SystemType","Processor","BIOSVersion","WindowsDirectory","SystemDirectory","BootDevice","SystemLocale","InputLocale","TimeZone","TotalPhysicalMemory","AvailablePhysicalMemory","VirtualMemoryMaxSize","VirtualMemoryAvailable","VirtualMemoryInUse","PageFileLocation","Domain","LogonServer","Hotfix","NetworkCard","HyperVRequirements"
  
    # get system info and convert to objects with standardized properties:
    systeminfo.exe /FO:CSV | Select-Object -Skip 1 | ConvertFrom-Csv -Header $newHeader
  }
  
  # invoke the code locally or remotely, and use splatting to forward parameters only when the
  # user actually specified them 
  # (this way you can run the code locally w/o administrator privileges and remoting prerequisites)
  Invoke-Command -ScriptBlock $code @PSBoundParameters 
}

To get the local system information, use this:

Get-MySystemInfo

To query one or more remote systems, try this:

Get-MySystemInfo -ComputerName 'server1', 'server2', 'server3'

And to authenticate with a different user account, use the parameter -Credential.

Remoting automatically adds the property PSComputerName to your results so when you query more than one system, you can always identify the machine that emitted the data.
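For example, a sketch using hypothetical server names:

# query two servers with explicit credentials, and show which
# machine each record came from:
Get-MySystemInfo -ComputerName 'server1','server2' -Credential (Get-Credential) |
  Select-Object -Property PSComputerName, OSVersion, TotalPhysicalMemory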

The function uses the default PowerShell remoting. In order to access remote systems, these must have remoting turned on, and you need to have the proper permissions to access them.

Since remoting works in parallel, you can query an entire list of servers and get back the results almost instantaneously.

Servers have remoting enabled by default; clients do not. The default authentication method is Kerberos, so you need to specify computer names, not IP addresses. If you must use IP addresses and/or access systems outside a domain, make sure you enable NTLM authentication on the client side.
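One common way to allow NTLM on the client side is the TrustedHosts list; here is a sketch (the IP address is hypothetical, and the command requires an elevated console):

# allow NTLM-based remoting to a specific address:
Set-Item -Path WSMan:\localhost\Client\TrustedHosts -Value '10.0.0.5' -Force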