Plain text is a common data format. Many console commands emit plain text, and log files frequently use plain text, organizing their data in their own formats.
Even though plain text - unlike structured data such as json or xml - has no intrinsic formatting, in the majority of cases the content is organized in a tabular way: one line per dataset, with the items of each dataset separated by a delimiter or aligned in fixed-width columns.
In this article, I’ll look at a number of use cases and illustrate how you can quickly parse raw text that is organized in a tabular way.
Automatically Parsing Raw Tabular Text
Whenever your plain text data is structured like this, you can use PowerShell's powerful built-in parser ConvertFrom-Csv. This cmdlet was originally intended to parse pure csv data, but in reality it can be adjusted to parse just about any raw text that uses a tabular design:
- -Delimiter: ConvertFrom-Csv uses the comma as its default delimiter, but you can override this and choose any delimiter you want, so you can also parse text that uses delimiters such as ; or tabs (see the quick sketch after this list). The only restriction: a delimiter must be a single character.
- -Header: The first line defines the headers (property names). If your raw text comes with no headers, you can add your own. And to rename headers to your liking, just remove the first line and use your own headers instead.
- Fixed-Width Columns: Even if your raw text does not have a delimiter and instead uses fixed-width columns, you can still use ConvertFrom-Csv. With a simple trick, you can convert fixed-width columns to csv.
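Here is the quick sketch mentioned above: semicolon-delimited raw text parses just as easily once you override the default delimiter. Each of the three scenarios is covered in detail below.
# quick sketch: parse semicolon-delimited raw text
@'
Name;Id;Code
Tobi;488;DE
Jeffrey;512;US
'@ | ConvertFrom-Csv -Delimiter ';'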
Reading Text Files: Encoding
Import-Csv is a close relative that you can use in place of ConvertFrom-Csv whenever you need to read data from text files. In essence, Import-Csv is the combination of Get-Content | ConvertFrom-Csv. Whenever you read text information from files, keep an eye on the encoding, or else you may end up with deranged special characters. Import-Csv and Get-Content both support the parameter -Encoding.
The arguments UTF8 and Default are your best bets here. -Encoding Default is not the default and in fact is not an encoding flavor at all: Default instructs PowerShell to figure out the encoding dynamically. This is the best approach, unless of course the encoding can't be identified. That's when falling back to the most commonly used encoding, UTF8, is the next-best approach.
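For example, to read csv data from a file with a known encoding (the file path below is just a placeholder, not from the original examples):
# read csv data from a file, specifying the encoding explicitly
# (the path is a placeholder - substitute your own file):
$result = Import-Csv -Path C:\data\drivers.csv -Encoding UTF8
$result | Out-GridView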
Creating Objects From Text
To get familiar with the csv parser, first take a look at this simple example:
# raw tabular text data:
$data = @'
Name,Id,Code
Tobi,488,DE
Jeffrey,512,US
Rob,923,UK
Alex,726,SR
'@
# convert plain text:
$result = $data | ConvertFrom-Csv
# emit results:
$result
The raw text piped into ConvertFrom-Csv uses a header row and a delimiter. You get back a series of objects:
Name    Id  Code
----    --  ----
Tobi    488 DE
Jeffrey 512 US
Rob     923 UK
Alex    726 SR
In fact, using ConvertFrom-Csv like this can be a useful trick to quickly generate new objects.
Since they are objects, you can now do with them whatever you like, for example sort or filter them:
$result | Sort-Object -Property Name | Where-Object Code -ne 'DE' | Out-GridView
Renaming Headers
If you don’t like the original header names, simply remove the first line of your input data and submit your own headers.
In real-world scenarios, console commands and cmdlets like Get-Content provide you with a stream of lines, so I converted $data into a string array of lines as well. Now it's simple to use Select-Object -Skip 1 to remove the header line and replace it with your own headers:
# raw tabular text data:
$data = @'
Name,Id,Code
Tobi,488,DE
Jeffrey,512,US
Rob,923,UK
Alex,726,SR
'@
# pretend the input data is a series of lines (that's what you
# get from Get-Content or command line tools):
$lines = $data.Split("`n")
# process each line, and run Trim() to remove remaining line feeds that
# were left over from splitting:
$result = $lines.Trim() |
    # remove first line
    Select-Object -Skip 1 |
    # convert and use custom headers:
    ConvertFrom-Csv -Header 'PowerSheller', 'Code', 'Country'
# emit results:
$result
That’s what the result looks like now:
PowerSheller Code Country
------------ ---- -------
Tobi         488  DE
Jeffrey      512  US
Rob          923  UK
Alex         726  SR
Using Alternative Delimiters
If your raw text uses a delimiter other than a comma, simply specify it. That's really all.
Let’s assume the text uses tabs as delimiters:
# raw tabular text data using TAB as delimiter:
$data = @"
Name`tId`tCode
Tobi`t488`tDE
Jeffrey`t512`tUS
Rob`t923`tUK
Alex`t726`tSR
"@
# convert plain text:
$result = $data | ConvertFrom-Csv -Delimiter "`t"
# emit results:
$result
As long as you specify the correct delimiter, the raw text is parsed like before.
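One caveat worth repeating: the delimiter must be a single character. If your raw text uses a multi-character separator, one possible workaround is to replace it with a single character first. A minimal sketch (the ' :: ' separator is made up for illustration):
# raw text using a multi-character separator:
$data = @'
Name :: Id :: Code
Tobi :: 488 :: DE
'@
# reduce the separator to a single comma, then convert as usual:
$data.Split("`n").Trim() -replace ' :: ',',' | ConvertFrom-Csv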
Using Fixed-Width Columns
One of the most challenging input formats is text with fixed-width columns. Typically, this calls for a custom parser that takes the column widths into account.
However, often you can get away with converting such input using csv format:
# raw tabular text data using fixed width columns:
$data = @"
Name      Id    Code
Tobi      488   DE
Jeffrey   512   US
Rob       923   UK
Alex      726   SR
"@
# replace 2 or more spaces with ONE comma:
$dataNew = $data -replace '[ ]{2,}',','
# convert plain text:
$result = $dataNew | ConvertFrom-Csv
Key here is a Regular Expression that identifies two or more spaces and replaces them with exactly one comma, essentially turning a fixed-width design into a csv design.
- [ ]: represents a space. Using \s instead would also match tabs and line breaks.
- {2,}: looks for 2 or more instances, effectively matching runs of two or more spaces
This simple approach works extremely well in many scenarios:
# take output from qprocess.exe and turn into objects
(qprocess) -replace '[ ]{2,}',',' | ConvertFrom-Csv | Out-GridView
Limitations
This approach does have important limitations that you need to consider, because it effectively treats two or more spaces as the new delimiter. Whenever columns are no longer separated by at least two spaces (maybe there is just one space, or none at all), or whenever the content of a column itself contains two or more spaces, the parser gets confused.
I deliberately chose two or more spaces instead of one or more to at least rule out conflicts with single spaces in column content.
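You can see the conflict for yourself: when column content contains two or more adjacent spaces, the replacement creates extra fields:
# the remark contains two adjacent spaces, so it is torn apart:
'Tobi  488  contains  spaces' -replace '[ ]{2,}',','
# output: Tobi,488,contains,spaces   (four fields instead of three)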
Using RegEx-Based Parsing
This article is about using the built-in csv parser, not about Regular Expressions. Regular Expressions is a different story that I’ll tell in a different article.
However, when your input data runs into conflicts like the ones described above, it's time to create your own parser logic. Regular Expressions can be very helpful here, so I'd like to illustrate how a Regular Expression pattern can parse fixed-width raw text input.
To make things harder, the raw text below fills the entire Name column with no delimiting spaces whatsoever (see Jeffrey), adds column content that uses one or two spaces (see the column Remark), and leaves out data for some datasets (see Rob and Alex) - a situation where you can no longer use the simple two-or-more-spaces delimiter approach:
# raw tabular text data using fixed width columns:
$data = @"
Name   Id    Remark
Tobi   488   contains spaces
Jeffrey512   contains two spaces
Rob    923
Alex   726
"@
# define header names and column widths as a regex pattern:
# Name: 7 chars
# Id: 6 chars
# Remark: 1 or more
$pattern = '^(?<Name>.{7}){0,1}(?<Id>.{6}){0,1}(?<Remark>.{1,}){0,1}$'
# split data into lines, and process:
$lines = $data.Split("`n")
$lines.Trim() |
    # skip header line
    Select-Object -Skip 1 |
    # parse each line
    ForEach-Object {
        if ($_ -match $pattern)
        {
            # remove full match, leaving only groups:
            $matches.Remove(0)
            # convert to object
            [PSCustomObject]$matches
        }
    } |
    # put properties in desired order
    Select-Object -Property Name, Id, Remark
When you run the code, the raw text is correctly turned into objects:
Name    Id     Remark
----    --     ------
Tobi    488    contains spaces
Jeffrey 512    contains two spaces
Rob     923
Alex    726
Key is this Regular Expression: '^(?<Name>.{7}){0,1}(?<Id>.{6}){0,1}(?<Remark>.{1,}){0,1}$'.
Essentially, it defines the headers: choose the name in angle brackets, and specify the column width in curly braces.
The quantifier {0,1} makes sure the pattern still matches when there are data sets with missing content. For example, the datasets Rob and Alex do not provide information for Remark.
The remainder of the code removes the original header line and then matches each line against the pattern. The result is a hashtable, which can easily be turned into an object. Since regular hashtables do not preserve the order of their keys, a final Select-Object can arrange the properties in any way you like.
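If you need such patterns more often, you can assemble them from a list of header names and column widths. The helper below, New-FixedWidthPattern, is a hypothetical sketch that simply automates the pattern shown above (a width of 0 stands for "the rest of the line"):
# hypothetical helper: build a fixed-width regex pattern from names and widths
function New-FixedWidthPattern
{
    param([string[]]$Name, [int[]]$Width)

    $parts = for ($i = 0; $i -lt $Name.Count; $i++)
    {
        if ($Width[$i] -gt 0)
        {
            # fixed-width column: exactly $Width[$i] characters
            '(?<{0}>.{{{1}}}){{0,1}}' -f $Name[$i], $Width[$i]
        }
        else
        {
            # last column: one or more characters up to the end of the line
            '(?<{0}>.{{1,}}){{0,1}}' -f $Name[$i]
        }
    }
    '^{0}$' -f ($parts -join '')
}

# reproduces the pattern used above:
New-FixedWidthPattern -Name Name, Id, Remark -Width 7, 6, 0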
Parsing Real World Csv Data
Let's apply all of this to some real-world examples. I'll be using the tool driverquery.exe to produce csv output. This is just for illustration; you can of course use any command that produces csv output, or read csv data from files.
Like many command line tools, driverquery.exe supports the parameter /FO (format output) and can output driver information in csv format:
driverquery /FO:CSV
The result looks similar to this:
"Module Name","Display Name","Driver Type","Link Date"
"1394ohci","1394 OHCI Compliant Host Controller","Kernel ",""
"3ware","3ware","Kernel ","19.05.2015 00:28:03"
"ACPI","Microsoft ACPI Driver","Kernel ",""
"AcpiDev","ACPI Devices driver","Kernel ",""
"acpiex","Microsoft ACPIEx Driver","Kernel ",""
"acpipagr","ACPI Processor Aggregator Driver","Kernel ",""
"AcpiPmi","ACPI Power Meter Driver","Kernel ",""
"acpitime","ACPI Wake Alarm Driver","Kernel ",""
"Acx01000","Acx01000","Kernel ",""
...
Take a close look at the raw data and watch out for caveats!
The values in the column "Driver Type" apparently carry one trailing space. You may have to account for this when you later filter values.
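You can verify the trailing space with a quick comparison; if the raw values really end in a space, the first filter returns nothing:
$drivers = driverquery /FO:CSV | ConvertFrom-Csv
# returns nothing when the raw value is really 'Kernel ' (trailing space):
$drivers | Where-Object 'Driver Type' -eq 'Kernel'
# matches once the trailing space is taken into account:
$drivers | Where-Object 'Driver Type' -eq 'Kernel '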
Converting Csv To Objects
To turn the raw csv output into useful objects, run it through ConvertFrom-Csv:
driverquery /FO:CSV | ConvertFrom-Csv
To read csv data directly from a file, use Import-Csv instead.
The results are true objects that you can process like any other object:
Module Name Display Name                        Driver Type Link Date
----------- ------------                        ----------- ---------
1394ohci    1394 OHCI Compliant Host Controller Kernel
3ware       3ware                               Kernel      19.05.2015 00:28:03
ACPI        Microsoft ACPI Driver               Kernel
AcpiDev     ACPI Devices driver                 Kernel
acpiex      Microsoft ACPIEx Driver             Kernel
acpipagr    ACPI Processor Aggregator Driver    Kernel
AcpiPmi     ACPI Power Meter Driver             Kernel
acpitime    ACPI Wake Alarm Driver              Kernel
Acx01000    Acx01000                            Kernel
ADP80XX     ADP80XX                             Kernel      09.04.2015 22:49:48
...
Polishing Raw Data
As you may have discovered earlier by looking at the raw data, driverquery.exe appends a space to the driver types. These trailing spaces can make it harder later when you want to filter values.
If you like, you can easily clean up any issues you identified, though:
# remove all leading and trailing spaces from property 'Driver Type':
driverquery /FO:CSV | ConvertFrom-Csv |
    ForEach-Object { $_.'Driver Type' = $_.'Driver Type'.Trim(); $_ }
When you use ForEach-Object to polish individual objects, always make sure you return the changed object. That's why the code emits $_ at the end of the scriptblock.
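If you prefer not to modify the objects in place, a calculated property on Select-Object achieves the same cleanup; a minimal sketch:
# same cleanup via a calculated property:
driverquery /FO:CSV | ConvertFrom-Csv |
    Select-Object -Property 'Module Name', 'Display Name',
        @{ Name = 'Driver Type'; Expression = { $_.'Driver Type'.Trim() } },
        'Link Date'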
Filtering And Sorting Results
Once ConvertFrom-Csv has converted the csv text to objects, you can use all the common pipeline cmdlets to filter, sort, and select the information you need.
To get a sorted list of filesystem drivers, simply use the typical pipeline cmdlets:
# produce sorted list of file system drivers:
driverquery /FO:CSV |
    ConvertFrom-Csv |
    ForEach-Object { $_.'Driver Type' = $_.'Driver Type'.Trim(); $_ } |
    Where-Object 'Driver Type' -eq 'File System' |
    Sort-Object -Property 'Display Name' |
    Select-Object -Property 'Display Name', 'Module Name', 'Link Date' |
    Out-GridView
Renaming Properties
Since you are starting with well-formatted tabular raw text data, you don’t need to live with the pre-defined property names. After all, these are defined in the first row of the csv file. By replacing this first line before converting the text, you can rename properties to anything you like.
This example illustrates how you remove the first line and replace it with your own list of column headers:
# rename column headers to custom list
$newHeaders = 'Module','Name','Type','Date'
# replace column headers:
driverquery /FO:CSV | Select-Object -Skip 1 | ConvertFrom-Csv -Header $newHeaders
Select-Object -Skip 1 removes the first line. ConvertFrom-Csv (and Import-Csv) support the parameter -Header that lets you define your own headers.
If you come across csv data with no headers, you can use -Header to easily add headers.
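For example, header-less csv data gets its property names added on the fly:
# csv data without a header row:
@'
Tobi,488,DE
Jeffrey,512,US
'@ | ConvertFrom-Csv -Header 'Name', 'Id', 'Code'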
Creating System Reports
The strategies outlined above can be applied to any kind of csv information. For example, the tool systeminfo.exe collects useful information about your computer system and by default outputs it as plain text.
Simply ask systeminfo.exe to output the data in csv format instead, and apply the tricks above to get real objects:
systeminfo.exe /FO:CSV | ConvertFrom-Csv | Out-GridView
When you run this line from inside the ISE editor, you may run into encoding problems because the ISE editor uses a hidden console to run command line tools. This issue is not present when you run the command in a real PowerShell console.
Standardizing Column Names
The default column names produced by systeminfo.exe are problematic for two reasons:
- Spaces: The column names frequently contain spaces, which makes it harder to access these properties later because you have to quote the property names.
- Localization: The output of command line tools is typically localized, so when you run the tool on a US system, both column names and (some) content are in English, whereas on German systems they would be, you guessed it, in German. While this may be tolerable for the actual content, column (property) names should always be identical. Otherwise, when you collect information from systems with different locales, PowerShell wouldn't know how to create tables from it.
That’s why you should use the simple trick from above and standardize the column names. First, extract the original column names like this:
systeminfo.exe /FO:CSV | Select-Object -First 1 | Set-ClipBoard
Now, the first line of the csv output is available in your clipboard. Next, paste this line into a script, edit the column names in any way you like, and use the edited list of column names in place of the original ones.
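If you'd rather skip the clipboard, here is a minimal sketch that extracts the original header names as a string array in code (assuming no header name itself contains a comma):
# grab the first line and split it into the original header names:
$firstLine = (systeminfo.exe /FO:CSV)[0]
$originalHeaders = $firstLine.Split(',').Trim('"')
$originalHeaders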
You can, for example, translate property names and remove any special characters like spaces or parentheses:
# list of edited properties:
$newHeader = "HostName","OSName","OSVersion","OSManufacturer","OSConfiguration","OSBuildType","RegisteredOwner","RegisteredOrganization","ProductID","OriginalInstallDate","SystemBootTime","SystemManufacturer","SystemModel","SystemType","Processor","BIOSVersion","WindowsDirectory","SystemDirectory","BootDevice","SystemLocale","Input Locale","TimeZone","TotalPhysicalMemory","AvailablePhysicalMemory","VirtualMemoryMaxSize","VirtualMemoryAvailable","VirtualMemoryInUse","PageFileLocation","Domain","LogonServer","Hotfix","NetworkCard","HyperVRequirements"
# get system info and convert to objects with standardized properties:
$info = systeminfo.exe /FO:CSV | Select-Object -Skip 1 | ConvertFrom-Csv -Header $newHeader
# output selected information:
$info.TimeZone
$info.TotalPhysicalMemory
$info.OSVersion
$info | Out-GridView
Now the information produced by the code can be accessed via the standardized property names, regardless of the locale the computer uses.
Get-MySystemInfo
While we're at it, as a last step let's bake all of this into a brand-new PowerShell function called Get-MySystemInfo and add support for remoting, so you can retrieve the information locally and remotely, and even in parallel:
function Get-MySystemInfo
{
    param
    (
        # one or more computer names. When omitted, the local computer is used.
        # Remote machines MUST have remoting enabled, and you need appropriate permissions
        [string[]]
        $ComputerName,

        [PSCredential]
        $Credential
    )

    # code to execute locally or remotely:
    $code = {
        # list of edited properties:
        $newHeader = "HostName","OSName","OSVersion","OSManufacturer","OSConfiguration","OSBuildType","RegisteredOwner","RegisteredOrganization","ProductID","OriginalInstallDate","SystemBootTime","SystemManufacturer","SystemModel","SystemType","Processor","BIOSVersion","WindowsDirectory","SystemDirectory","BootDevice","SystemLocale","InputLocale","TimeZone","TotalPhysicalMemory","AvailablePhysicalMemory","VirtualMemoryMaxSize","VirtualMemoryAvailable","VirtualMemoryInUse","PageFileLocation","Domain","LogonServer","Hotfix","NetworkCard","HyperVRequirements"
        # get system info and convert to objects with standardized properties:
        systeminfo.exe /FO:CSV | Select-Object -Skip 1 | ConvertFrom-Csv -Header $newHeader
    }

    # invoke the code locally or remotely, and use splatting to forward parameters only when
    # the user actually specified them
    # (this way you can run the code locally w/o administrator privileges and remoting prerequisites):
    Invoke-Command -ScriptBlock $code @PSBoundParameters
}
To get the local system information, use this:
Get-MySystemInfo
To query one or more remote systems, try this:
Get-MySystemInfo -ComputerName 'server1', 'server2', 'server3'
And to authenticate with a different user account, use the parameter -Credential.
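For example:
# prompt for alternate credentials, then query a remote system:
$cred = Get-Credential
Get-MySystemInfo -ComputerName 'server1' -Credential $cred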
Remoting automatically adds the property PSComputerName to your results so when you query more than one system, you can always identify the machine that emitted the data.
The function uses the default PowerShell remoting. In order to access remote systems, these must have remoting turned on, and you need to have the proper permissions to access them.
Since remoting works in parallel, you can query an entire list of servers and get back the results almost instantaneously.
Servers by default have remoting enabled, clients do not. The default authentication method is Kerberos so you need to specify the computer names, not their IP addresses. If you must use IP addresses and/or access systems outside a domain, make sure you enable NTLM authentication on the client side.
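As a hedged sketch of these prerequisites: Enable-PSRemoting turns remoting on (run it elevated on the target machine), and adding an entry to the client's TrustedHosts list is the usual way to permit NTLM connections to IP addresses or non-domain systems (the IP below is a placeholder):
# on the target machine (elevated): turn PowerShell remoting on
Enable-PSRemoting -Force

# on the client (elevated): trust a specific address so NTLM can be used
# (placeholder IP - use your own; '*' would trust everything and is risky)
Set-Item -Path WSMan:\localhost\Client\TrustedHosts -Value '10.0.0.5' -Force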