Regular Expressions can describe almost any text-based pattern, but unfortunately they are black magic to many and look like Klingon monsters.
That’s because they were invented a long time ago in a text-based world. When you try to verify type-based expressions, they quickly become ugly and complex.
However, there are a couple of clever strategies that help you compose complex Regular Expression patterns in no time. In fact, when you follow these strategies, you can quickly start to write hugely complex Regular Expressions without being a RegEx guru at all.
Let’s look at this step by step, with a real-world example. Let’s assume you need to make sure that user input is in fact a valid IPv4 address. How would you do this?
RegEx Is Character-Based
If you are tempted to use the almighty Regular Expressions to tackle this problem, there’s one big hurdle to clear:
Regular Expression patterns are strictly character-based, so the pattern checks one character after another. Sure, you can use quantifiers to define how often a character may repeat. This, for example, defines a digit (\d) that must occur at least once and at most three times (a number of one to three digits):
\d{1,3}
But this matches every number with one to three digits, including values such as 999, so on its own it is useless for describing a valid IP address. There is no way for you to use simple type-based entities such as bytes: Regular Expressions break down and describe a byte purely character by character and don’t know about constructs that consist of more than one character (like a byte).
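A quick check at the prompt shows the problem. This is only a sketch with made-up sample values; the ^ and $ anchors (explained further below) force the pattern to cover the whole text:

```powershell
# \d{1,3} only counts digits; it has no idea that a byte ends at 255
$threeDigits = '^\d{1,3}$'

$validByte   = '255' -match $threeDigits   # True, and 255 is a valid byte
$invalidByte = '999' -match $threeDigits   # True as well, although 999 is not a byte

"255 accepted: $validByte"
"999 accepted: $invalidByte"
```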
Type-Based Patterns Are Monsters
When you try to match type-based expressions, the character-based nature of Regular Expressions easily creates hugely insane monsters. This is a Regular Expression that matches IPv4 addresses:
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Once you know this magic spell, you can easily use it to validate IPv4 addresses in PowerShell, for example:
$ipv4 = '^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
do
{
    $ip = Read-Host 'Enter IPv4'
} while ($ip -notmatch $ipv4)
"Hurray, $ip is a valid IPv4 address!"
And of course there is nothing wrong with googling for regular expressions and using them. Then again, you know from googled code how dangerous and inflexible it is when you don’t fully understand such code. It may do what you want at first, but in the long run it may turn out to behave quite differently than you assumed.
So it is a good idea for any professional PowerShell scripter to understand code before running or publishing it. Let’s get a grip on Regular Expressions for good.
Using Types In RegEx Patterns
How much easier would Regular Expressions be if only they were modern and supported type-based patterns! If there were a placeholder for bytes, for example \BYTE, the monster pattern from above would shrink to pet size:
^\BYTE\.\BYTE\.\BYTE\.\BYTE$
^ and $ mark the beginning and end of a text. If you omit these, the pattern would still work but would also accept longer texts that merely contain an IPv4 address.
Coincidentally, PowerShell can add exactly this simple and effective type-awareness to Regular Expressions. In fact, the ancient Regular Expression engine doesn’t even need to know about it. It is you who can use a different strategy to design your Regular Expression patterns.
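The effect of the anchors is easy to see with a small experiment. Since \BYTE does not exist, this sketch borrows the byte pattern from the monster expression above; the variable names and sample texts are made up for illustration:

```powershell
# byte pattern taken from the IPv4 expression shown earlier
$byte = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# four bytes separated by literal dots, without and with anchors
$unanchored = ($byte, $byte, $byte, $byte) -join '\.'
$anchored   = '^' + $unanchored + '$'

'My address is 10.0.0.1 today' -match $unanchored   # True: the text contains an address
'My address is 10.0.0.1 today' -match $anchored     # False: the text is more than an address

'999.1.1.1' -match $unanchored   # True: the engine finds "99.1.1.1" inside the text
'999.1.1.1' -match $anchored     # False: the whole text must be a valid address
```

For input validation, the anchored version is the one that behaves as expected.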
Defining A “Byte”
Simply break down patterns into logical blocks. To define a byte, assign the pattern for it to a variable:
$byte = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
(?: ) is a non-capturing group: it combines all elements within the parentheses but does not return the group result as a separate match. In other words, it provides purely logical grouping without affecting the returned results. Here, the grouping also keeps the alternatives separated by | together, so anything that follows the group applies to the byte as a whole.
And so that you never again have to wonder what a “?” in this pattern does, you could assign this part of the pattern to a “speaking” variable as well:
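The difference between a regular group and (?: ) shows up in the automatic $Matches variable. The following is a small sketch; the variable names and the sample text '192.' are chosen for illustration only:

```powershell
# the same byte pattern, once in a capturing and once in a non-capturing group
$capturing    = '^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.$'
$nonCapturing = '^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.$'

$null = '192.' -match $capturing
$capturingCount = $Matches.Count        # 2: the whole match plus the group

$null = '192.' -match $nonCapturing
$nonCapturingCount = $Matches.Count     # 1: the whole match only

"Capturing group:     $capturingCount entries in `$Matches"
"Non-capturing group: $nonCapturingCount entries in `$Matches"
```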
$ZeroOrOneTime = '?'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"
I am not saying this turns Regular Expressions into poetry but this way you can focus on one fundamental building block at a time, and once a building block is done and works as expected, you can forget about its inner workings and reuse these blocks to compose patterns. It’s really like writing PowerShell functions to reuse code. Here, you reuse patterns.
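Before you forget about the inner workings of a building block, it is worth checking that it really "works as expected". One way to do that for the byte block is a brute-force test over all values; the variables $byteOnly, $rejected, and $accepted are made up for this sketch:

```powershell
$ZeroOrOneTime = '?'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"

# anchor the block so the whole text has to be a byte
$byteOnly = '^' + $byte + '$'

# every value from 0 to 255 should match...
$rejected = @(0..255 | Where-Object { "$_" -notmatch $byteOnly })

# ...and values beyond that range should not
$accepted = @(256, 300, 999, 1000 | Where-Object { "$_" -match $byteOnly })

"Valid bytes wrongly rejected: $($rejected.Count)"
"Invalid values wrongly accepted: $($accepted.Count)"
```

Both counts should be 0. If they are, the block can be reused without looking inside again.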
Composing Regular Expressions
With your building blocks, you can now compose complex patterns that are more manageable. This would create a pattern for IPv4 addresses (without the ^ and $ anchors for now, so it also accepts longer text that merely contains an address):
# building blocks:
$ZeroOrOneTime = '?'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"
$dot = '\.'
# composing a pattern (IPv4 address):
$IPv4 = "$byte$dot$byte$dot$byte$dot$byte"
# using the pattern:
do
{
    $ip = Read-Host 'Enter IPv4'
} while ($ip -notmatch $IPv4)
# viewing the pattern in use:
"You just used this pattern: $IPv4"
So essentially, your pattern…
$byte$dot$byte$dot$byte$dot$byte
…turned into this Regular Expression:
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
You Decide: Building Your RegEx Library
You have seen how much easier and more manageable Regular Expressions can become when you componentize them. I recommend you start building your own Regular Expression library, and whenever you come across something new that is not yet covered by your RegEx library, add it to your library as a new building block first.
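One possible shape for such a library is a single hashtable that holds all building blocks. The name $RegExLib and its keys are hypothetical and not part of the examples in this article; plain variables work just as well:

```powershell
# hypothetical library: one hashtable instead of many loose variables
$RegExLib = @{
    StartOfText = '^'
    EndOfText   = '$'
    Dot         = '\.'
    Byte        = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
}

# a new building block, composed from existing ones and stored for reuse:
# four bytes joined by literal dots
$RegExLib.IPv4 = (@($RegExLib.Byte) * 4) -join $RegExLib.Dot

# composing a pattern that must cover the whole text
$wholeText = $RegExLib.StartOfText + $RegExLib.IPv4 + $RegExLib.EndOfText

'10.0.0.1' -match $wholeText     # True
'10.0.0.256' -match $wholeText   # False
```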
Whether you compose simple patterns from just a few building blocks or add more Regular Expression features like quantifiers is completely up to you.
Here is my final example to check IPv4 addresses:
# building blocks:
$ZeroOrOneTime = '?'
$3Times = '{3}'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"
$dot = '\.'
$StartOfText = '^'
$EndOfText = '$'
# composing a pattern (IPv4 address):
$IPv4 = "$StartOfText($byte$dot)$3Times$byte$EndOfText"
# using the pattern:
do
{
    $ip = Read-Host 'Enter IPv4'
} while ($ip -notmatch $IPv4)
# viewing the pattern in use:
"You just used this pattern: $IPv4"
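To try the final pattern without typing addresses into Read-Host, a short non-interactive check can run it against a few sample strings. The sample values are made up, and the building blocks are repeated so the snippet runs on its own:

```powershell
# building blocks (same as above):
$ZeroOrOneTime = '?'
$3Times = '{3}'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"
$dot = '\.'
$StartOfText = '^'
$EndOfText = '$'
$IPv4 = "$StartOfText($byte$dot)$3Times$byte$EndOfText"

# sample input: the first three are valid, the rest are not
$samples = '192.168.0.1', '255.255.255.255', '0.0.0.0', '256.1.1.1', '1.2.3', '1.2.3.4.5', 'abc'

$results = foreach ($sample in $samples)
{
    [PSCustomObject]@{
        Text   = $sample
        IsIPv4 = $sample -match $IPv4
    }
}
$results
```

Only the first three samples should be reported as valid: 256 is out of range for a byte, and the anchors reject addresses with too few or too many parts.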
Composition Only: Hide Your Library
If you like, you could even use your Regular Expression building blocks solely for private use. In your scripts that go out to customers, you paste the resulting Regular Expression, and your status as Regular Expression King remains untouched.
So this is what you do in your lab:
# calculate the regex pattern:
# building blocks:
$ZeroOrOneTime = '?'
$3Times = '{3}'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"
$dot = '\.'
$StartOfText = '^'
$EndOfText = '$'
# composing a pattern (IPv4 address):
$IPv4 = "$StartOfText($byte$dot)$3Times$byte$EndOfText"
# copy result into your production script:
$IPv4 | Set-Clipboard
And this is what the production code looks like that your customers get:
# paste your pattern here:
$pattern = '^((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
do
{
    $ip = Read-Host 'Enter IPv4'
} while ($ip -notmatch $pattern)
"Entered and validated IPv4: $ip"
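When the lab and the production script live apart, the pasted literal can silently drift from the building blocks. A simple safeguard, sketched here under the assumption that you keep both around in your lab, is to compare the composed pattern with the pasted one ($composed, $pasted, and $inSync are made-up names):

```powershell
# building blocks from the lab:
$ZeroOrOneTime = '?'
$3Times = '{3}'
$byte = "(?:25[0-5]|2[0-4][0-9]|[01]$ZeroOrOneTime[0-9][0-9]$ZeroOrOneTime)"
$dot = '\.'
$StartOfText = '^'
$EndOfText = '$'
$composed = "$StartOfText($byte$dot)$3Times$byte$EndOfText"

# literal pattern as pasted into the production script:
$pasted = '^((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'

# case-sensitive comparison of the two pattern strings
$inSync = $composed -ceq $pasted
"Production pattern matches the lab pattern: $inSync"
```

If the comparison yields False, the literal in the production script no longer reflects the building blocks and should be regenerated.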