(once you embrace it, you’ll save tons of time)
With so many talented coders delivering new applications and enhancing software every day, regular expressions (regexes) fly way under the radar. That may be because regexes are perceived as tricky to use, or because writing code is the default preference simply for being more common, even when it is not the best option. However, by mastering the basics, you can quickly begin to automate tasks that find, parse, split, or extract the precise information you need from text, files, or data. In this blog, we’ll review what regexes are, how to write them, why they provide value, and a few use cases with examples.
Regexes Demystified
Put simply, a regex is a written pattern of characters and metacharacters composed to match sequences of characters (referred to as strings) within text, files, or data. The majority of these patterns use normal ASCII, which includes letters, digits, punctuation, and other symbols. We’ve all searched file folders for a document using a search term to locate the file we need. Let’s say we know it’s an Excel spreadsheet, so we enter .xlsx in the search bar. If we apply that same thought process when using a regex, the pattern would be «.*\.xlsx», where .* matches any run of characters and \. matches a literal period. Regexes can increase in length and complexity as you add tokens to zero in on the desired information. More on that in the examples below.
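To make the spreadsheet search concrete, here is a minimal sketch using Python’s built-in re module. The file names are invented for illustration, and the use of fullmatch (so the whole name has to fit the pattern) is our own assumption about how you would want the search to behave.

```python
import re

# Hypothetical file names, invented for this illustration.
filenames = ["budget.xlsx", "notes.txt", "q1 report.xlsx", "xlsx_readme.md"]

# .* matches any run of characters; \. matches a literal period.
pattern = re.compile(r".*\.xlsx")

# fullmatch succeeds only when the entire name fits the pattern,
# so the name must end in ".xlsx".
spreadsheets = [name for name in filenames if pattern.fullmatch(name)]
print(spreadsheets)
```

Only the two names ending in .xlsx are kept; xlsx_readme.md is skipped because the pattern requires the literal ".xlsx" at the end of the name.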
These intricate searches for exact information are driven by a regex engine (e.g., Perl, PCRE, .NET, JavaScript, Python), which matches the pattern to a given string within your targeted text, file, or data. No matter which engine type you are using, it’s important to regularly check for updates or new features to ensure you are maximizing the benefits.
So, What’s the Point?
The purpose of using regexes can vary, but the benefit is constant: they increase efficiency and save valuable time. A simple regex can eliminate the need to write line after line of code that would produce the same result. It can also replace the need for a skilled employee to scour emails and documents looking for specific information. Not only does this practice automate the process of sifting through massive amounts of text, but setting it up requires only a single regex. It’s a small investment of time up front that yields recurring benefits.
The main use cases for regexes are as follows:
- Finding specific text within a larger body of text
- Validating that a string conforms to a specific format
- Replacing text or inserting text at matched positions
- Splitting strings at matches to create lists, arrays, etc.
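As a rough sketch, each of the four use cases above maps onto one function in Python’s re module. The sample text and the ID format (a capital letter, a hyphen, and four digits) are made up purely for illustration.

```python
import re

# Made-up sample text and ID format, for illustration only.
text = "Order A-1001 shipped; order B-2002 pending."
id_pattern = r"[A-Z]-\d{4}"

# 1. Find: collect every substring that fits the ID pattern.
found = re.findall(id_pattern, text)

# 2. Validate: fullmatch returns None unless the entire string fits.
is_valid = re.fullmatch(id_pattern, "A-1001") is not None
is_invalid = re.fullmatch(id_pattern, "A-10011") is not None

# 3. Replace: swap each matched ID for a placeholder.
masked = re.sub(id_pattern, "<ID>", text)

# 4. Split: break a string wherever a comma or semicolon appears,
#    swallowing any spaces that follow the separator.
parts = re.split(r"[;,]\s*", "red, green;blue")

print(found, is_valid, is_invalid, masked, parts, sep="\n")
```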
In this blog, we’ll focus on the first use case of finding desired text within a larger body of text, using email examples. The examples are based on the most commonly asked questions we receive from our customers when using regex.
Example: Pulling out Only the Body of an Email
In this scenario, the customer wanted to scan an email with a regex to strip out undesired information, such as a greeting and signature, only capturing the body of the email. In creating the regex, we need to incorporate the variations of a signature and greeting/header in order to indicate what information is not wanted. Then, we instruct the regex to pull everything else out of the email between the greeting and signature. To illustrate this scenario, we’ll use the following sample email:
———
Subject: HELLO!
DATE: 04/25/18
TO: me@email.com
FROM: you@email.com
Dear Sir/Madam,
My name is John Smith. I was hoping you could assist me in this matter. I have an account with your company and I’d like to have my product serviced. Please contact me to schedule an appointment at your earliest convenience.
Thanks,
John Smith
Vice President
XYZ Company
Phone, Email
———-
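Since we only want to capture the information between “Dear Sir/Madam,” and “Thanks,” the regex would look like this: (?s)(?:Dear Sir\/Madam,\s*)(.*?)\s*(?:Thanks). The (?s) flag lets the dot match line breaks, so the middle group (.*?) captures everything between the greeting and the sign-off, even when the body spans several lines.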
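Here is a minimal sketch of this extraction in Python, run against a shortened copy of the sample email. Two details are our own adjustments for the sketch rather than requirements: the (?s) flag sits at the very start of the pattern (recent Python versions reject an inline flag placed mid-pattern), and the capture uses a lazy .*? so it stops at the first “Thanks” it encounters.

```python
import re

# Shortened copy of the sample email from this post.
email = """Subject: HELLO!
DATE: 04/25/18
TO: me@email.com
FROM: you@email.com
Dear Sir/Madam,
My name is John Smith. I was hoping you could assist me in this matter.
Thanks,
John Smith
Vice President
"""

# (?s) lets "." match newlines; (.*?) lazily captures the body.
body_pattern = re.compile(r"(?s)Dear Sir/Madam,\s*(.*?)\s*Thanks")

match = body_pattern.search(email)
body = match.group(1) if match else None
print(body)
```

One caveat with this approach: if the word “Thanks” appears inside the body itself, the lazy capture stops there, so a production pattern would need to account for the greeting and signature variations your emails actually contain.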
Example: Locating a Start Time
Let’s say you’re looking for every email that contains a start time that you know will be in the format of 00:00 AM/PM. The regex will look for anything in the email that might match, such as 1:00 PM or 11:45 am. So, the regex string would look like the following:
\W(\d{1,2}\:\d{2})\s*(AM|PM|am|pm)\W
In using this regex, the result will be emails that contain “1:00 PM” and “11:45 am.” Now, the fun part. By creating one regex, every email that contains these start times will automatically be located and identified. You create one regex and it yields the desired result – continuously. Now, that’s a clear and demonstrable return on investment.
Let’s break this regex down token by token:
\W – Matches one non-word character (anything other than a letter, digit, or underscore), such as the space just before the time
( – Represents the beginning of the first grouping
\d{1,2} – Indicates that you’re looking for either 1 or 2 digits (e.g., 1 or 11, in our time example)
\: – Means you’re looking for a literal colon in the targeted text
\d{2} – Instructs it to find 2 digits
) – Ends the first grouping
\s* – Looks for any amount of whitespace, from 0 characters upward. This means you’re taking into account that your times might not be perfectly formatted and could look like “11:00am” or “11:45 PM”
(AM|PM|am|pm) – Instructs it to find “AM” OR “PM” OR “am” OR “pm.” If you wanted to be even more thorough, you could include “Am|Pm”
\W – Matches one more non-word character, such as the space or punctuation right after the time
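The tokens above can be tried out with a short Python sketch. The sample sentence is invented for illustration, and the pattern is written without a space before the final \W so that punctuation directly after “PM” or “am” still counts as the closing non-word character.

```python
import re

# Invented sample text containing two start times.
text = "The kickoff is at 1:00 PM, and the review starts at 11:45 am."

time_pattern = re.compile(r"\W(\d{1,2}:\d{2})\s*(AM|PM|am|pm)\W")

# With two groups in the pattern, findall returns (time, meridiem) pairs.
times = time_pattern.findall(text)
print(times)

# Because \W must consume a real character on each side, a time that
# sits at the very start or end of the text is not matched.
edge_case = time_pattern.findall("9:30 AM")
print(edge_case)
```

If times at the very beginning or end of an email matter to you, swapping each \W for the word-boundary token \b is one possible alternative, since \b does not need a character to consume.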
Example: Looking for Customer Account Numbers
If you need to locate emails that contain your customers’ account numbers, you know you’re searching for a pattern that may contain both letters and numbers, and that will be 10 characters long (e.g., 1234567890, aBcDeFgHiZ, or 012cdeFG00). The regex would look like this: (\d{10}|(?i)\w{10}). The parentheses at either end represent one group of characters that appear together.
Again, by creating one regex, you can automatically search any number of emails for 10-character combinations, producing only the emails that contain a candidate customer account number.
Following is the breakdown of the individual tokens in the regex string:
\d{10} – Instructs it to look for 10 digits
| – Signifies the alternation of the tokens on either side
(?i) – Turns on case-insensitive matching for the remainder of the pattern
\w{10} – Instructs the search to match 10 word characters, meaning any mix of letters, digits, or underscores. Because \w already covers digits and both upper- and lowercase letters, this token on its own matches all three sample account numbers
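Below is a hedged Python sketch of the account number search. It departs from the pattern above in two ways, both our own assumptions: it keeps only \w{10}, since \w already matches digits and letters of either case (and recent Python versions reject an inline (?i) placed mid-pattern), and it adds \b word boundaries so that the first 10 characters of a longer word are not reported as an account number. The sample sentence is invented.

```python
import re

# Invented sample text using account numbers from the examples above.
text = "Please update account 012cdeFG00 and close 1234567890 today."

# \b marks a word boundary, so only runs of exactly 10 word
# characters are reported.
account_pattern = re.compile(r"\b\w{10}\b")
accounts = account_pattern.findall(text)
print(accounts)

# Without the boundaries, the start of a longer word also matches.
partial = re.findall(r"\w{10}", "information")
print(partial)
```

Note that any ordinary 10-letter word would still match, so in practice you would likely tighten the pattern to whatever format your account numbers actually follow.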
Why Wait to Try It?
Regexes are a powerful tool to quickly and accurately find, parse, and extract key information from massive amounts of text, without consuming valuable human capital. It’s like finding a needle in a haystack – automatically. Once you are more familiar with using regexes, you can create them with more intricacy, expanding their uses and capabilities. It’s a small investment of time that gives you back talent, energy, and resources to apply to bigger and better things.
To see how regexes are being used in email handlers, visit https://www.forty8fiftylabs.com/products/smarthandler/