What is a regular expression?

Regular Expression, commonly referred to as RegEx, is a formal language used to match and extract any required information from a body of data. The language is made up of symbols and characters that, when assembled properly, define the patterns you are looking to match or extract.

To a non-technical SEO, RegEx can look massively intimidating and admittedly, the combination of a regular expression’s symbols can look like utter nonsense. However, the reality is actually quite simple. It’s just a case of breaking down a regular expression granularly; examining each character and symbol independently before sussing out how it all fits together and what it’s trying to do.

This guide looks at what’s involved in a regular expression pattern: what the characters and symbols actually mean and how they can be applied to the day-to-day practices of an SEO.

What do the RegEx symbols mean?

In its simplest form, regular expression can be broken down into characters and metacharacters.

Characters take the literal meaning of a character, whether that’s a letter, number, punctuation mark or any other general symbol from typography. Metacharacters, however, are not to be taken literally. Metacharacters are the key components of regular expression, with each having its own purpose and meaning. Essentially, metacharacters are what make regular expression, regular expression.

Here’s a breakdown of each metacharacter and what it means in the context of RegEx:

 

\ (Backslash)

A backslash tells the expression to treat the metacharacter that follows it as a literal character. You’ll undoubtedly use the backslash in many regular expressions relevant to SEO, as several metacharacters also have their literal place in URLs, e.g. dots within root domains and question marks within query strings. To ensure a metacharacter is read as a literal character, place a backslash before it.
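As a minimal sketch using Python’s built-in re module (the URLs below are made up for illustration), the difference between an escaped and an unescaped dot looks like this:

```python
import re

# Hypothetical URLs for illustration only.
urls = ["https://example.com/page", "https://exampleXcom/page"]

unescaped = re.compile(r"example.com")   # the dot matches any character
escaped = re.compile(r"example\.com")    # the dot matches only a literal "."

unescaped_hits = [u for u in urls if unescaped.search(u)]
escaped_hits = [u for u in urls if escaped.search(u)]

print(unescaped_hits)  # both URLs match
print(escaped_hits)    # only the URL with a real dot matches
```

Without the backslash, the pattern quietly matches strings you did not intend, which is why escaping matters for domains and query strings.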

 

Delimiters – / (Forward Slash), # (Hash), ~ (Tilde)

In addition to metacharacters, we also need to be mindful of delimiters within regular expression. These are non-alphanumeric, non-backslash, non-whitespace characters. A delimiter is best described as a character used to separate two or more fields of data (think of a CSV export that uses commas to separate its fields); in some tools and languages, a delimiter marks where the pattern itself starts and ends. When a delimiter character appears inside the pattern, it also needs to be escaped using the backslash.

 

^ (Caret)

A caret anchors the match to the start of the string: whatever is placed after it must appear at the very beginning. So if you’re looking for all pages contained within a specific folder of your site, i.e. folder/, you place a caret before it: ^folder\/
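A quick way to check the caret’s behaviour is to run the pattern over a few sample paths. This sketch uses Python’s re module, and the paths are hypothetical:

```python
import re

# Hypothetical URL paths, written without a leading slash.
paths = ["folder/page-a", "blog/folder/page-b", "folder-archive/page-c"]

# The caret requires "folder/" to sit at the very start of the string.
pattern = re.compile(r"^folder\/")
starts_with_folder = [p for p in paths if pattern.search(p)]

print(starts_with_folder)  # ['folder/page-a']
```

Python does not need the forward slash escaped, but the backslash is harmless and keeps the pattern identical to the one above.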

 

$ (Dollar Sign)

The dollar metacharacter can be interpreted as the opposite of a caret. It anchors the match to the end of the string, so nothing may follow the pattern. For example, \/folder-b$ would match folder-a/folder-b but not folder-a/folder-b/page-a.
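The folder-b example can be tried out with a short Python sketch (the two paths are the ones used in the example):

```python
import re

paths = ["folder-a/folder-b", "folder-a/folder-b/page-a"]

# The dollar sign requires "/folder-b" to be the last thing in the string.
pattern = re.compile(r"\/folder-b$")
ends_with_folder_b = [p for p in paths if pattern.search(p)]

print(ends_with_folder_b)  # ['folder-a/folder-b']
```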

 

| (Pipe)

No doubt all SEOs are familiar with pipes but when it comes to RegEx, a pipe command means “or”. For example, Impression has been referred to as “Impression” and “Impression Digital”. In the language of regular expression, this can be referred to as Impression Digital|Impression.
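As an illustrative sketch, Python’s re.fullmatch can show which strings satisfy one side of the pipe or the other. The third string is a made-up non-match:

```python
import re

# "Expression" is included as a deliberate non-match.
mentions = ["Impression", "Impression Digital", "Expression"]

# Either alternative on its own is enough for a match.
pattern = re.compile(r"Impression Digital|Impression")
brand_mentions = [m for m in mentions if pattern.fullmatch(m)]

print(brand_mentions)  # ['Impression', 'Impression Digital']
```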

 

? (Question Mark)

The question mark makes the single character that precedes it optional, so the pattern matches whether or not that character is present. Think of URLs that you want to match both with and without the trailing slash; for example, folder\/? would capture both folder/ and folder. To make more than one character optional, parentheses need to be used in conjunction with the question mark (see Parenthesis).
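A small Python sketch of the trailing-slash case follows. The caret and dollar sign are added here as an assumption, so that only the whole path is tested; the paths are hypothetical:

```python
import re

paths = ["folder", "folder/", "folder//", "folders"]

# The slash is optional: zero or one occurrence is allowed, no more.
pattern = re.compile(r"^folder\/?$")
optional_slash = [p for p in paths if pattern.search(p)]

print(optional_slash)  # ['folder', 'folder/']
```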

 

+ (plus)

Placed after a character, the plus symbol matches that character one or more times. Contextually, this would be best used to capture URLs where trailing slashes are not configured properly and can endlessly repeat, i.e. folder\/+ would capture that folder with a single trailing slash or with any number of repeated trailing slashes.
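To illustrate, here is a rough Python sketch that first finds paths with one or more trailing slashes and then collapses the repeats. The paths are hypothetical and the anchors are added so the whole path is tested:

```python
import re

# Hypothetical paths; "folder" stands in for any directory name.
paths = ["folder", "folder/", "folder///"]

# One or more trailing slashes after "folder".
pattern = re.compile(r"^folder\/+$")
matched = [p for p in paths if pattern.search(p)]
print(matched)  # ['folder/', 'folder///']

# Collapse any run of trailing slashes into a single slash.
normalised = re.sub(r"\/+$", "/", "folder///")
print(normalised)  # folder/
```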

 

* (Asterisk)

The asterisk combines the behaviour of the question mark and the plus symbol: it matches the preceding character or group zero or more times.
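Continuing with the trailing-slash scenario as an assumed example, this Python sketch shows the asterisk accepting no slash, one slash, or many:

```python
import re

paths = ["folder", "folder/", "folder///", "folder/page"]

# Zero or more slashes may follow "folder", and nothing else.
pattern = re.compile(r"^folder\/*$")
any_slashes = [p for p in paths if pattern.search(p)]

print(any_slashes)  # ['folder', 'folder/', 'folder///']
```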

 

. (Dot)

A dot matches any single character other than a new line, i.e. a letter, number, punctuation mark, etc. The best example of the dot’s application is when it’s used in conjunction with the asterisk to match any number of any character(s). An example of its single use would be where you match different

 

( ) (Parenthesis)

A parenthesis is used to group characters together. To contextualise this metacharacter, think of author names where you want to capture the author’s real name and shortened name/nickname, e.g. Rob(ert)? would capture both Rob and Robert.
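The Rob(ert)? example can be checked with a short Python sketch. The two extra names are made-up non-matches:

```python
import re

names = ["Rob", "Robert", "Robe", "Roberta"]

# The group "ert" is optional as a whole: all three letters or none.
pattern = re.compile(r"Rob(ert)?")
author_names = [n for n in names if pattern.fullmatch(n)]

print(author_names)  # ['Rob', 'Robert']
```

Using fullmatch means the entire name must fit the pattern, which is why Robe and Roberta are left out.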

 

[ ] (Square Brackets)

Square brackets are used to match any single character from a set, such as letters [a-z] or numbers [0-9]. If you visualise WordPress, often when a second duplicate page is created in conjunction with the first, WordPress will suffix the second page with “-2”. To capture all pages within this URL set, you would use ^page-[0-9]/.

Upper and lower-case variables can be captured by placing both sets of cases within the bracket, i.e. [A-Za-z]. Similarly, [Ii]mpression would capture both Impression and impression.
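Both uses of square brackets can be sketched in Python as follows. The page paths are hypothetical:

```python
import re

# Hypothetical paths for duplicate pages with a numeric suffix.
paths = ["page-2/", "page-7/", "page-a/", "page/"]

# [0-9] matches exactly one digit after "page-".
numbered = [p for p in paths if re.search(r"^page-[0-9]/", p)]
print(numbered)  # ['page-2/', 'page-7/']

# [Ii] matches either an upper-case or a lower-case "i".
brand = re.findall(r"[Ii]mpression", "Impression and impression")
print(brand)  # ['Impression', 'impression']
```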

 

{ } (Curly Brackets)

Curly Brackets can be used to capture characters according to how many times they repeat. For example, [0-9]{3} would capture any run of exactly 3 digits, while [a-z]{5} would capture exactly 5 lower case characters in a row.

Curly Brackets can also match ranges. For example, [a-z]{4,10} would match any run of lower case letters that is between 4 and 10 characters long. In addition, folder\/{0,2} would match the folder followed by zero, one, or two trailing slashes.
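The three patterns can be tried out in Python as below. This is a sketch with made-up sample strings, and fullmatch or anchors are used so the whole string is tested:

```python
import re

# Exactly three digits.
three_digits = [s for s in ["42", "123", "1234"] if re.fullmatch(r"[0-9]{3}", s)]
print(three_digits)  # ['123']

# Between 4 and 10 lower-case letters.
words = ["seo", "regex", "optimisation"]
mid_length = [w for w in words if re.fullmatch(r"[a-z]{4,10}", w)]
print(mid_length)  # ['regex']

# Zero, one, or two trailing slashes.
paths = ["folder", "folder/", "folder//", "folder///"]
up_to_two = [p for p in paths if re.search(r"^folder\/{0,2}$", p)]
print(up_to_two)  # ['folder', 'folder/', 'folder//']
```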

 

Useful Combinations

 

(.*)

This combination of parentheses, dot and asterisk matches any sequence of characters (including none) and captures it as a group. For example, if you’re looking to extract any page URL beginning with the directory “folder”, you would use ^/folder/(.*)
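As a sketch in Python, the captured part can be read back from the match object. The paths are hypothetical:

```python
import re

pattern = re.compile(r"^/folder/(.*)")

# Everything after "/folder/" lands in the first capture group.
match = pattern.search("/folder/page-a/sub-page")
captured = match.group(1)
print(captured)  # page-a/sub-page

# A path that does not start with /folder/ produces no match at all.
no_match = pattern.search("/other/folder/page")
print(no_match)  # None
```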

 

$1

This allows you to reuse whatever was captured by the first (.*). It is commonly used in rewrite rules. For example, if you want to change the directory /service-a/ to /service-b/, whilst keeping everything either side of the folder the same, you would format it as: (.*)/service-a/(.*) $1/service-b/$2 (each additional (.*) is referenced by another dollar sign followed by the next consecutive number).
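The same substitution can be sketched in Python. Note that Python’s re.sub writes back-references as \1 and \2 rather than $1 and $2; the URL paths are hypothetical:

```python
import re

pattern = re.compile(r"(.*)/service-a/(.*)")

# \1 and \2 play the role of $1 and $2 in a rewrite rule.
rewritten = pattern.sub(r"\1/service-b/\2", "/uk/service-a/pricing")
print(rewritten)  # /uk/service-b/pricing

# With nothing before the folder, the first group is simply empty.
rewritten_root = pattern.sub(r"\1/service-b/\2", "/service-a/page")
print(rewritten_root)  # /service-b/page
```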

 

[^…]

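If you place a caret at the start of a set of square brackets, the set is negated: it matches any single character except those defined within the brackets. For instance, placing [^0-9] after a URL pattern requires the next character to be something other than a digit, so numbered variants of the URL would not be captured.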
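A minimal Python sketch of a negated set, using hypothetical page paths and a plus sign to allow more than one non-digit character:

```python
import re

paths = ["page-2/", "page-two/"]

# [^0-9]+ accepts one or more characters that are NOT digits.
pattern = re.compile(r"^page-[^0-9]+/$")
unnumbered = [p for p in paths if pattern.search(p)]

print(unnumbered)  # ['page-two/']
```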

 

\d

Similar to [0-9], this matches any single digit.

 

\w

Whilst with square brackets you need to define specifically which case to capture, [A-Z] or [a-z], \w is a shorthand that captures any letter in either case, as well as any number or underscore.

 

\W

Captures any character that is NOT a letter, number or underscore.

 

\s

This command matches a single whitespace character, such as a space, tab, or line break.
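The four shorthand commands above can be compared side by side in Python. The sample text is made up for illustration:

```python
import re

text = "Order 66, item_a!"

digits = re.findall(r"\d", text)       # each individual digit
words = re.findall(r"\w+", text)       # runs of letters, numbers, underscores
non_words = re.findall(r"\W", text)    # everything \w does not match
spaces = re.findall(r"\s", text)       # whitespace characters only

print(digits)     # ['6', '6']
print(words)      # ['Order', '66', 'item_a']
print(non_words)  # [' ', ',', ' ', '!']
print(spaces)     # [' ', ' ']
```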

 

Summary of RegEx Metacharacters

Just passing by? Here’s a quick summary of each metacharacter for reference:

Metacharacter    Definition
\                Uses a metacharacter’s literal meaning
.                Matches any single character (except a new line)
^                Begins with
$                Ends with
|                Or: used to separate alternative patterns
?                Matches the pattern with or without the preceding character
+                Matches the preceding character one or more times
*                Matches the preceding character zero or more times
( )              Used to group characters together and to capture groups
[ ]              Matches any one of a set of letters or numbers (case sensitive)
{ }              Placed after a character or set, determines how many repetitions to match
(.*)             Captures any sequence of characters
$1               Refers back to whatever the first (.*) captured
[^…]             Matches any character except those within the square brackets
\d               Captures any digit
\w               Captures any letter, number, or underscore, regardless of case
\W               Captures a character that is not a letter, number, or underscore
\s               Captures a whitespace character

 

Where can Regular Expression be applied?

Google Analytics

Regular Expression can be used throughout Google Analytics, allowing your reporting analysis to become even more powerful and defined. For example, using the pipe function, you can define which IP Addresses or Referring Domains you’d like filtering from your performance data.
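Before pasting a pipe-separated filter pattern into a reporting tool, it can help to test its logic on sample data. The sketch below does this in Python with made-up IP addresses from reserved documentation ranges. It only checks the pattern’s logic; it does not call any Google Analytics feature, and the tool’s regex flavour may differ slightly from Python’s:

```python
import re

# Hypothetical internal IP addresses to filter out of reporting.
internal_ips = re.compile(r"^(192\.0\.2\.1|198\.51\.100\.7)$")

hits = ["192.0.2.1", "192.0.2.10", "198.51.100.7", "203.0.113.5"]

# Keep only the hits that do NOT come from an internal address.
external = [ip for ip in hits if not internal_ips.search(ip)]

print(external)  # ['192.0.2.10', '203.0.113.5']
```

The escaped dots and the caret and dollar anchors are what stop 192.0.2.10 from being filtered along with 192.0.2.1.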

Regular Expression for Google Analytics from LunaMetrics is a great resource if you’re looking to learn more about using Regular Expression in conjunction with Google Analytics.

 

Screaming Frog

The industry standard SEO crawling software becomes even more powerful with the incorporation of Regular Expression. For instance, since Screaming Frog already crawls and extracts several SEO components from a webpage, you can create further objects to extract using RegEx. To start with, head to Configuration > Custom > Extraction. From there, set the drop-down menu to “RegEx” and enter your Regular Expression, depending on which component of the source code you wish to capture.

Screaming Frog can do more with Regular Expression than just “Extraction”. Read this guide from Screaming Frog for all of Regular Expression’s applications.

 

Redirects via htaccess

In on-page SEO, Regular Expression is most commonly used when setting up Redirects via .htaccess. Saved as a text file in your site’s root directory, a .htaccess file allows you to avoid 404 errors and technical duplicate content whilst restoring and consolidating link equity via various redirect rules. Below is what a typical RewriteRule may look like:

RewriteRule ^service/search-engine-optimisation/?$ /service/seo/ [R=301,L]

Using the definitions of each metacharacter outlined previously in this guide, we can now deconstruct this RewriteRule. Firstly, after the type of rule is stated (RewriteRule), the regular expression starts with a caret, meaning the URL path must begin with the following folder: service/search-engine-optimisation/ . The question mark then makes the trailing slash optional, so the directory is matched with or without it. Finally, the dollar sign signals the end of the regular expression, i.e. no more folders or pages may follow /search-engine-optimisation/. The RewriteRule then continues with the URL string the old URL should be redirected to.

Please note, [R=301,L] is not part of Regular Expression. Rather, it’s part of mod_rewrite’s syntax that’s simply stating the type of redirect is a 301 (permanent) with the “L” indicating it’s the last command to be processed.
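To see which paths the pattern in this rule would catch, it can be tested outside the server. The Python sketch below only mimics the matching step; redirect_target is a hypothetical helper, not part of mod_rewrite, and the paths are written without a leading slash on the assumption that this is how the rule sees them in a .htaccess file:

```python
import re

# The pattern and target from the RewriteRule above.
pattern = re.compile(r"^service/search-engine-optimisation/?$")
target = "/service/seo/"

def redirect_target(path):
    """Return the redirect target if the path matches the rule, else None."""
    return target if pattern.search(path) else None

print(redirect_target("service/search-engine-optimisation"))        # /service/seo/
print(redirect_target("service/search-engine-optimisation/"))       # /service/seo/
print(redirect_target("service/search-engine-optimisation/audit"))  # None
```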

URL Rewriting for Beginners is an excellent resource if you’re looking to master redirects on your site.

 

Regular Expression is nothing to be overwhelmed by. With the knowledge of each metacharacter in place, it’s only a case of going through each element of the command and deconstructing the characters one-by-one. If you have any outstanding questions regarding regular expression or any of the SEO practices it’s applicable to (Google Analytics, Screaming Frog, redirects), then feel free to get in touch with our Technical SEO team.

Petar Jovetic

Head of SEO

Petar is the Head of SEO at Impression and specialises in content, technical SEO and digital strategy. He's guaranteed to be the only guitar-wielding, digital marketing-talking, Montenegrin you know.

