Data Science Terminology

An Introduction To REGEX


The REGEX! Despite sounding like something out of Jurassic Park, REGEX is simply short for 'Regular Expression' (I wish it could have been more exciting too!).

What are Regular Expressions (REGEX)? 

A Regular Expression is a sequence of characters that specifies a search pattern [Source: Wikipedia]. I don't know why Wikipedia makes everything sound so cryptic. All this is saying is that a Regular Expression is a string of text that describes a pattern, which you can then use to match, locate, and manage text.

If you're new to Machine Learning, here's a fantastic article for you to dive into the basics! 

Python ships with Regex support in its standard library (the re module), so there is nothing extra to download. Regular Expressions can also be used from command-line tools or within text editors, for example when you want to search for a file.
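
As a minimal sketch of what "built in" means in practice: a plain import of the re module is all that is needed. The sample sentence and the pattern below are made up for illustration.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Order 66 was shipped on Monday."

# re.search scans the string and returns the first match, or None.
match = re.search(r"shipped", text)

if match:
    # group() is the matched text, span() its start and end index.
    print(match.group(), match.span())
```

If the pattern is not found, re.search returns None, which is why the result is checked before it is used.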

In the initial phases, trying to understand Regex can feel like reading hieroglyphics. Nonetheless, mastering Regex could save you hours if you're working with unstructured data, namely text, or if you need to parse a large database. Other use cases include web scraping, searching and replacing, filtering, and form input validation.
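
To make a few of these use cases concrete, here is a rough sketch of validation, filtering, and search-and-replace. The records, the helper name is_valid_email, and the e-mail pattern are all assumptions made for this example; the pattern is deliberately simplified and would not be good enough for real-world validation.

```python
import re

# Hypothetical raw records, invented for this example.
records = ["alice@example.com", "not an email", "bob@example.org"]

# Deliberately simplified e-mail pattern; real validation needs more care.
EMAIL = re.compile(r"[a-z0-9.]+@[a-z0-9.]+\.[a-z]+")


def is_valid_email(value):
    """Form input validation: True only if the whole string fits the pattern."""
    return EMAIL.fullmatch(value) is not None


# Filtering: keep only the records that pass validation.
valid = [record for record in records if is_valid_email(record)]

# Searching and replacing: mask every run of digits in a string.
masked = re.sub(r"[0-9]+", "#", "Call 555 1234 before 9")

print(valid)
print(masked)
```

The symbols used in these patterns are covered in The Basics and Escape Sequences sections of this article.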

To get a full grasp of what's going on, there are some key terms you ought to know:

  • Character: a letter, digit, or symbol. A character that doesn't have a special meaning in regex simply matches itself.
  • Letter: A-Z, a-z
  • Digit: 0-9
  • Symbol: !"£$%^&*(){}[];:'.,/
  • Whitespace: Space or Tab
  • Test String: the text that the pattern is matched against
  • Pattern: the regular expression itself, i.e. a description of what you are searching for
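
The terms above can be sketched in a few lines of Python. The test string and the two patterns are invented for this example; they only use ordinary characters, so each one matches exactly the text it spells out.

```python
import re

# Hypothetical test string, invented for this example.
test_string = "The cat sat on the mat."

# A pattern made only of ordinary characters matches that exact text.
pattern = "at"

# findall returns every non-overlapping match of the pattern.
matches = re.findall(pattern, test_string)

# Matching is case-sensitive by default, so "The" is not counted here.
lowercase_the = re.findall("the", test_string)

print(matches)
print(lowercase_the)
```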

Note: A great tool to use when learning Regex is Regex101. Regex101 allows you to build and test regex patterns against test strings.

The Basics 

At first glance, Regex can look quite overwhelming. However, like in most situations, a firm grasp of the basics will give you a strong enough foundation to go on to master some of the more complicated patterns. Let's take a look at some of the basic patterns we can use:

Pattern      Function
^            Matches the beginning of the line
$            Matches the end of the line
.            Matches any single character (by default, any character except a new line)
(pattern)    Captures whatever is matched within the parentheses
(?:pattern)  Non-capturing group: groups the pattern without capturing the match
[pattern]    Matches any single character listed in the brackets
[^pattern]   Matches any single character that is not listed in the brackets
[a-z]        Matches any single character between "a" and "z". Capital letters work too, i.e. [A-Z]
{x}          Matches the preceding item exactly "x" times
{x,}         Matches the preceding item "x" or more times
{x,y}        Matches the preceding item from "x" to "y" times, both inclusive
*            Matches the preceding item 0 or more times
+            Matches the preceding item 1 or more times
?            Matches the preceding item 0 or 1 time. Also makes a preceding quantifier non-greedy
\            Escapes the character after the backslash. Also used to create an escape sequence (more information below)
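
Here is a small sketch that exercises several of the basic patterns with Python's re module. The test string and variable names are made up for this example, and it is only meant to illustrate the patterns, not to cover every case.

```python
import re

# Hypothetical test string, invented for this example.
line = "cat bat rat 2024 <b>one</b> <b>two</b>"

# ^ anchors the match to the start, $ to the end.
starts_with_cat = re.search(r"^cat", line) is not None
ends_with_cat = re.search(r"cat$", line) is not None

# [bcr] matches one character from the set; [^c] matches anything but "c".
words = re.findall(r"[bcr]at", line)
not_cat = re.findall(r"[^c]at", line)

# {4} repeats the preceding item exactly four times.
year = re.findall(r"[0-9]{4}", line)

# (pattern) captures; findall then returns only the captured text.
# .* is greedy and grabs as much as it can; .*? stops as early as it can.
greedy = re.findall(r"<b>(.*)</b>", line)
lazy = re.findall(r"<b>(.*?)</b>", line)

# ? makes the preceding item optional (0 or 1 time).
spellings = re.findall(r"colou?r", "color colour")

print(words, not_cat, year)
print(greedy)
print(lazy)
print(spellings)
```

The greedy and non-greedy results differ because the greedy version runs all the way to the last closing tag, while the non-greedy version stops at the first one it finds.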

 

Escape Sequences 

To match a character that is itself a regex special character, we need to escape it by prefixing it with a "\". However, the "\" also has uses beyond escaping special characters: followed by certain letters, it forms an escape sequence with a meaning of its own. Here are some examples of escape sequences:

Pattern             Function
\SPECIAL CHARACTER  A backslash before any special character escapes it, so that it matches literally
\d                  Matches any digit
\D                  Matches any character that is not a digit
\n                  Matches a new line
\s                  Matches any whitespace character (i.e. space, \t, \r, \n)
\S                  Matches any non-whitespace character
\t                  Matches a tab
\w                  Matches any word character (a letter, a digit, or an underscore)
\W                  Matches any non-word character
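
The escape sequences can be tried out with a short sketch like the one below. The sample text is invented for this example. The patterns are written as raw strings (the r prefix) so that Python passes the backslashes through to the regex engine untouched.

```python
import re

# Hypothetical test string, invented for this example (it contains a tab).
text = "Total: $3.50 for item_42\tqty 7"

# \d matches a digit, so \d+ finds each run of digits.
digits = re.findall(r"\d+", text)

# \$ and \. match the literal characters instead of acting as special characters.
price = re.findall(r"\$\d+\.\d+", text)

# \w matches letters, digits, and the underscore.
words = re.findall(r"\w+", text)

# \s matches whitespace (here a space or the tab), so this splits on both.
parts = re.split(r"\s+", text)

print(digits)
print(price)
print(words)
print(parts)
```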

 

Wrap Up 

While using the AI & Analytics Engine, there may be occasions where you need to leverage Regex to match patterns within your data. Although this article is by no means an extensive introduction to Regex, it's enough to kickstart your journey using the tool. If you wish to dive deeper into this topic, a tutorial I'd recommend is the Learning Regular Expressions tutorial by Corey Schafer on YouTube.

Be sure to book a demo to find out how the team at PI.EXCHANGE can help you get your project over the line. 
