We live in an information age where large volumes of data abound and the ability to extract meaningful information from data is a key differentiator for success. Fields such as analytics, data mining and data science are devoted to the study of data. In this article we will look at an essential, simple and powerful tool in the data scientist’s toolbox: the regular expression, or regex for short. We will learn about regexes and how to use them in Python scripts to process textual data.

Text is one of the basic forms of data, and humans use it to communicate and express themselves in web pages, blog posts, documents, Twitter/RSS feeds, etc. This is where regular expressions are handy and powerful. Be it filtering data from web pages, data analytics or text mining, regular expressions are a widely used tool for these tasks. They make text processing tasks, such as those in natural language processing (NLP), simpler, reducing the effort, time and errors involved in writing string-handling scripts by hand.

In this article, we will understand what regular expressions are and how they can be used in Python. Next, we will walk through the usage and applications of commonly used regular expressions.

By the end of the article, you will have learned how to leverage the power of regular expressions to automate your day-to-day text processing tasks.

What is a Regular Expression?

A regular expression (RE or regex) is a sequence of characters which describes textual patterns. Using regular expressions we can match input data for certain patterns (aka searching), extract matching strings (filtering, splitting) as well as replace occurrences of patterns with substitutions, all with a minimum amount of code.

Most programming languages have built-in support for defining and operating with regular expressions. Perl, Python and Java are some notable programming languages with first-class support for regular expressions. The standard library functions in such languages provide fast, robust and well-tested implementations of the regular expression operations (searching, filtering, etc.), which makes it easy to rapidly produce high-quality applications that process text efficiently.

Getting started with Python Regular expressions

Python provides a built-in module called re to deal with regular expressions. To import the re module, use:

import re

The re module provides a set of functions to perform common operations using regular expressions.

Searching for Patterns in a String

One of the most common tasks in text processing is to check whether a string contains a certain pattern. For instance, you may want to perform an operation on a string only if it contains a number. Or, you may want to validate a password by ensuring it contains numbers and special characters. The `match` and `search` operations of the re module provide this capability.
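As a rough sketch of the password check mentioned above, re.search() can be used to test for each required kind of character separately. The helper name looks_strong and the definition of a "special character" used here are assumptions made for illustration; the pattern syntax itself is explained later in the article.

```python
import re

def looks_strong(password):
    """Return True if the password contains a digit and a special character.

    Hypothetical helper: the rule set is an assumption for illustration.
    """
    has_digit = re.search(r"\d", password) is not None
    # "Special" is assumed to mean: not a letter, digit, underscore or whitespace.
    has_special = re.search(r"[^\w\s]", password) is not None
    return has_digit and has_special

print(looks_strong("babbage1837"))   # False: no special character
print(looks_strong("babbage#1837"))  # True
```

Running one small search per rule keeps each pattern easy to read, compared with packing every rule into a single large regex.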

Python offers two primitive operations based on regular expressions: the re.match() function checks for a pattern match at the beginning of the string, whereas re.search() checks for a pattern match anywhere in the string. Let’s have a look at how these functions can be used:

The re.match() function

The re.match() function checks if the RE matches at the beginning of the string. For example, initialise a variable “text” with some text, as follows:

text = ['Charles Babbage is regarded as the father of computing.', 'Regular expressions are used in search engines.']

Let’s write a simple regular expression that matches a string of any length containing anything as long as it starts with the letter C:

regex = r"C.*"

For now, let’s not worry about how the declaration above is interpreted and assume that the above statement creates a variable called regex that matches strings starting with C.

We can test if the strings in text match the regex as shown below:

for line in text:
    ans = re.match(regex, line)
    print(type(ans))  # a match object on success, NoneType otherwise
    if ans:
        print(ans.group(0))

Go ahead and run that code. Below is a screenshot of a Python session with this code running.

Regex Match Search Example 1


The first string matches this regex, since it starts with the character “C”, whereas the second string starts with the character “R” and does not match the regex. The `match` function returns a match object if a match is found (its type is shown as re.Match in Python 3.7 and later, and as _sre.SRE_Match in older versions); otherwise it returns None.

In Python, regular expressions are usually written as raw string literals. A raw string literal has the prefix r immediately before the opening quote. Unlike in normal string literals, Python does not treat the backslash '\' as an escape character inside raw string literals, so sequences such as '\d' or '\b' reach the regex engine unchanged. This is important since backslash sequences have a different meaning in regular expression syntax than they do in standard Python string literals. More on this later.
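The difference between the two kinds of literal can be seen in a small sketch. The variable names and the sample sentence are chosen here purely for illustration; '\b' is used because it means "backspace" to Python but "word boundary" to the regex engine.

```python
import re

# In a normal string literal, "\b" is a single backspace character.
normal = "\bcat\b"
# In a raw string literal the backslash is kept, so the regex engine
# receives \b and interprets it as a word boundary.
raw = r"\bcat\b"

print(len(normal), len(raw))  # 5 7

print(re.search(raw, "the cat sat") is not None)     # True
print(re.search(normal, "the cat sat") is not None)  # False
```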

Once a match is found, we can get the part of the string that matched the pattern using the group() method on the returned match object. We can get the entire matching string by passing 0 as the argument.

ans.group(0)

Sample Output:

Charles Babbage is regarded as the father of computing.
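A match object carries a little more information than the matched text. The sketch below reuses the first example sentence as a standalone snippet; calling group() with no argument is equivalent to group(0), and span() reports where the match starts and ends.

```python
import re

line = "Charles Babbage is regarded as the father of computing."
ans = re.match(r"C.*", line)

if ans:
    print(ans.group())  # same as ans.group(0): the whole matched text
    print(ans.span())   # (start, end) positions of the match in the string
```

Checking the result with `if ans:` before calling group() avoids an AttributeError when there is no match and the function has returned None.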

Building blocks of regular expressions

In this section we will look at the elements that make up a regex and how regexes are built. A regex is made up of specifiers such as character classes, repeaters and identifiers, which can be combined into groups. Specifiers are pieces of pattern syntax that match particular kinds of text, and each has its own format for describing the desired pattern. Let’s look at the common specifiers:

Identifiers

An identifier matches a subset of characters, e.g., lowercase letters, numeric digits or whitespace. Regex provides a list of handy identifiers to match different subsets. Some frequently used identifiers are:

  • \d = matches a digit (numeric character)
  • \D = matches any character that is not a digit
  • \s = matches a whitespace character (e.g., space, TAB, newline)
  • \S = matches any character that is not whitespace
  • \w = matches a word character: a letter, a digit or the underscore
  • \W = matches any character that is not a word character
  • \b = matches a word boundary, i.e., the position between a word character and a non-word character (such as a space, hyphen or colon), or between a word character and the start or end of the string; it does not consume any characters
  • . = matches any character, except for a new line. Hence, it is called the wildcard operator. Thus, “.*” will match any character, any number of times.

Note: In the above regex example and all others in this section we omit the leading r from the regex string literal for the sake of readability. Any literal given here should be declared as a raw string literal when used in Python code.
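To see the identifiers listed above in action, here is a minimal sketch that uses re.findall() (covered later in this article) to list everything each identifier matches. The sample string is made up for illustration.

```python
import re

sample = "Room 42, floor_3"

print(re.findall(r"\d", sample))    # ['4', '2', '3']
print(re.findall(r"\D+", sample))   # ['Room ', ', floor_']
print(re.findall(r"\s", sample))    # [' ', ' ']
print(re.findall(r"\w+", sample))   # ['Room', '42', 'floor_3']
# \b matches a position, not a character: \b\w is the first character of each word.
print(re.findall(r"\b\w", sample))  # ['R', '4', 'f']
```

Note that the underscore in floor_3 counts as a word character, so \w+ keeps it inside the word.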

Repeaters

A repeater specifies how many times the preceding element (a character, character class or group) may occur. Below are some commonly used repeaters.

The `*` symbol

The asterisk operator indicates 0 or more repetitions of the preceding element, as many as possible. ‘ab*’ will match ‘a’, ‘ab’, ‘abb’ or ‘a’ followed by any number of b’s.

The `+` symbol

The plus operator indicates 1 or more repetitions of the preceding element, as many as possible. ‘ab+’ will match ‘ab’, ‘abb’ or ‘a’ followed by any larger number of b’s; it will not match ‘a’ on its own.

The `?` symbol

This symbol specifies that the preceding element occurs at most once, i.e., it may or may not be present in the string to be matched. For example, ‘ab?’ will match ‘a’ and ‘ab’.

The `{n}` curly braces

The curly braces specify that the preceding element is to be matched exactly n times. b{4} will match exactly four ‘b’ characters, no more and no fewer.

The symbols *,+,? and {} are called repeaters, as they specify the number of times the preceding element is repeated.
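The behaviour of the four repeaters can be compared side by side. The sketch below uses re.fullmatch(), which succeeds only when the whole string matches the pattern; the helper full_matches and the candidate list are hypothetical names introduced for this example.

```python
import re

candidates = ["a", "ab", "abb", "abbbb"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full_matches(r"ab*"))    # ['a', 'ab', 'abb', 'abbbb']
print(full_matches(r"ab+"))    # ['ab', 'abb', 'abbbb']
print(full_matches(r"ab?"))    # ['a', 'ab']
print(full_matches(r"ab{4}"))  # ['abbbb']
```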

Miscellaneous specifiers

The `[]` square brackets

The square brackets match any single character enclosed within them. For example, [aeiou] will match any of the lowercase vowels while [a-z] will match any character from a to z (case-sensitive). This is also called a character class.

The `|`

The vertical bar is used to separate alternatives. photo|foto matches either “photo” or “foto”.
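Character classes and alternatives can be tried out with a few one-liners. This is only a sketch with made-up sample strings, using re.findall() (covered later in this article) to show what each pattern picks up.

```python
import re

# A character class matches one character at a time.
print(re.findall(r"[aeiou]", "regex"))       # ['e', 'e']
# [a-z] is case-sensitive, so the capital letters are skipped.
print(re.findall(r"[a-z]+", "Ada Lovelace"))  # ['da', 'ovelace']
# The vertical bar matches either alternative.
print(re.findall(r"photo|foto", "a photo and a foto"))  # ['photo', 'foto']
```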

The `^` symbol

The caret symbol anchors the match at the start of the string, except when used inside square brackets. For example, “^I” will match a string starting with “I” but will not match strings that don’t have “I” at the beginning. By default, anchoring a pattern with ^ in re.search() therefore gives essentially the same behaviour as passing the unanchored pattern to re.match().

When used as the first character inside a character class it inverts the matching character set for the character class. For example, “[^aeiou]” will match any character other than a, e, i, o or u.
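Both roles of the caret can be checked with a short sketch. The sample strings are made up for illustration.

```python
import re

# Outside square brackets, ^ anchors the pattern to the start of the string.
print(re.search(r"^I", "I think") is not None)     # True
print(re.search(r"^I", "So I think") is not None)  # False
# re.match() without the caret behaves the same way here.
print(re.match(r"I", "So I think") is not None)    # False

# As the first character inside square brackets, ^ negates the character class.
print(re.findall(r"[^aeiou]", "regex"))  # ['r', 'g', 'x']
```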

The `$` symbol

The dollar symbol anchors the match at the end of the string. For example, “Engine$” will match a string ending with “Engine”.

The `()` parentheses

Parentheses are used for grouping different symbols of an RE, to act as a single block. ([a-z]\d+) will match a lowercase letter followed by one or more digits. The whole match is treated as a group and can be extracted from the string. More on this later.
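As a small illustration of extracting groups, the sketch below splits the pattern above into two groups, one for the letter and one for the digits, so that each part can be read separately. The sample sentence is made up; group(1) and group(2) refer to the groups in the order their opening parentheses appear.

```python
import re

ans = re.search(r"([a-z])(\d+)", "seat b42 is free")

print(ans.group(0))  # b42  (the whole match)
print(ans.group(1))  # b    (first group)
print(ans.group(2))  # 42   (second group)
print(ans.groups())  # ('b', '42')
```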

Typical use-cases for Python Regular Expressions

Now that we have discussed the building blocks of regexes, let’s do some hands-on regex writing.

The re.match() function revisited

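It is possible to match letters, both uppercase and lowercase, using the match function.

# Assumed example input; any string that starts with a word will do.
sentence = "The Analytical Engine was invented by Charles Babbage"
ans = re.match(r"[a-zA-Z]+", sentence)
print(ans.group(0))

The above regex matches the run of letters at the start of the string, i.e., its first word. The `+` operator specifies that the match must contain at least one letter.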

Sample Output:

The

As you see, the regex matches the first word found in the string. After the word “The”, there is a space, which is not a letter. So, the matching stops there and the function returns only “The”. Now suppose a string starts with a number. In this case, the match() function returns None, even though the string has letters following the number. For example,

sentence = "1837 was the year when Charles Babbage invented the Analytical Engine"
ans = re.match(r"[a-zA-Z]+", sentence)
print(type(ans))  # <class 'NoneType'>

The above call returns None, because the match function only tries to match at the beginning of the string. Though the string contains letters, they are preceded by a number, so match() finds nothing. This problem can be avoided using the search() function.

The re.search() function

The search() function looks for a specified pattern in a string, similar to the match() function. The difference is that search() scans the whole string and returns the first match found anywhere in it, rather than matching only at the beginning. Let’s try the same example using the search() function.

sentence = "1837 was the year when Charles Babbage invented the Analytical Engine"
ans = re.search(r"[a-zA-Z]+", sentence)
print(ans.group(0))

Sample Output:

was

This is because the search() function returns a match found anywhere in the string, even though the string does not start with a letter.

Matching strings from start and from end

We can use a regex to check whether a string starts with a particular pattern using the caret operator ^. Similarly, the dollar operator $ is used to check whether a string ends with a given pattern. Let’s write a regex to understand this:

sentence = "1837 was the year when Charles Babbage invented the Analytical Engine"
# ^1837 matches only if the string begins with the number 1837
if re.search(r"^1837", sentence):
    print("The string starts with a number")
else:
    print("The string does not start with a number")

Sample Output:

The string starts with a number
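The dollar operator can be tried in the same way. The following sketch uses a made-up sentence, and also combines ^ and $ to require that the whole string consists of exactly four digits.

```python
import re

sentence = "Charles Babbage invented the Analytical Engine"

print(re.search(r"Engine$", sentence) is not None)  # True: the string ends with "Engine"
print(re.search(r"^Engine", sentence) is not None)  # False: it does not start with it

# Using both anchors forces the pattern to cover the entire string.
print(re.search(r"^\d{4}$", "1837") is not None)     # True
print(re.search(r"^\d{4}$", "1837 AD") is not None)  # False
```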

The re.sub() function

We have explored using regex to find a pattern in a string. Let’s move ahead and see how to substitute text in a string. For this, we use the sub() function. The sub() function searches for a particular pattern in a string and replaces every occurrence of it with a replacement string.

sentence = "Analytical Engine was invented in the year 1837"
ans = re.sub(r"Analytical Engine", "Electric Telegraph", sentence)
print(ans)

As you see, the first parameter of the sub() function is the regex that searches for a pattern to substitute. The second parameter contains the new text you wish to substitute for the old one. The third parameter is the string on which the “sub” operation is performed.

Sample Output:

Electric Telegraph was invented in the year 1837
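By default, sub() replaces every occurrence of the pattern. The optional count argument of re.sub() limits the number of replacements, as this small sketch with a made-up string shows.

```python
import re

sentence = "Engine one, Engine two"

# Without count, all occurrences are replaced.
print(re.sub(r"Engine", "Machine", sentence))           # Machine one, Machine two
# With count=1, only the first occurrence is replaced.
print(re.sub(r"Engine", "Machine", sentence, count=1))  # Machine one, Engine two
```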

Writing Regexes with identifiers

Let’s look at a regex that uses identifiers, with an example. To remove the digits from a string, we use the regex below:

sentence = "Charles Babbage invented the Analytical Engine in the year 1837"
ans = re.sub(r"\d", "", sentence)
print(ans)

The above script locates digits in a string using the identifier “\d” and replaces each of them with an empty string.

Sample Output:

Charles Babbage invented the Analytical Engine in the year
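Identifiers can be combined with repeaters. In the sketch below (sample text made up for illustration), “\d” replaces each digit individually, whereas “\d+” treats a whole run of digits as one match and replaces it once.

```python
import re

sentence = "in the year 1837"

print(re.sub(r"\d", "#", sentence))   # in the year ####
print(re.sub(r"\d+", "#", sentence))  # in the year #
```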

Splitting a string

The re module provides the split() function to split strings. This function returns a list of the split tokens. For example, the following call splits a string wherever a comma is found, also consuming any whitespace after the comma:

sentence = "Charles Babbage was considered to be the father of computing, after his invention of the Analytical Engine, in 1837"
ans = re.split(r",\s*", sentence)
print(ans)

Sample Output:

['Charles Babbage was considered to be the father of computing', 'after his invention of the Analytical Engine', 'in 1837']
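Because the separator is itself a regex, split() can handle several delimiters at once, and the optional maxsplit argument limits how many splits are made. The names in this sketch are sample data chosen for illustration.

```python
import re

# Split on either a comma or a semicolon, plus any whitespace that follows.
print(re.split(r"[,;]\s*", "Babbage, Lovelace; Turing"))
# ['Babbage', 'Lovelace', 'Turing']

# maxsplit=1 stops after the first split and leaves the rest intact.
print(re.split(r",\s*", "Babbage, Lovelace, Turing", maxsplit=1))
# ['Babbage', 'Lovelace, Turing']
```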

The re.findall() function

The findall() function returns a list that contains all the non-overlapping matches of a pattern in a string.

Let’s write a script to extract the domain part from a list of email IDs using the findall() function:

# \. matches a literal dot; an unescaped . would match any character
result = re.findall(r'@\w+\.\w+', 'joe.sam@gmail.com, reema@yahoo.in, demo.user@samskitchen.com')
print(result)

Sample Output:

['@gmail.com', '@yahoo.in', '@samskitchen.com']
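When the pattern contains groups, findall() returns the text captured by the groups instead of the whole match. The sketch below reuses the same email list; the variable name emails is introduced here for illustration.

```python
import re

emails = "joe.sam@gmail.com, reema@yahoo.in, demo.user@samskitchen.com"

# One group: a list of the captured strings.
print(re.findall(r"@\w+\.(\w+)", emails))
# ['com', 'in', 'com']

# Two groups: a list of tuples, one tuple per match.
print(re.findall(r"@(\w+)\.(\w+)", emails))
# [('gmail', 'com'), ('yahoo', 'in'), ('samskitchen', 'com')]
```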

Conclusion

In this article, we understood what regular expressions are and how they can be built from their fundamental building blocks. We also looked at the re module in Python and its methods for leveraging regular expressions. Regular expressions are a simple yet powerful tool in text processing and we hope you enjoyed learning about them as much as we did building this article. Where could you use regex in your work/ hobby projects? Leave a comment below.