Regular Expressions in Python

A regular expression is a sequence of characters that defines a search pattern, that is, a pattern describing a set of strings. These patterns are used for searching and editing text, replacing one substring with another, and so on.

The simplest use of a regular expression is searching for a word in a text file or on a web page. For instance, if we search for “python”, the string “python” becomes a simple regular expression: a search pattern that matches only “python” and nothing else. A more complicated regular expression can match a larger set of strings.

The re module and match()

You can use regexps through a standard Python module called re. To use anything related to regexps in Python, you first need to import this module.

import re

This module provides several functions that search for matches for your regular expressions in different ways. Let’s get acquainted with one of these functions, match().

It accepts a regular expression pattern (first argument) and a string (second argument) and checks whether there’s a match for the pattern at the start of the string.

regexp = 'instagram'
string = 'instaagram'
res = re.match(regexp, string)

If there’s no match for your regexp at the very start of the string, match() returns None. Otherwise, the function returns a special structure called a match object that contains information about the match that was found.

We won’t go into the nature of this object right now: all we need to know is that a match object is always the result of a successful match, and None is always the result of no match. So, to know whether we have a match, we simply need to check whether the result is None.

result = re.match('instagram', 'instaagram')
print(result is None)

Output: True

Let’s try some other examples! Here is a successful application of the match() function:

result = re.match('hedge', 'hedgehog')
print(result is None)

Output: False

Don’t forget that match() won’t help you find parts of the string that match the pattern but aren’t located at the beginning of the string. Check out this example:

result = re.match('dog', 'bulldog')
print(result is None)

Output: True
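To make the limitation concrete, here is a small sketch that contrasts match() with re.search(), which is covered later in this article. The variable names are only illustrative.

```python
import re

text = 'bulldog'

# match() only checks the start of the string
at_start = re.match('dog', text)

# search() scans the whole string for the first match
anywhere = re.search('dog', text)

print(at_start is None)   # True
print(anywhere is None)   # False
print(anywhere.span())    # (4, 7)
```

The span() call reports where the match was found, which shows that "dog" sits at the end of the string rather than at the start.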

You might also want to note that even if the match is an empty string, the match object is still truthy, because the length of the matching string doesn’t matter: only the presence of a match does.

result = re.match('', 'not an empty string') 
print(result is None)

Output: False

The example above shows that you should be careful with empty patterns: although it might seem counterintuitive, they don’t match only empty strings, they match all strings (at least when you use the match() function and check for a matching substring at the start of the string).
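A quick way to convince yourself of this is to try the empty pattern against several strings. The sample strings below are arbitrary and chosen only for illustration.

```python
import re

empty_matches = []
for text in ['not an empty string', '', '12345']:
    result = re.match('', text)
    # result is a match object whose matched text is the empty string
    empty_matches.append((bool(result), result.group()))

print(empty_matches)
# [(True, ''), (True, ''), (True, '')]
```

Every string produces a match object, and in each case the matched text is empty.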

Remember that regular expressions are case-sensitive by default, that is, it matters whether you use upper or lower case letters in your pattern. The same letter in different cases won’t match.

result = re.match('BOOKS', 'books')
print(result is None)

Output: True
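If you do want to ignore case, the re module provides the re.IGNORECASE flag. The following is a minimal sketch of how it changes the result of the example above; the variable names are illustrative.

```python
import re

# Default behaviour: case matters, so there is no match
strict = re.match('BOOKS', 'books')

# With re.IGNORECASE the same pattern matches regardless of case
relaxed = re.match('BOOKS', 'books', flags=re.IGNORECASE)

print(strict is None)    # True
print(relaxed is None)   # False
print(relaxed.group())   # books
```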

re.findall()

The re.findall() function returns a list of strings containing all non-overlapping matches.

import re

string = 'Lets check for digits 12345678'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

#output
#['12345678']
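The example above finds a single match. As a small additional sketch, here is what findall() returns when there are several matches or none at all. The sample sentence is made up for illustration.

```python
import re

text = 'Order 12 shipped 3 boxes on day 45'

# Each run of digits becomes a separate list item
numbers = re.findall(r'\d+', text)

# With no matches, findall() returns an empty list rather than None
no_numbers = re.findall(r'\d+', 'no digits here')

print(numbers)      # ['12', '3', '45']
print(no_numbers)   # []
```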

re.split()

The re.split() function splits the string wherever there is a match and returns a list of the resulting substrings.

import re

string = 'Lets check for digits 12345678'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

#output
#['Lets check for digits ', '']
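The empty string at the end of the result appears because the match sits at the very end of the input. The sketch below uses a made-up string with several matches and also shows the maxsplit argument, which limits the number of splits.

```python
import re

text = 'one1two22three333four'

# Split at every run of digits
parts = re.split(r'\d+', text)

# maxsplit=1 stops after the first split; the rest is left untouched
first_split_only = re.split(r'\d+', text, maxsplit=1)

print(parts)             # ['one', 'two', 'three', 'four']
print(first_split_only)  # ['one', 'two22three333four']
```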

re.sub()

The re.sub() function returns a string in which the matched occurrences are replaced with the replacement string (the rep variable below).

import re

string = 'Lets check for digits 12345678'

pat = r'\s+'

rep = ''

new_string = re.sub(pat, rep, string) 
print(new_string)

#output
#Letscheckfordigits12345678
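The replacement does not have to be empty. As a sketch, the following replaces whitespace with an underscore and uses the count argument of re.sub() to limit how many occurrences are replaced. The variable names are illustrative.

```python
import re

text = 'Lets check for digits 12345678'

# Replace every run of whitespace with an underscore
underscored = re.sub(r'\s+', '_', text)

# count=1 replaces only the first occurrence
first_only = re.sub(r'\s+', '_', text, count=1)

print(underscored)  # Lets_check_for_digits_12345678
print(first_only)   # Lets_check for digits 12345678
```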

re.search()

The re.search() function takes two arguments: a pattern and a string. It looks for the first location where the RegEx pattern produces a match in the string.

If the search is successful, re.search() returns a match object; if not, it returns None.

import re

string = "Hey, it was pleasure meeting you"

match = re.search(r'\AHey', string)

if match:
  print("Yes")
else:
  print("No match") 

#output
#Yes
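Besides telling you whether a match exists, the match object can report what was matched and where. Here is a minimal sketch using its group(), start(), and end() methods on the same sentence; the variable names are illustrative.

```python
import re

text = 'Hey, it was pleasure meeting you'

found = re.search(r'pleasure', text)

if found:
    matched_text = found.group()             # the text that matched
    position = (found.start(), found.end())  # where it was found
    print(matched_text)  # pleasure
    print(position)      # (12, 20)
```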

Metacharacters

Metacharacters are characters that a RegEx engine interprets in a special way. Here’s a list of metacharacters:

^ : Beginning of line

import re

string = 'apple'
a = re.search('^a', string)
print(a is None)

#prints False because the string starts with "a"

$ : End of line

import re

string = "have a good day"

x = re.findall("day$", string)
if x:
  print("Yes")
else:
  print("No match")

#prints Yes because the string ends with day
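By default, ^ and $ anchor to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of each line. The following sketch shows the difference on a made-up two-line string.

```python
import re

text = 'good day\nbad day'

# Without the flag, ^ matches only at the start of the string
first_word = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ matches at the start of every line
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

# The same applies to $ and line ends
string_end = re.findall(r'day$', text)
line_ends = re.findall(r'day$', text, flags=re.MULTILINE)

print(first_word)   # ['good']
print(line_starts)  # ['good', 'bad']
print(string_end)   # ['day']
print(line_ends)    # ['day', 'day']
```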

| : Or

NOTE: Alternation is not part of POSIX basic regular expressions, but Python’s re module supports it.

import re

string = "have a good day"

#Check if the string contains either "good" or "bad":

x = re.findall("good|bad", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#prints
#['good']
#Yes

. : Match any single character

import re

string = "hello world"

x = re.findall("he..o", string)
print(x)

#output
#['hello']

( ) : Group the regular expression within the parentheses

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("(a|e|u)t", string)
print(x)

#output (findall returns only the text captured by the group)
#['e']
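The result contains only 'e' because, when a pattern has a capturing group, findall() returns the group's content instead of the whole match. If you want grouping without capturing, a non-capturing group (?:...) can be used. A minimal sketch on the same sentence:

```python
import re

text = 'Hey, it was pleasure meeting you'

# Capturing group: findall() returns what the group captured
captured = re.findall(r'(a|e|u)t', text)

# Non-capturing group: findall() returns the whole match
whole = re.findall(r'(?:a|e|u)t', text)

print(captured)  # ['e']
print(whole)     # ['et']
```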

? : Match zero or one of the preceding expression

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("m?ting", string)
print(x)

#output
#['ting']

* : Match zero, one, or many of the preceding expression

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("ou*", string)
print(x)

#output
#['ou']

+ : Match one or many of the preceding expression

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("aaz+", string)
print(x)

#output
#[]
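To compare the three quantifiers ?, * and + side by side, here is a small sketch that applies them to the same letter. The sample words and the full_matches helper are hypothetical and exist only for this illustration; re.fullmatch() requires the whole string to match the pattern.

```python
import re

# Hypothetical sample words, chosen only for illustration
samples = ['clr', 'color', 'colour', 'colouur']

def full_matches(pattern):
    # Keep only the samples that the pattern matches in full
    return [s for s in samples if re.fullmatch(pattern, s)]

optional = full_matches(r'colou?r')      # zero or one "u"
zero_or_more = full_matches(r'colou*r')  # zero, one, or many "u"
one_or_more = full_matches(r'colou+r')   # one or many "u"

print(optional)      # ['color', 'colour']
print(zero_or_more)  # ['color', 'colour', 'colouur']
print(one_or_more)   # ['colour', 'colouur']
```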

\ : Use the literal meaning of a metacharacter, or introduce a special sequence such as \d (any digit)

import re

string = "Let's check for digits 12345678"

x = re.findall(r"\d", string)
print(x)

#output
#['1', '2', '3', '4', '5', '6', '7', '8']
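The backslash is also what you use to match a metacharacter literally. The sketch below, on a made-up string, compares an unescaped dot with an escaped one and uses re.escape(), which escapes the metacharacters in a literal string for you.

```python
import re

text = 'Price: 3.50 or 3x50?'

# Unescaped dot matches any character, so "3x50" matches too
loose = re.findall(r'3.50', text)

# Escaped dot matches only a literal "."
strict = re.findall(r'3\.50', text)

# re.escape() builds the escaped pattern from a plain string
escaped = re.findall(re.escape('3.50'), text)

print(loose)    # ['3.50', '3x50']
print(strict)   # ['3.50']
print(escaped)  # ['3.50']
```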

Sets

[abc] : Match any character enclosed in the brackets

import re

string = "Basics of abc and digits"

x = re.findall("[abc]", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#output
#['a', 'c', 'a', 'b', 'c', 'a']
#Yes

[^abc] : Match any character not enclosed in the brackets

import re

string = "Basics of abc and digits"

x = re.findall("[^abc]", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#output
#['B', 's', 'i', 's', ' ', 'o', 'f', ' ', ' ', 'n', 'd', ' ', 'd', 'i', 'g', 'i', 't', 's']
#Yes

[a-z] : Match the range of characters specified by the hyphen

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("[a-z]", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#output
#['e', 'y', 'i', 't', 'w', 'a', 's', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', 'm', 'e', 'e', 't', 'i', 'n', 'g', 'y', 'o', 'u']
#Yes

alnum = Uppercase and lowercase alphabetic characters and numbers : [A-Za-z0-9]

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("[A-Za-z0-9]", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#output
#['H', 'e', 'y', 'i', 't', 'w', 'a', 's', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', 'm', 'e', 'e', 't', 'i', 'n', 'g', 'y', 'o', 'u']
#Yes

alpha = Uppercase and lowercase alphabetic characters : [A-Za-z]

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("[A-Za-z]", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#output
#['H', 'e', 'y', 'i', 't', 'w', 'a', 's', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', 'm', 'e', 'e', 't', 'i', 'n', 'g', 'y', 'o', 'u']
#Yes

upper = Uppercase alphabetic characters : [A-Z]

import re

string = "Hey, it was pleasure meeting you"

x = re.findall("[A-Z]", string)

print(x)

if x:
  print("Yes")
else:
  print("No match")

#output
#['H']
#Yes
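On their own, sets match one character at a time, which is why the outputs above are lists of single letters. Combining a set with a quantifier such as + matches whole runs of characters. A minimal sketch on the same sentence, with illustrative variable names:

```python
import re

text = 'Hey, it was pleasure meeting you'

# One or more letters in a row: whole words
words = re.findall(r'[A-Za-z]+', text)

# An uppercase letter followed by lowercase letters: capitalized words
capitalized = re.findall(r'[A-Z][a-z]+', text)

print(words)        # ['Hey', 'it', 'was', 'pleasure', 'meeting', 'you']
print(capitalized)  # ['Hey']
```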

Special Sequences

\A : Returns a match if the specified characters are at the beginning of the string

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\Aso", string)
print(x)

#output
#[]

\b : Returns a match where the specified characters are at the beginning or at the end of a word

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\bit", string)
print(x)

#output
#['it']

\B : Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\Bit", string)
print(x)

#output
#[]
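The sentence used above contains "it" only as a separate word, so \B finds nothing. The following sketch uses a made-up string in which "it" also appears inside other words, to show \b and \B returning different results.

```python
import re

text = 'it sits on a kite'

# \b before "it": only where "it" starts a word
at_word_start = re.findall(r'\bit', text)

# \B before "it": only where "it" is preceded by another word character
inside_word = re.findall(r'\Bit', text)

# \b on both sides: "it" as a whole word
whole_word = re.findall(r'\bit\b', text)

print(at_word_start)  # ['it']
print(inside_word)    # ['it', 'it']
print(whole_word)     # ['it']
```

The two matches for \Bit come from "sits" and "kite".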

\Z : Returns a match if the specified characters are at the end of the string

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"pleasure\Z", string)
print(x)

#output
#[]

\W : Returns a match for every character that is NOT a word character

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\W", string)
print(x)

#output
#[',', ' ', ' ', ' ', ' ', ' ']

\w : Returns a match for every word character (for ASCII text: letters from a to z and A to Z, digits from 0 to 9, and the underscore _ character)

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\w", string)
print(x)

#output
#['H', 'e', 'y', 'i', 't', 'w', 'a', 's', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', 'm', 'e', 'e', 't', 'i', 'n', 'g', 'y', 'o', 'u']

\S : Returns a match for every character that is NOT a white space character

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\S", string)
print(x)

#output
#['H', 'e', 'y', ',', 'i', 't', 'w', 'a', 's', 'p', 'l', 'e', 'a', 's', 'u', 'r', 'e', 'm', 'e', 'e', 't', 'i', 'n', 'g', 'y', 'o', 'u']

\s : Returns a match for every white space character

import re

string = "Hey, it was pleasure meeting you"

x = re.findall(r"\s", string)
print(x)

#output
#[' ', ' ', ' ', ' ', ' ']

Conclusion

The re module is used for handling regular expressions in Python.
The match() function of the re module checks whether a substring at the beginning of the string matches your regexp pattern.
The result of the match() function is either None or a match object.
A match object converted to bool always equals True.
Regular expressions are case-sensitive by default.
The dot . matches any character except \n; the question mark ? means that the previous character is optional and can be missing from a string.
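As a closing sketch of the last point, the example below shows the dot skipping a newline by default and matching it once the re.DOTALL flag is passed. The sample string is made up for illustration.

```python
import re

text = 'a\nb axb'

# By default the dot does not match the newline character
default_dot = re.findall(r'a.b', text)

# With re.DOTALL the dot matches any character, including \n
dotall_dot = re.findall(r'a.b', text, flags=re.DOTALL)

print(default_dot)  # ['axb']
print(dotall_dot)   # ['a\nb', 'axb']
```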