A regular expression is a sequence of characters that defines a search pattern. It is written in a specific syntax and is usually used to search for that pattern in other text, or to check whether that text matches the pattern.
Python has a built-in module just for this, the re module.
Here are the functions that the re module offers to us:
findall: Returns a list with all the matches
search: Returns a Match object if a match was found
split: Returns a list of the string split at every match
sub: Substitutes all the matches with a string
Let's see these all in action.
Use the findall() function when you want to find all the matches of the pattern you have described:
import re
example = "I pledge allegiance."
# Collect every non-overlapping match of "le" as a list of strings
results = re.findall("le", example)
print(results)
['le', 'le']
The returned list contains le twice. If nothing is found, the returned list is empty. You can take the length of this list to get the number of matches found.
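The point about taking the length of the returned list can be sketched as follows. This is a minimal example; the variable names matches and no_matches are illustrative and not part of the re module.

```python
import re

example = "I pledge allegiance."

# findall returns every non-overlapping match as a list of strings
matches = re.findall("le", example)
print(len(matches))  # number of matches found

# A pattern that does not occur yields an empty list, which is falsy
no_matches = re.findall("xyz", example)
if not no_matches:
    print("no matches found")
```

Because an empty list is falsy, the result of findall can be tested directly in an if statement.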
The search() function searches the string for a match. It returns a Match object for the first match, or None if nothing matches:
import re
example = "I pledge allegiance."
# Find the first place where "le" occurs
results = re.search("le", example)
print(results)
<re.Match object; span=(3, 5), match='le'>
From the re.Match object, you can get the start index of the first match, like this:
import re
example = "I pledge allegiance."
results = re.search("le", example)
# start() gives the index where the first match begins
print(results.start())
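Since search returns None when nothing matches, calling start() on the result directly can fail. A minimal sketch of a safer pattern is shown below; the variable names result and missing are illustrative.

```python
import re

example = "I pledge allegiance."

result = re.search("le", example)
if result is not None:
    print(result.start())  # index where the first match begins
    print(result.end())    # index just past the end of the match
    print(result.group())  # the matched text itself

# No match: search returns None instead of a Match object
missing = re.search("xyz", example)
print(missing)
```

Checking for None first avoids an AttributeError when the pattern is absent from the string.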
The split() function returns a list of the pieces of the string, split at every match:
import re
example = "I pledge allegiance."
# Cut the string wherever "le" occurs; the matches themselves are dropped
results = re.split("le", example)
print(results)
['I p', 'dge al', 'giance.']
Pretty straightforward: it removes every match of the pattern and splits the string at those points.
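Two variations on split may be useful in practice: limiting the number of splits with the maxsplit argument, and keeping the matched text by wrapping the pattern in a capturing group. The sketch below reuses the same example string; the variable names are illustrative.

```python
import re

example = "I pledge allegiance."

# maxsplit=1 stops after the first split; the rest of the string stays intact
first_only = re.split("le", example, maxsplit=1)
print(first_only)

# Wrapping the pattern in a capturing group keeps the matched text in the result
with_matches = re.split("(le)", example)
print(with_matches)
```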
The sub() function substitutes every match with a string of your choice:
import re
example = "I pledge allegiance."
# Replace every occurrence of "le" with "ABC"
results = re.sub("le", "ABC", example)
print(results)
I pABCdge alABCgiance.
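If only some of the matches should be replaced, sub accepts a count argument, and the related subn function also reports how many replacements were made. A minimal sketch, with illustrative variable names:

```python
import re

example = "I pledge allegiance."

# count=1 replaces only the first match
first_replaced = re.sub("le", "ABC", example, count=1)
print(first_replaced)

# subn returns a tuple: (new string, number of replacements made)
all_replaced, replacements = re.subn("le", "ABC", example)
print(all_replaced, replacements)
```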
In addition to string literals, you can use special sequences in your regular expressions to make them more powerful.
Here is a list of the special sequences you can use:
.: Matches any character except a newline
\w: Matches an alphanumeric character (includes underscores)
\W: Matches a non-alphanumeric character (excludes underscores)
\b: Matches the boundary (an empty position, not a space) between a word character and a non-word character
\s: Matches a single whitespace character
\S: Matches a non-whitespace character
\t: Matches a tab
\n: Matches a newline
\r: Matches a carriage return
\d: Matches a digit
^: Matches the start of a string (no backslash; \^ matches a literal caret)
$: Matches the end of a string (no backslash; \$ matches a literal dollar sign)
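A few of the sequences above can be tried out as follows. This is a sketch under stated assumptions: the sample text and the variable names are made up for illustration. The patterns are written as raw strings (r"...") so that Python passes the backslashes through to the re module unchanged.

```python
import re

# Sample text invented for this example; "\t" here is a real tab character
text = "Order 42 shipped on 2024-01-15\tto Ann_Lee"

digits = re.findall(r"\d+", text)      # runs of one or more digits
words = re.findall(r"\w+", text)       # runs of letters, digits, and underscores
whole_word = re.findall(r"\bon\b", text)  # "on" only as a standalone word
parts = re.split(r"\s+", text)         # split on spaces and tabs alike

starts_with_order = re.search(r"^Order", text) is not None
ends_with_lee = re.search(r"Lee$", text) is not None

print(digits)
print(words)
print(whole_word)
print(parts)
print(starts_with_order, ends_with_lee)
```

The + after a sequence means "one or more", which is why \d+ returns whole numbers rather than single digits.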