Section 12.2 Character Matching in Regular Expressions
There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the period or full stop, which matches any character.
In the following example, the regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:” since the period characters in the regular expression match any character.
Checkpoint 12.2.1.
This code searches for lines that start with ‘F’, followed by two characters, followed by ‘m:’.
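A minimal sketch of such a search might look like the following. It assumes a small hypothetical list named sample_lines in place of a mail file, and it adds a ^ anchor so the pattern only matches at the start of a line:

```python
import re

# Hypothetical sample lines standing in for the contents of a mail file.
sample_lines = [
    "From: stephen.marquard@uct.ac.za",
    "Fxxm: placeholder text",
    "F12m: placeholder text",
    "Frm: only one character between F and m",
    "Subject: From: appears later in the line",
]

# ^ anchors the match at the start of the line; each period matches
# exactly one character, so F..m: needs two characters between F and m.
matches = [line for line in sample_lines if re.search(r'^F..m:', line)]

for line in matches:
    print(line)
```

Only the first three sample lines are printed. The fourth has a single character between ‘F’ and ‘m’, and in the fifth the text “From:” does not come at the start of the line.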
This is particularly powerful when combined with the ability to indicate that a character can be repeated any number of times using the * or + characters in your regular expression. These special characters mean that instead of matching a single character in the search string, the pattern matches zero or more characters (in the case of the asterisk) or one or more characters (in the case of the plus sign).
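The difference between the two can be sketched with a short comparison. The strings in candidates are made up for illustration. In Python's re module, the asterisk and plus sign apply to the item immediately before them, here the letter ‘o’:

```python
import re

# Hypothetical test strings with zero, one, and several 'o' characters.
candidates = ["Fm:", "Fom:", "Foooom:"]

# o* allows zero or more 'o' characters between F and m.
star_matches = [s for s in candidates if re.search(r'^Fo*m:', s)]

# o+ requires at least one 'o' between F and m.
plus_matches = [s for s in candidates if re.search(r'^Fo+m:', s)]

print(star_matches)  # ['Fm:', 'Fom:', 'Foooom:']
print(plus_matches)  # ['Fom:', 'Foooom:']
```

The only string the two patterns disagree on is “Fm:”, which has no ‘o’ at all.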
Checkpoint 12.2.2.
Which symbol in regex matches zero or more characters in the search string?
+
The plus sign is used to match one or more characters, not zero or more.
^
The caret is used to match characters at the beginning of a string.
*
Correct! The asterisk is used to match zero or more characters.
.
The period is used to match any character.
We can further narrow down the lines that we match using a repeated wild card character in the following example:
Checkpoint 12.2.3.
This code searches for lines that start with ‘From:’ and have an ‘@’ symbol.
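One way such a search could be written is sketched below, again using a hypothetical in-memory list (sample_lines) rather than a real mail file:

```python
import re

# Hypothetical sample lines; only the first should match.
sample_lines = [
    "From: stephen.marquard@uct.ac.za",
    "From: no address on this line",
    "To: csev@umich.edu",
    "From:@nothing-before-the-at-sign",
]

# ^From: anchors the literal text at the start of the line,
# .+ needs at least one character, and @ must appear after that.
matches = [line for line in sample_lines if re.search(r'^From:.+@', line)]

for line in matches:
    print(line)
```

The last sample line is rejected because .+ needs at least one character between the colon and the at-sign.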
The search string ^From:.+@ will successfully match lines that start with “From:”, followed by one or more characters (.+), followed by an at-sign. So this will match the following line:
From: stephen.marquard@uct.ac.za
You can think of the .+ wildcard as expanding to match all the characters between the colon character and the at-sign.
From:.+@
It is good to think of the plus and asterisk characters as “pushy” or “greedy”. For example, in the following string the match would extend to the last at-sign, because the .+ pushes outwards as far as it can:
From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu
It is possible to tell an asterisk or plus sign not to be so “greedy” by adding a question mark after it (*? or +?). See the detailed documentation for more information on turning off the greedy behavior.
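The effect can be sketched by running both forms against the line shown above. In Python's re module, a question mark placed after a plus sign or asterisk makes it match as few characters as possible. The variable names here are illustrative only:

```python
import re

line = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"

# Greedy: .+ grabs as much as it can, so the match ends at the LAST @.
greedy = re.search(r'^From:.+@', line).group()

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST @.
non_greedy = re.search(r'^From:.+?@', line).group()

print(greedy)      # From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @
print(non_greedy)  # From: stephen.marquard@
```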
Checkpoint 12.2.4.
Select all of the lines that will be printed when the following code is run. (\$ is used to match the character ‘$’.)
import re
hand = open('mbox-short-re2.txt')
for line in hand:
    line = line.rstrip()
    # \$ matches a literal dollar sign; .+ needs at least one character after it
    if re.search(r'\$.+', line):
        print(line)
It will cost you $1.00
Correct! There is a dollar sign followed by one or more characters.
From: stephen.marquard@uct.ac.za $
The .+ indicates that there need to be characters following the $.
$2.50 is your change
Correct. The dollar sign in this line is followed by more than one character.
Your change is two dollars and fifty cents.
Try again! There needs to be at least a $ in the line.
Checkpoint 12.2.5.
Looking at the line below, which part of it will be matched by the regular expression From:.+@?
From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu
From: stephen.marquard@
The match starts here, but the greedy .+ carries it past the first @ sign.
From: stephen.marquard@uct.ac.za, csev@
Remember the + and * characters in regex are pushy!
From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @
Correct! The + and * characters are greedy, so the match extends to the last @ sign in the line, not just the first.
From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu
The match stops at the last @ sign.