One trick a day: Make your regular expressions a hundred times more readable

Original link: https://www.kingname.info/2022/06/20/readable-re/

Regular expressions are powerful, but they read like a string of emoji. When I look at an expression I wrote a month ago, I can't even remember what it means. For example, the following:

pattern = r"((?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?)(.*?)(?=(?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?(?![^\w\s])|$)"

Is there any way to improve the readability of regular expressions? We know that one way to make code more readable is to write comments, so can regular expressions have comments too?

For example, take the following sentence:

msg = 'My name is Qingnan, my password is: 123kingname456, please keep it secret. '

I want to extract the password 123kingname456, so my regular expression might look like this:

pattern = ':(.*?),'

Could I write it like this instead?

pattern = '''
:      # start flag
(.*?)  # any characters after the start flag
,      # stop when a comma is encountered
'''

Written this way, the pattern is much clearer, and the purpose of each part is obvious.

But obviously, used directly, this pattern extracts nothing, because the line breaks, spaces, and comments are all treated as literal characters to match.
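A minimal sketch of that failure, reusing the `msg` and the commented pattern from above (the variable names `commented` and `plain` are only for illustration):

```python
import re

msg = 'My name is Qingnan, my password is: 123kingname456, please keep it secret. '

commented = '''
: # start flag
(.*?) # any characters after the start flag
, # stop when a comma is encountered
'''

# With no flags, the newlines, spaces and "# ..." text are literal
# characters, and msg contains no newline, so nothing can match.
print(re.search(commented, msg))  # None

# The compact pattern still works; the captured group keeps the space
# that follows the colon in msg, so strip it.
plain = re.search(':(.*?),', msg)
print(plain.group(1).strip())  # 123kingname456
```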

But while browsing the Python regular expression documentation today, I found something useful: the re.VERBOSE flag.

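With it, your regular expressions can carry comments: whitespace in the pattern is ignored, and everything from a # to the end of the line is treated as a comment (except inside a character class or after a backslash).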
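As a small sketch, here is the commented password pattern from earlier passed to `re.search` together with `re.VERBOSE`. The `.strip()` call and the `a b` pitfall lines are additions for illustration, not part of the original article:

```python
import re

msg = 'My name is Qingnan, my password is: 123kingname456, please keep it secret. '

pattern = '''
:      # start flag
(.*?)  # any characters after the start flag
,      # stop when a comma is encountered
'''

# re.VERBOSE drops the whitespace and comments before matching,
# so this behaves like the compact pattern ':(.*?),'.
match = re.search(pattern, msg, re.VERBOSE)
password = match.group(1).strip()  # drop the space after the colon in msg
print(password)  # 123kingname456

# Pitfall: a literal space must now be escaped or put in a character class.
print(re.fullmatch(r'a b', 'a b', re.VERBOSE))          # None: read as 'ab'
print(bool(re.fullmatch(r'a\ b', 'a b', re.VERBOSE)))   # True
print(bool(re.fullmatch(r'a[ ]b', 'a b', re.VERBOSE)))  # True
```

The same care applies to a literal `#`: outside a character class it starts a comment, so write it as `\#` or `[#]`.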

re.VERBOSE can also be written as re.X for short.
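A quick sketch of the equivalent spellings. Besides `re.X`, Python also accepts the inline flag `(?x)` at the very start of a pattern, and flags can be combined with `|`. The variable names here are only illustrative:

```python
import re

msg = 'My name is Qingnan, my password is: 123kingname456, please keep it secret. '

pattern = '''
:      # start flag
(.*?)  # any characters after the start flag
,      # stop when a comma is encountered
'''

print(re.X == re.VERBOSE)  # True: two names for the same flag

long_form = re.search(pattern, msg, re.VERBOSE).group(1)
short_form = re.search(pattern, msg, re.X).group(1)
# The inline flag has to be the first thing in the pattern string.
inline_form = re.search('(?x)' + pattern, msg).group(1)
print(long_form == short_form == inline_form)  # True

# Combining flags: verbose plus case-insensitive matching.
combined = re.search('PASSWORD  # matches "password" too', msg, re.X | re.I)
print(combined.group(0))  # password
```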

The complex regular expression at the beginning of this article becomes far more readable once it is split up and commented:

pattern = r"""
( # code (capture)
    # BEGIN multicode

    (?: \( \s* )? # maybe open paren and maybe space

    # code
    [A-Z]*H # prefix
    \d+ # digits
    [a-z]* # suffix

    (?: # maybe followed by other codes,
        \s* \+ \s* # ... plus-separated

        # code
        [A-Z]*H # prefix
        \d+ # digits
        [a-z]* # suffix
    )*

    (?: \s* [\):+] )? # maybe space and maybe close paren or colon or plus

    # END multicode
)

( .*? ) # message (capture): everything ...

(?= # ... up to (but excluding) ...
    # ... the next code

    # BEGIN multicode

    (?: \( \s* )? # maybe open paren and maybe space

    # code
    [A-Z]*H # prefix
    \d+ # digits
    [a-z]* # suffix

    (?: # maybe followed by other codes,
        \s* \+ \s* # ... plus-separated

        # code
        [A-Z]*H # prefix
        \d+ # digits
        [a-z]* # suffix
    )*

    (?: \s* [\):+] )? # maybe space and maybe close paren or colon or plus

    # END multicode

    # (but not when followed by punctuation)
    (?! [^\w\s] )

    # ... or the end
    | $
)
"""

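The commented version repeats the "multicode" fragment twice. One possible way to avoid that, sketched below, is to write the fragment once and interpolate it with an f-string. This is a suggestion rather than something from the original article: the names `MULTICODE`, `VERBOSE_PATTERN`, `COMPACT_PATTERN` and the sample string are hypothetical, and the character classes are assumed to be `[A-Z]` and `[a-z]`.

```python
import re

# One "multicode" unit, written once. The fragment ends with a newline so
# that its last comment cannot swallow text placed after it.
MULTICODE = r"""
    (?: \( \s* )?          # maybe open paren and maybe space
    [A-Z]*H \d+ [a-z]*     # code: prefix, digits, suffix
    (?:                    # maybe followed by other codes,
        \s* \+ \s*         # ... plus-separated
        [A-Z]*H \d+ [a-z]*
    )*
    (?: \s* [\):+] )?      # maybe space and close paren, colon or plus
"""

VERBOSE_PATTERN = re.compile(
    rf"""
    ( {MULTICODE} )        # code (capture)
    ( .*? )                # message (capture): everything ...
    (?=                    # ... up to (but excluding) ...
        {MULTICODE}        # ... the next code
        (?! [^\w\s] )      # (but not when followed by punctuation)
        | $                # ... or the end
    )
    """,
    re.VERBOSE,
)

# The one-line form, for comparison.
COMPACT_PATTERN = re.compile(
    r"((?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?)"
    r"(.*?)"
    r"(?=(?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*"
    r"(?:\s*[\):+])?(?![^\w\s])|$)"
)

sample = "H1: hello H2: world"  # hypothetical input, not from the article
print(VERBOSE_PATTERN.findall(sample))
print(VERBOSE_PATTERN.findall(sample) == COMPACT_PATTERN.findall(sample))  # True
```

Because the fragment contains no braces, it is safe to drop into an `rf"""..."""` string as is; a pattern that uses `{m,n}` quantifiers would need those braces doubled.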