REGEX2
These notes are mainly for my own understanding of regex: what it does and how it is used, predominantly in Python 3 but also in Bash shell scripts and the UNIX tools grep, awk and sed.
The name can be abbreviated in a number of ways, such as regex, regexp or regexes.
We will just call it "regex".
Regex (as per Wiki) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
Using a regex such as reg(ular expressions?|ex(p|es)?) we can search for "regular expression", "regular expressions", "regex", "regexp" and "regexes". One expression covers 5 variations; with a plain text search we would have had to run 5 separate searches.
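A minimal Python sketch of this idea, assuming the standard re module. The pattern used here has an optional final s (expressions?) so that the singular "regular expression" is covered too, and the list of phrases is just illustrative test data.

```python
import re

# One pattern covering five spellings. The "s?" makes the plural optional.
pattern = re.compile(r"reg(ular expressions?|ex(p|es)?)")

phrases = ["regular expression", "regular expressions", "regex", "regexp", "regexes"]

# fullmatch() only succeeds if the pattern matches the whole phrase.
results = {phrase: bool(pattern.fullmatch(phrase)) for phrase in phrases}

for phrase, matched in results.items():
    print(f"{phrase!r}: {matched}")
```

All five phrases should report True, while an unrelated word such as "regular" on its own should not match.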
In the opening statement I mentioned Python in particular. However, different programming languages and applications implement regular expressions slightly differently, and their syntaxes are not fully compatible with each other. So the regular expression engine used by JavaScript behaves, and is invoked, differently from the one used by Python. These variants are called "flavours" of regex.
We need to know how a regex engine works so that we can understand why a particular regex does not do what we intend it to do. Across all the different flavours, there are fundamentally two types of engine:
text-directed engines
regex-directed engines
Nearly all modern flavours are based on regex-directed engines, because only regex-directed engines can support very useful features such as "lazy quantifiers" and "backreferences" (we will come to those later).
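As a quick preview of those two features, here is a small sketch in Python, assuming the standard re module. The sample strings are made up for illustration, and both features are explained properly later.

```python
import re

text = "<b>bold</b>"

# Greedy: .+ grabs as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? grabs as little as possible, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group()

# Backreference: \1 must repeat whatever group 1 captured (a doubled letter).
doubled = re.search(r"(\w)\1", "hello").group()

print(greedy)   # <b>bold</b>
print(lazy)     # <b>
print(doubled)  # ll
```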
A text-directed engine walks through the subject string, attempting all permutations of the regex before advancing to the next character in the string. A text-directed engine never backtracks.
This is a very important point to understand: a regex engine always returns the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, does the engine continue with the second character in the text. Again, it tries all possible permutations of the regex, in exactly the same order. The result is that the regex engine returns the leftmost match.
When applying cat to "He captured a catfish for his cat.", the engine tries to match the first token in the regex, c, to the first character in the string, H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the c with the e. This fails too, as does matching the c with the space.
Arriving at the 4th character in the string, c matches c. The engine then tries to match the second token, a, to the 5th character, a. This succeeds too. But then t fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it continues with the 5th: a. Again, c fails to match here and the engine carries on.
At the 15th character in the string, c again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that a matches a and t matches t.
The entire regular expression could be matched starting at character 15. The engine is “eager” to report a match. It therefore reports the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.
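The walk-through above can be sketched in Python. The helper leftmost_literal_match is hypothetical and only handles literal text; it is not how the re module is implemented, just a way to see the "try each start position, report the first success" behaviour next to the real re.search result.

```python
import re

text = "He captured a catfish for his cat."


def leftmost_literal_match(literal, subject):
    """Hypothetical helper: try each start position in turn, left to right."""
    for start in range(len(subject)):
        if subject.startswith(literal, start):
            return start  # eager: report the first position that works
    return None


start = leftmost_literal_match("cat", text)
match = re.search("cat", text)

print(start + 1)              # 15 (1-based position, as counted in the text)
print(match.span())           # (14, 17) (0-based, end exclusive)
print(text[start:start + 7])  # catfish
```

Python counts positions from 0, so the "15th character" in the prose is index 14 here.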
In this first example of the engine’s internals, our regex engine simply appears to work like a regular text search routine. However, it is important that you can follow the steps the engine takes in your mind. In following examples, the way the engine works has a profound impact on the matches it finds. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.
The most basic regex is a single literal character such as a. In the string "Jack is a boy" it matches the first 'a', the one after the 'J'. To match the other 'a' we need to tell the regex engine to carry on searching. In JavaScript this is done with a flag, as in /a/g, where g means global, i.e. find every match rather than stopping at the first. In Python we would use re.findall.
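A short Python sketch of the difference between stopping at the first match and collecting every match, assuming the standard re module:

```python
import re

text = "Jack is a boy"

first = re.search("a", text)    # stops at the first match
every = re.findall("a", text)   # returns all non-overlapping matches
positions = [m.start() for m in re.finditer("a", text)]

print(first.start())  # 1
print(every)          # ['a', 'a']
print(positions)      # [1, 8]
```

re.finditer is used here only to show where each match starts; re.findall on its own returns just the matched text.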
Similarly, the regex cat matches 'cat' in "About cats and dogs". The regular expression consists of a series of 3 literal characters. The engine is being told: find a 'c' immediately followed by an 'a', immediately followed by a 't'. (Regex is case sensitive unless you stipulate otherwise.)
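To see the case sensitivity in Python, here is a small sketch using the re.IGNORECASE flag. The capitalised sample string is my own variation on the example above.

```python
import re

lower = re.search("cat", "About cats and dogs")
sensitive = re.search("cat", "About Cats and dogs")
insensitive = re.search("cat", "About Cats and dogs", re.IGNORECASE)

print(lower.span())         # (6, 9)
print(sensitive)            # None, because "C" is not "c"
print(insensitive.group())  # Cat
```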
There are 12 characters with special meanings, called "METACHARACTERS". These metacharacters add more functionality and meaning when matching strings. They are:

METACHARACTER         DESCRIPTION                                                   EXAMPLE
\ Backslash           Signals a special sequence (also escapes special characters)  \d
^ Caret               Starts with                                                   ^hello
$ Dollar sign         Ends with                                                     world$
. Period              Any character (except the newline character)                 h..o
| Pipe                Either or                                                     stay|go
? Question mark       Preceding character is optional                               colou?r (matches color and colour)
* Asterisk            Zero or more occurrences                                      aix*
+ Plus                One or more occurrences                                       aix+
( ) Parentheses       Capture and group
[ ] Square brackets   A set of characters                                           [a-m]
{ } Curly braces      The specified number of occurrences (here, 2 to 4)            {2,4}
If you want to use any of these special characters as a literal, you need to escape it with a backslash. For example, if you want to match 1+1=2, you need to escape the + (plus) sign, so the regex becomes 1\+1=2
In most flavours the opening brace { is treated as a literal character unless it is part of a quantifier such as {2,3}, and the closing square bracket ] is a literal unless it closes a character set [ ], so you don't normally have to escape them.
The backslash should not be used to escape any other characters, because a backslash followed by an ordinary character can form a regex token, like \d, which matches a single digit from 0 to 9.
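The escaping rule can be checked in Python with a short sketch. re.escape is a standard-library helper that adds the backslashes for you; using it here is my addition, not something these notes rely on elsewhere.

```python
import re

text = "1+1=2"

unescaped = re.search(r"1+1=2", text)   # here + is a quantifier: "one or more 1s"
escaped = re.search(r"1\+1=2", text)    # here \+ is a literal plus sign

# re.escape() backslash-escapes the special characters in a piece of text.
auto = re.search(re.escape(text), text)

# The unescaped pattern matches something quite different instead.
surprise = re.search(r"1+1=2", "111=2")

print(unescaped)         # None
print(escaped.group())   # 1+1=2
print(auto.group())      # 1+1=2
print(surprise.group())  # 111=2
```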
[ ] Character Classes
Here you tell the regex engine to match only ONE out of several characters. If you use gr[ae]y, this matches either 'gray' or 'grey'.
To specify a range we use a hyphen: [0-9] matches a SINGLE digit.
We can also have multiple ranges: [a-zA-Z0-9] matches a SINGLE character that is either a lower-case letter, an upper-case letter, or a digit between 0 and 9.
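A small Python sketch of these character classes, using made-up sample strings:

```python
import re

# One of several characters: a or e in the middle position.
grey_or_gray = re.findall(r"gr[ae]y", "gray grey griy")

# A range: each match is a SINGLE digit.
single_digits = re.findall(r"[0-9]", "a1b2")

# Several ranges in one class: a single letter or digit per match.
letters_or_digits = re.findall(r"[a-zA-Z0-9]", "a-B_3!")

print(grey_or_gray)       # ['gray', 'grey']
print(single_digits)      # ['1', '2']
print(letters_or_digits)  # ['a', 'B', '3']
```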
[^ ] Negated Character Classes
In most regex flavours the only special characters or metacharacters inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. The usual metacharacters (see above) are normal characters inside a character class and do not need to be escaped by a backslash. To search for a star or a plus, use [*+]. Your regex will work fine if you escape regular metacharacters INSIDE a class, but doing so will significantly reduce readability.
If you want to include a backslash as a normal character, you need to escape it with another backslash: [\\x] matches a backslash OR an x.
The closing bracket ], the caret ^, and the hyphen - can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
To include an unescaped caret as a literal, place it anywhere except right after the opening bracket (where it would mean "negate"): [x^] matches either an x or a caret.
You can include an unescaped closing bracket by placing it right after the opening bracket, or right after the negating caret: []x] matches a closing bracket or an x, and [^]x] matches any character that is not a closing bracket or an x.
The hyphen can be included right after the opening bracket, right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match a hyphen or an x. Both [^-x] and [^x-] match any character that is NOT a hyphen and NOT an x.
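These placement rules can be checked against Python's re module with a short sketch. Other flavours may differ on some of them, so treat this as a check for Python only.

```python
import re

star_or_plus = re.findall(r"[*+]", "a*b+c")       # no escaping needed in a class
backslash_or_x = re.findall(r"[\\x]", "a\\x")     # \\ inside the class is one backslash
x_or_caret = re.findall(r"[x^]", "x^y")           # caret not first, so it is literal
bracket_or_x = re.findall(r"[]x]", "a]x")         # ] right after [ is literal
not_bracket_or_x = re.findall(r"[^]x]", "a]x")    # ] right after ^ is literal
hyphen_first = re.findall(r"[-x]", "a-x")         # hyphen first is literal
hyphen_last = re.findall(r"[x-]", "a-x")          # hyphen last is literal
not_hyphen_or_x = re.findall(r"[^-x]", "a-x")     # hyphen right after ^ is literal

print(star_or_plus)      # ['*', '+']
print(backslash_or_x)    # ['\\', 'x']
print(x_or_caret)        # ['x', '^']
print(bracket_or_x)      # [']', 'x']
print(not_bracket_or_x)  # ['a']
print(hyphen_first)      # ['-', 'x']
print(hyphen_last)       # ['-', 'x']
print(not_hyphen_or_x)   # ['a']
```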
If you repeat a character class by using the ?, * or + operators, you are repeating the entire character class. You are not repeating just the character that it matched. [0-9]+ can match 837, where every digit is different, as well as a run of one repeated digit.
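A Python sketch of the difference. The second pattern uses a backreference, which is only covered later, so it is shown here purely as a contrast: it repeats the character that was matched rather than the whole class.

```python
import re

# Repeating the class: each digit may be different.
any_digits = re.fullmatch(r"[0-9]+", "837")

# Repeating the matched character: \1 must equal the digit captured by ( ).
same_digit_837 = re.fullmatch(r"([0-9])\1+", "837")
same_digit_222 = re.fullmatch(r"([0-9])\1+", "222")

print(any_digits.group())      # 837
print(same_digit_837)          # None
print(same_digit_222.group())  # 222
```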
You can use special character sequences to put non-printable characters in your regex
Carriage return means to return to the beginning of the current line without advancing downward. It is commonly escaped as \r
Line feed (new line) means to advance downward to the next line. It is commonly escaped as \n
Form feed means to advance downward to the next "page". It was commonly used as a page separator, but is now also used as a section separator. (It is occasionally used in source code to divide logically independent functions or groups of functions.) Text editors can use this character when you "insert a page break". It is commonly escaped as \f
Tab is escaped as \t, which matches a tab character
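A brief Python sketch of these escapes, with an invented sample string. The patterns are written as raw strings (r"..."), so it is the regex engine, not Python's string syntax, that interprets \t, \r, \n and \f.

```python
import re

text = "col1\tcol2\r\nline2\fpage2"

# Find every tab, carriage return, line feed and form feed, in order.
found = re.findall(r"[\r\n\f\t]", text)

# Split a tab-separated line into its columns.
columns = re.split(r"\t", "a\tb\tc")

print(found)    # ['\t', '\r', '\n', '\x0c']  (\x0c is how Python displays \f)
print(columns)  # ['a', 'b', 'c']
```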
CHARACTER   DESCRIPTION
[ ]         A set of characters
\           Signals a special sequence (also used to escape special characters)
.           Any character except new line
A regex-directed engine walks through the regex, attempting to match the next token in the regex to the next character. If a match is found, the engine advances through the regex and the subject string. If a token fails to match, the engine backtracks to a previous position in the regex and the subject string where it can try a different path through the regex. We will talk a lot more about backtracking later on. Modern regex flavours using regex-directed engines have lots of features, such as atomic grouping and possessive quantifiers, that allow you to control this backtracking.
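Backtracking is easiest to see from its results. In this Python sketch (sample strings are my own), the greedy part of each pattern first grabs too much and the engine has to give characters back before the rest of the pattern can match.

```python
import re

# .* first runs to the end of the string, then backs off until a "b" follows.
letters = re.search(r"a.*b", "aXbYc")

# \d+ first takes all five digits, then gives back the final 5 for the literal 5.
numbers = re.search(r"\d+5", "12345")

print(letters.group())  # aXb
print(numbers.group())  # 12345
```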
Typing a caret after the opening square bracket negates the character class: it matches any character that is NOT in the class.
Unlike the . (period), negated character classes also match (invisible) line break characters. If you DON'T want a negated class to match line breaks, include the line break characters in the class: [^0-9\r\n] matches any character that is NOT a digit or a line break.
q[^u] does NOT mean "a q not followed by a u" but rather "a q followed by a character that is NOT a u". It does not match the q in the string "Iraq". It DOES match the q AND the space after the q in "Iraq is a country". The SPACE becomes part of the overall match because it is the "character that is NOT a u" that is matched by the negated character class. If you want the regex to match the q, and only the q, in both strings, you need to use q(?!u). But we will get to that later.
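The behaviour of negated classes described above can be verified in Python with this sketch, using the same Iraq strings plus a couple of short invented ones for the line break cases.

```python
import re

# A negated class must still match ONE character, so a final q cannot match.
end_of_string = re.search(r"q[^u]", "Iraq")

# Here the space after the q is consumed as the "character that is not a u".
with_space = re.search(r"q[^u]", "Iraq is a country").group()

# q(?!u) checks what follows without consuming it.
lookahead = re.search(r"q(?!u)", "Iraq").group()

# The period skips the newline, a negated class does not, unless \r\n are listed.
dot = re.findall(r".", "a\nb")
negated = re.findall(r"[^0-9]", "a\nb")
negated_no_breaks = re.findall(r"[^0-9\r\n]", "a\nb1")

print(end_of_string)      # None
print(repr(with_space))   # 'q '
print(lookahead)          # q
print(dot)                # ['a', 'b']
print(negated)            # ['a', '\n', 'b']
print(negated_no_breaks)  # ['a', 'b']
```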