REGEX
Allows us to check a series of characters for 'matches'
I will put the "How To" setup process in here. For the time being, I want to set up Atom for RegEx checking.
From the GitHub files (Corey Schafer) above, find simple.txt and snippets.txt and open them both up. Drag one file into the side of the other and a side window will open up.
We will notice that some expressions come as a lowercase and uppercase pair. The uppercase version negates what the lowercase one does. For example, \w matches any word character (a-z, A-Z, 0-9 and the underscore _), while \W matches anything that is NOT a word character.
Type in any of the above expressions (here I am trying the period, or dot) and we can see it matches any character except a newline.
If we wanted to search for an actual dot we would have to 'escape' it. This is done the same way as in Python, with a backslash \ in front of the dot: \.
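Here is a small Python sketch of the difference between the plain dot and the escaped dot. The sample text is made up for illustration and is not taken from simple.txt.

```python
import re

# Made-up sample text (an assumption, not the tutorial file).
text = "example.com\ncat"

# An unescaped dot matches every character except the newline.
any_char = re.findall(r".", text)

# An escaped dot matches only a literal period.
literal_dot = re.findall(r"\.", text)

print(len(any_char))
print(literal_dot)
```

The first search returns every character in the text apart from the newline, while the second only finds the single period.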
\d matches any digit
\D matches anything that is not a digit
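A quick way to see the lowercase and uppercase pair side by side in Python. The sample string is made up for illustration.

```python
import re

# Made-up sample string (an assumption for illustration).
sample = "Call 555-1234"

digits = re.findall(r"\d", sample)      # every digit
non_digits = re.findall(r"\D", sample)  # everything that is not a digit

print("".join(digits))
print("".join(non_digits))
```

Between them the two lists cover every character of the string exactly once, which is what "negates" means here.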
Anchors such as \b and \B don't match any characters but rather match invisible positions before or after characters
With \bHa there is a word boundary at the start of the first Ha and another boundary after the white space, so a Ha is matched at each of those positions.
If we placed the \b AFTER Ha (Ha\b) we again get two matches: the Ha at the beginning and the Ha at the end, because each of them is followed by a word boundary.
\B is NOT a word boundary. \BHa only matches a Ha that does not start at a word boundary, and Ha\B only matches a Ha that is not followed by one.
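To check the word boundary behaviour outside of Atom, here is a rough Python sketch. It assumes a test string of "Ha HaHa"; the actual line in the tutorial file may differ. Each list holds the start position of every match.

```python
import re

# Assumed test string; the tutorial's own sample may differ.
text = "Ha HaHa"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

boundary_before = starts(r"\bHa")     # Ha at the start of a word
boundary_after = starts(r"Ha\b")      # Ha at the end of a word
no_boundary_before = starts(r"\BHa")  # Ha NOT at the start of a word
no_boundary_after = starts(r"Ha\B")   # Ha NOT at the end of a word

print(boundary_before, boundary_after)
print(no_boundary_before, no_boundary_after)
```

With this string, \bHa finds the Ha at positions 0 and 3, Ha\b finds the ones at 0 and 5, and the \B versions pick up exactly the Ha that the matching \b version skipped.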
1) * ------> 0 or more
2) + ------> 1 or more
3) ? ------> 0 or 1
4) {3} ----> Exact number, 3 in this case
5) {3,4} --> Range of numbers (min,max)
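The quantifiers in the list can be tried out in Python as follows. This is only a sketch: the helper name matches is made up, and re.fullmatch is used so that the whole string has to fit the pattern. Note that the range is written with a comma, {3,4}.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text fits pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

zero_or_more = matches(r"\d*", "")   # * accepts zero digits
one_or_more = matches(r"\d+", "")    # + needs at least one digit
optional = matches(r"-?\d", "7") and matches(r"-?\d", "-7")  # ? is 0 or 1
exact = matches(r"\d{3}", "123") and not matches(r"\d{3}", "1234")
ranged = matches(r"\d{3,4}", "1234") and not matches(r"\d{3,4}", "12")

print(zero_or_more, one_or_more, optional, exact, ranged)
```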
OK - Let's say we wanted to match a couple of phone numbers.
We can't just type in a literal search like we did at the beginning; we now have to match a PATTERN. From the telephone numbers 321-555-4321 and 123.555.1234 we can see that we have sets of 3, 3 and 4 digits separated by either a dash or a period (dot). For this example we won't be able to use literal characters; we need the meta-characters instead.
We "could" write the meta-characters out one at a time and create something like \d\d\d.\d\d\d.\d\d\d\d
The three \d's match three digits in a row, followed by a period (any character, which in this case covers a period or a hyphen), then another 3 digits, any character, and so on.
Let's now try to match the period and hyphen exactly instead of using the unescaped period (which matches ANY character)
To match exactly we need to use the "character set" [ ], with the allowed characters listed between the 2 square brackets
\d\d\d[- .]\d\d\d[- .]\d\d\d\d
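To compare the loose version with the character set version, here is a short Python sketch. The first two numbers come from the notes above; the third one, with asterisks, is an invented example of something we do not want to match.

```python
import re

# The last number is a made-up negative example.
text = "321-555-4321\n123.555.1234\n123*555*1234"

# Unescaped dot: any character is accepted as the separator.
loose = re.findall(r"\d\d\d.\d\d\d.\d\d\d\d", text)

# Character set: only a hyphen, a space or a period is accepted.
strict = re.findall(r"\d\d\d[- .]\d\d\d[- .]\d\d\d\d", text)

# Same idea written with quantifiers instead of repeated \d.
strict_short = re.findall(r"\d{3}[- .]\d{3}[- .]\d{4}", text)

print(loose)
print(strict)
```

The loose pattern picks up all three numbers, while the character set version leaves out the one separated by asterisks.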
A QUICK SIDETRACK:
Let's say we wanted to match only 800 and 900 numbers such as 800-555-4321 and 900-555-4321:
[8-9]00-\d{3}-\d{4}
To also match other numbers beginning with 8 or 9 (812, 876, 977, 943 etc), with either a dash or a period as the separator:
[89][0-9]+[.-]\d{3}[.-]\d{4}
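A rough Python check of the two sidetrack patterns. The numbers in the list are sample data chosen for illustration. One thing worth noticing is that [0-9]+ means "1 or more" digits, so it also accepts area codes that are too long; {2} is a tighter alternative and is my own suggestion, not something from the tutorial.

```python
import re

# Sample numbers chosen for illustration.
numbers = ["800-555-4321", "900-555-4321", "812.555.1234", "700-555-4321"]

only_800_900 = [n for n in numbers
                if re.fullmatch(r"[8-9]00-\d{3}-\d{4}", n)]
starts_8_or_9 = [n for n in numbers
                 if re.fullmatch(r"[89][0-9]+[.-]\d{3}[.-]\d{4}", n)]

# + allows any number of digits after the 8 or 9, {2} allows exactly two.
too_long = "81234-555-4321"
plus_accepts = re.fullmatch(r"[89][0-9]+[.-]\d{3}[.-]\d{4}", too_long) is not None
exact_accepts = re.fullmatch(r"[89][0-9]{2}[.-]\d{3}[.-]\d{4}", too_long) is not None

print(only_800_900)
print(starts_8_or_9)
print(plus_accepts, exact_accepts)
```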
Another sidetrack:
cat
mat
pat
bat
Let's say we want to match every word ending with at, but not bat.
We can use a character set [ ] with a caret. REMEMBER, a caret at the start of [ ] means "match any character NOT in the character set", so:
[^b]at
- [^b] matches any single character except b, so bat is skipped but all the other 3 letter words ending in at are matched
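The same negated character set can be tried in Python with the four words from the list above:

```python
import re

words = "cat\nmat\npat\nbat"

# [^b] is any single character except b, followed by the literal "at".
not_bat = re.findall(r"[^b]at", words)

print(not_bat)
```

Only cat, mat and pat come back. Keep in mind that [^b] accepts ANY other character, not just letters, so something like 9at would also match.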
Let's say we want to match the names below, including the whitespace, where some have a period after the title and others do not:
We're going to start by matching the Mr first:
Mr\.
This highlights Mr. Schafer and Mr. T but not Mr Smith, as there is no period after his title.
Mr\.?
The question mark ? is saying "we can have 0 periods there or 1". This now also highlights Mr Smith.
Mr\.?\s[A-Z]
The \s matches the white space after the title (with or without its period), and then we add [A-Z] to match the first uppercase letter of the surname. In Mr. T's case that is already the whole surname, but we still need to match the remaining characters of the other surnames. If we add \w+ to the end:
Mr\.?\s[A-Z]\w+ only Mr. Schafer and Mr Smith are matched, not Mr. T
This is because of the \w+ we added. The + means "1 or more" word characters (the \w), and Mr. T has none after the T. Instead we use the *, which means "0 or more" word characters:
Mr\.?\s[A-Z]\w*
OK, we still need to match the Ms and Mrs names.
Let's use a GROUP ( ), which allows us to match several different patterns separated by | :
M(r|s|rs)\.?\s[A-Z]\w*
M(r|s|rs)\.?\s[A-Z]?\w+
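The steps above can be replayed in Python to see which names each version of the pattern accepts. This is only a sketch: the names list is assumed (the Ms and Mrs entries in particular are placeholders), and the helper name matched is made up.

```python
import re

# Assumed sample names; the Ms and Mrs entries are placeholders.
names = ["Mr. Schafer", "Mr Smith", "Ms Davis", "Mrs. Robinson", "Mr. T"]

def matched(pattern):
    """Return the names that fit pattern from start to end."""
    return [n for n in names if re.fullmatch(pattern, n)]

one_or_more = matched(r"Mr\.?\s[A-Z]\w+")          # needs 2+ letters in the surname
zero_or_more = matched(r"Mr\.?\s[A-Z]\w*")         # a single letter is enough
with_group = matched(r"M(r|s|rs)\.?\s[A-Z]\w*")    # Mr, Ms or Mrs

print(one_or_more)
print(zero_or_more)
print(with_group)
```

With \w+ Mr. T drops out, with \w* he comes back, and the group version accepts all five names.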
[a-zA-Z0-9.-]+@[a-zA-Z-]+.[\w.]+
I also changed the [a-zA-Z] for \w
[\w.-]+@[\w-]+\.[\w.]+ --------> I think mine is better than Corey's - Yeah baby :)
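A small Python sketch to compare the two email patterns above. The addresses are made-up samples, not from the tutorial files. One thing it suggests: in the first pattern the dot after the domain is not escaped, so it matches any character, while the \w version escapes it and insists on a real period.

```python
import re

first = r"[a-zA-Z0-9.-]+@[a-zA-Z-]+.[\w.]+"
second = r"[\w.-]+@[\w-]+\.[\w.]+"

# Made-up sample addresses (assumptions for illustration).
good = ["jane.doe@example.com", "jane-321-doe@my-work.net"]
no_dot = "jane@examplecom"  # no period after the domain at all

first_good = [bool(re.fullmatch(first, a)) for a in good]
second_good = [bool(re.fullmatch(second, a)) for a in good]

# The unescaped dot lets the address without a period slip through.
first_no_dot = bool(re.fullmatch(first, no_dot))
second_no_dot = bool(re.fullmatch(second, no_dot))

print(first_good, second_good)
print(first_no_dot, second_no_dot)
```

Both patterns accept the two well-formed addresses, but only the first one accepts the address with no period in the domain.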
Now let's say we want to capture just the domain name and TLD (like google.com) without the bumf of the http, the s, the www etc.
We can divide the regular expression we created into groups. We do this by using the ( ) parentheses. The regular expression will now look like:
https?://(www\.)?(\w+)(\.\w+)
Now we have 3 groups
Group 1 - the optional www
Group 2 - the domain name
Group 3 - the TLD
We also HAVE another group, GROUP 0, which is an implicit group and holds everything we matched, which is the ENTIRE URL (http, :, // plus groups 1, 2 and 3)
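In Python the same groups can be read from a match object with group(). A short sketch, using google.com from above plus one made-up URL without the www:

```python
import re

pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

with_www = pattern.search("https://www.google.com")
without_www = pattern.search("http://example.net")  # made-up sample URL

# Group 0 is the entire match, groups 1 to 3 are the parentheses in order.
print(with_www.group(0))
print(with_www.group(1), with_www.group(2), with_www.group(3))

# The optional group 1 did not take part in this match, so it is None.
print(without_www.group(1), without_www.group(2), without_www.group(3))
```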
A back reference is a reference to one of our captured groups. ATOM has the ability to "REPLACE" our matches, and the replacement text can use back references:
Let's reference the groups:
1st group (www\.)
as Group 1: $1
- Here $1 = the 1st group etc
2nd group (\w+)
as Group 2: $2
3rd group (\.\w+)
as Group 3: $3
Most tools write a back reference with a backslash, \1,
but ATOM uses a $ sign, $1
So with Group 1: $1 the output is above, which includes the URLs that do not have the www as well as the ones that DO have it. Groups 2 and 3 are below:
Using the groups, we can convert the URLs to a cleaned up version without the http or www etc by just replacing each match with the domain name (group 2) and TLD (group 3), which in ATOM is $2$3
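The same clean-up can be sketched in Python with re.sub. Python uses the backslash style, so the replacement is \2\3 rather than ATOM's $2$3. The URL list is made up for illustration, apart from google.com.

```python
import re

# Made-up sample URLs, one per line.
urls = "https://www.google.com\nhttp://example.net\nhttps://www.example.org"

pattern = r"https?://(www\.)?(\w+)(\.\w+)"

# Replace every match with group 2 (domain name) followed by group 3 (TLD).
cleaned = re.sub(pattern, r"\2\3", urls)

print(cleaned)
```

Each line is reduced to just the domain name and TLD, whether or not the original URL had the www.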