REGEX

Regular expressions (regex) allow us to search a series of characters for 'matches' to a pattern.

RESOURCES:

SETUP - ATOM & REGEX 101

ATOM

The full "how to" setup process will go in here. For the time being, I want to set up Atom for regex checking.

From the GitHub file (Corey Schafer) above, find simple.txt and snippets.txt and open them both. Drag one file onto the side of the other and a side-by-side pane will open.

REGEX 101

Some Basic Expressions

Notice that for expressions that come in a lowercase and an uppercase form, the uppercase version negates what the lowercase one does. For example, \w matches all word characters (a-z, A-Z, 0-9 and the underscore _), while \W matches anything that is not a word character.

Character Classes

Period (.)

Type in any of the above expressions (here I am trying the period, or dot) and we can see that it matches any character except a newline.

If we wanted to search for an actual "dot" we would have to 'escape' it. This is done the same way as in Python, with a backslash \ in front of the dot: \.
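The difference between the bare dot and the escaped dot can be checked in Python's `re` module. This is only a small sketch; the sample text is made up for illustration and is not one of the tutorial files.

```python
import re

# Hypothetical sample text: two lines separated by a newline.
text = "coreyms.com\n321-555-4321"

# Unescaped dot: matches every character except the newline.
any_char = re.findall(r".", text)

# Escaped dot: matches only a literal period.
literal_dots = re.findall(r"\.", text)

print(len(text), len(any_char))  # 24 23 (the newline is skipped)
print(literal_dots)              # ['.']
```

In Atom or regex101 the same two expressions should highlight everything on each line versus only the single period.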

\d - Digit (0-9) and \D

\d matches any digit

\D matches anything that is NOT a digit

\w and \W - Word Character and Not a word Character

\s Whitespace (space, tab, newline) and \S NOT Whitespace
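To see each lowercase/uppercase pair side by side, here is a minimal Python sketch. The sample string is an assumption chosen to contain letters, an underscore, digits, a space and a newline.

```python
import re

# Hypothetical sample: letters, underscore, space, digits, newline.
text = "Mr_T 42\n"

digits = re.findall(r"\d", text)          # ['4', '2']
non_digits = re.findall(r"\D", text)      # everything that is not 0-9
word_chars = re.findall(r"\w", text)      # letters, digits and underscore
non_word = re.findall(r"\W", text)        # [' ', '\n']
whitespace = re.findall(r"\s", text)      # [' ', '\n']
non_whitespace = re.findall(r"\S", text)  # everything else

print(digits, word_chars, non_word)
```

Each pair splits the text cleanly in two: every character is matched by exactly one of \d / \D, one of \w / \W, and one of \s / \S.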

Anchors

They don't match any characters but rather match invisible positions before or after characters

\b = word boundary

\bHa matches a Ha that starts at a word boundary: there is a word boundary at the start of the first Ha and another one after the whitespace.

If we placed the \b AFTER Ha (Ha\b), we again get two matches: the Ha at the beginning and the Ha at the end, because each of those is followed by a word boundary.

\B = NOT a word boundary

\B matches a position that is NOT a word boundary. \BHa matches a Ha that does not start at a word boundary, and Ha\B matches a Ha that is not followed by one.

^ Caret - Beginning of a String and $ End of a string
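The anchors are easier to follow with the match positions printed out. The sketch below assumes a sample line of `Ha HaHa` (the notes do not show the exact text, so treat it as illustrative) and records the start index of every match.

```python
import re

# Assumed sample text; indices: H=0 a=1 space=2 H=3 a=4 H=5 a=6
text = "Ha HaHa"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

boundary_before = starts(r"\bHa")     # [0, 3]
boundary_after = starts(r"Ha\b")      # [0, 5]
no_boundary_before = starts(r"\BHa")  # [5]
no_boundary_after = starts(r"Ha\B")   # [3]

at_string_start = starts(r"^Ha")      # [0]
at_string_end = starts(r"Ha$")        # [5]

print(boundary_before, boundary_after, no_boundary_before, no_boundary_after)
```

None of the anchors consume characters; they only constrain where the `Ha` is allowed to sit.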

Quantifiers

1) * ------> 0 or more
2) + ------> 1 or more
3) ? ------> 0 or one
4) {3} ----> Exact number, 3 in this case
5) {3,4} --> Range of numbers {min,max}
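A quick way to check each quantifier is to test it against strings of increasing length. This is a sketch with made-up samples; `matching` is a hypothetical helper, and note that in Python the range form is written with a comma, `{3,4}`.

```python
import re

# Hypothetical samples: zero to five repetitions of "a".
samples = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"a*")         # 0 or more: every sample
plus = matching(r"a+")         # 1 or more: everything except ""
optional = matching(r"a?")     # 0 or one:  "" and "a"
exact = matching(r"a{3}")      # exactly 3: "aaa"
ranged = matching(r"a{3,4}")   # 3 to 4:    "aaa" and "aaaa"

print(star, plus, optional, exact, ranged, sep="\n")
```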

Practical Examples

OK - let's say we wanted to match a couple of phone numbers.

We can't just type in a literal search like we did at the beginning; we now have to match a PATTERN. From the telephone numbers 321-555-4321 and 123.555.1234 we can see that we have sets of 3, 3 and 4 digits separated by either a dash or a period (dot). For this example we won't be able to use literal characters, but rather the meta-characters.

We "could" use the meta-characters to create something like \d\d\d.\d\d\d.\d\d\d\d

Each run of three \d's matches three digits in a row, followed by an unescaped period (any character, which in this case covers a period or a hyphen), then another 3 digits, any character, and so on.

Let's now try to match the period and hyphen exactly instead of using the unescaped period (which matches ANY character).

To match them exactly we need a "character set", written between two square brackets [ ]:
\d\d\d[- .]\d\d\d[- .]\d\d\d\d
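The two phone-number patterns can be compared in Python. The first two numbers come from the notes; the third, with `*` separators, is an invented example added to show what the unescaped dot lets through.

```python
import re

# First two numbers are from the notes; the third is a made-up
# counter-example with "*" as the separator.
text = "321-555-4321\n123.555.1234\n123*555*1234"

# Unescaped dot: any character is accepted as the separator.
loose = re.findall(r"\d\d\d.\d\d\d.\d\d\d\d", text)

# Character set: only a hyphen, a space or a period is accepted.
strict = re.findall(r"\d\d\d[- .]\d\d\d[- .]\d\d\d\d", text)

print(loose)   # all three numbers
print(strict)  # ['321-555-4321', '123.555.1234']
```

Inside a character set the dot is literal, so it does not need escaping there.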
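- A QUICK SIDETRACK:
  - Let's say we wanted to match only numbers beginning with 800 or 900, such as 800-555-4321 and 900-555-4321:
  - [8-9]00-\d{3}-\d{4}
  - To also cover other prefixes beginning with 8 or 9 (812, 876, 977, 943 etc.) and either separator, pin the prefix to exactly two more digits:
  - [89][0-9]{2}[.-]\d{3}[.-]\d{4}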
- Another sidetrack:
  - cat
  - mat
  - pat
  - bat
    Let's say we want to match every word ending in "at" except bat.
    We can use a character set [ ] with a caret. REMEMBER: a caret at the start of [ ] means "match any character NOT in the set", so:
    [^b]at - matches any single character other than "b" followed by "at", so cat, mat and pat match but bat does not
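Both sidetracks can be tried out in Python. This is a sketch: the 812 and 700 numbers are invented test cases, and the broader pattern here uses `\d{2}` so the prefix is exactly three digits long.

```python
import re

# 800/900 numbers are from the notes; 812 and 700 are made-up test cases.
numbers = "800-555-4321\n900-555-4321\n812.555.1234\n700-555-4321"

# Only 800 and 900 prefixes, hyphen separators.
only_00 = re.findall(r"[89]00-\d{3}-\d{4}", numbers)

# Any three-digit prefix starting with 8 or 9, hyphen or period separators.
any_8_or_9 = re.findall(r"[89]\d{2}[.-]\d{3}[.-]\d{4}", numbers)

# Negated character set: anything except "b", followed by "at".
words = "cat mat pat bat"
not_bat = re.findall(r"[^b]at", words)

print(only_00)     # ['800-555-4321', '900-555-4321']
print(any_8_or_9)  # adds '812.555.1234'
print(not_bat)     # ['cat', 'mat', 'pat']
```

One thing to keep in mind: `[^b]` matches any character that is not "b", including a space or a digit, so it is not limited to letters.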

Let's say we want to match the following names, including the whitespace, given that some have a period after their title and others do not:

We're going to start by matching the Mr first:
- Mr\.
- This highlights Mr. Schafer and Mr. T but not Mr Smith, as there is no period after his title.
- Mr\.?
- The question mark ? says "we can have 0 or 1 periods there". This now also highlights Mr Smith.
- Mr\.?\s[A-Z]
- The \s matches the whitespace after the title (with or without the period), and [A-Z] matches the first uppercase letter of the surname. In Mr. T's case that is the whole surname, but we still need to match the remaining characters of the other surnames.
- Mr\.?\s[A-Z]\w+
- With this added to the end, only Mr. Schafer and Mr Smith are matched, not Mr. T.
- This is because of the \w+, where the + means "1 or more" word characters. Instead we use the *, which means "0 or more" word characters:

Mr\.?\s[A-Z]\w*
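The difference between `\w+` and `\w*` shows up clearly in Python. The list of names below is an assumption pieced together from the names mentioned in these notes; the real sample file may differ.

```python
import re

# Assumed sample names (the exact list is not shown in the notes).
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

# \w+ needs at least one word character after the capital letter,
# so the single-letter surname "T" is missed.
with_plus = re.findall(r"Mr\.?\s[A-Z]\w+", names)

# \w* allows zero further characters, so "Mr. T" is matched too.
with_star = re.findall(r"Mr\.?\s[A-Z]\w*", names)

print(with_plus)  # ['Mr. Schafer', 'Mr Smith']
print(with_star)  # ['Mr. Schafer', 'Mr Smith', 'Mr. T']
```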

OK we still need to match the Ms and Mrs names

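Let's use a GROUP ( ), which allows us to match several different patterns. Inside the group, | means "or":

M(r|s|rs)\.?\s[A-Z]\w*

A variant that makes the uppercase letter optional and requires at least one word character:

M(r|s|rs)\.?\s[A-Z]?\w+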
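Here is a short Python sketch of the grouped pattern. The name list is again an assumed sample. `finditer` is used instead of `findall` because `findall` returns only the captured group when a pattern contains one.

```python
import re

# Assumed sample names (not shown verbatim in the notes).
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

pattern = re.compile(r"M(r|s|rs)\.?\s[A-Z]\w*")

# group(0) is the whole match; group(1) is what (r|s|rs) captured.
matched = [m.group(0) for m in pattern.finditer(names)]
titles = [m.group(1) for m in pattern.finditer(names)]

print(matched)  # all five names
print(titles)   # ['r', 'r', 's', 'rs', 'r']
```

For "Mrs." the engine first tries `r`, fails on the following `s`, and then backtracks through the alternatives until `rs` fits.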

Match E-Mails

[a-zA-Z0-9.-]+@[a-zA-Z-]+\.[\w.]+

I also changed the [a-zA-Z0-9] ranges for \w, which additionally covers the underscore:

[\w.-]+@[\w-]+\.[\w.]+ --------> I think mine is better than Corey's - Yeah baby :)
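The two e-mail patterns can be compared on a couple of invented addresses. This is a sketch only: the addresses are hypothetical, the dot before the TLD is escaped in both patterns, and neither pattern is a full e-mail validator.

```python
import re

# Hypothetical sample text (not from the tutorial files).
text = "jane.doe@example.com\njohn-smith_99@my-work.net\nnot an email"

# Explicit ranges: no underscore allowed before the @.
narrow = re.findall(r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.[\w.]+", text)

# \w version: letters, digits and underscore are all allowed.
broad = re.findall(r"[\w.-]+@[\w-]+\.[\w.]+", text)

print(narrow)  # ['jane.doe@example.com', '99@my-work.net']
print(broad)   # ['jane.doe@example.com', 'john-smith_99@my-work.net']
```

The narrow pattern only picks up the tail of the second address because the underscore breaks the local part, which is one argument for the \w version.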

Match URLs

Capturing Just The Domain Name and TLD

Now suppose we want to capture just the domain name and TLD (like google.com) without the http, the optional s, the www and so on.

We can divide the regular expression we created into groups using parentheses ( ). The regular expression will now look like:

https?://(www\.)?(\w+)(\.\w+)

Now we have 3 groups

Group 1 - the optional www
Group 2 - the domain name
Group 3 - the TLD

We also have another group, GROUP 0, which is implicit and holds everything we matched, i.e. the ENTIRE URL (http, the optional s, ://, plus groups 1, 2 and 3).
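In Python the groups can be read back from a match object with `group(n)`. A minimal sketch, assuming a few sample URLs (the list here is illustrative):

```python
import re

# Assumed sample URLs, some with www and some without.
urls = "https://www.google.com\nhttp://coreyms.com\nhttps://youtube.com\nhttps://www.nasa.gov"

pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")
matches = list(pattern.finditer(urls))

first = matches[0]
print(first.group(0))  # https://www.google.com  (implicit group 0)
print(first.group(1))  # www.
print(first.group(2))  # google
print(first.group(3))  # .com

# When the optional www is absent, group 1 is None.
second = matches[1]
print(second.group(1), second.group(2), second.group(3))  # None coreyms .com
```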

Back Reference

A back reference is a reference to one of our captured groups. Atom's "REPLACE" feature can use back references to our matches:

Let's reference the groups:

1st group (www\.) as Group 1: $1 - Here $1 = the 1st group etc
2nd group (\w+) as Group 2: $2
3rd group (\.\w+) as Group 3: $3

Most tools use a backslash (\1), but Atom uses a $ sign.

Replacing with $1 alone outputs just the first group: www. for the URLs that have it, and nothing for the ones that don't. Groups 2 and 3 work the same way, giving the domain name and the TLD.

Using the groups, we can convert the URLs to a cleaned-up version without the http or www by replacing each match with the domain name (group 2) and the TLD (group 3): $2$3
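The same clean-up can be sketched in Python with `re.sub`, which uses the backslash form of back references (`\2\3`) in the replacement string rather than Atom's `$2$3`. The URL list is an assumed sample.

```python
import re

# Assumed sample URLs, some with www and some without.
urls = "https://www.google.com\nhttp://coreyms.com\nhttps://youtube.com\nhttps://www.nasa.gov"

pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# Replace each whole match with group 2 (domain) + group 3 (TLD).
cleaned = pattern.sub(r"\2\3", urls)

print(cleaned)
# google.com
# coreyms.com
# youtube.com
# nasa.gov
```

The replacement string is written as a raw string so that `\2` reaches the regex engine as a back reference rather than as a Python escape sequence.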
