Novice to Advanced RegEx in Less-than 30 Minutes + Python

James Briggs · Beginner ·🧠 Large Language Models ·5y ago

Key Takeaways

This video tutorial covers Regular Expressions, an essential skill for coding and Natural Language Processing, using Python and RegEx tools, and demonstrates metacharacters, quantifiers, capture groups, and pattern matching techniques.

Full Transcript

Hi, and welcome to the video. Today we're going to look at regex, which is short for regular expressions. This is essentially the de facto standard for parsing text. So what we're going to do in this video is run through the basics of regex first: how to define digits and whitespace; how to use quantifiers to say how many digits, whitespace characters, or any other characters we want to include; how to use capture groups and character classes; and how to use boundary definitions, so how we define the start of a line, the end of a line, or the boundaries of a word. We should move through these quite quickly because they're pretty straightforward. Then we can move on to what I think is the more exciting stuff, which is a little more advanced: lookahead and lookbehind assertions, modifiers, and conditionals, which are essentially like if-else statements in your regex code. We're going to code through a few examples in Python, and we're also going to use regex101, which is an interactive debugger and regex-building tool. So let's jump straight into it.

Okay, so we're going to start in regex101 and have a look through a few of the metacharacters. Let's say we have this string, which I'm just going to make up: a few letters, some numbers, and two other characters as well. Now, we can obviously match these directly by writing out the exact characters, but generally we're not going to want to do that if we're using regex. This is where we start to use metacharacters, so we'll go through the most common ones. We have \d for digits, which matches any number, so anything between 0 and 9. We can invert that with an uppercase \D, which matches anything that is not a digit. We have \w, which matches any word character, and then we can also
reverse that again by capitalizing it. With these metacharacters we can usually uppercase them and it will reverse what they do. We match whitespace with \s. In this case we don't actually have any whitespace, so let's add some in. It will also match newlines: you can't see this highlighted, but if you look up here you can see two matches, and if I add a newline in, it goes up to three. To match anything but whitespace, we just uppercase that again, \S. Then this one is a bit of a special one: the dot matches any character except newlines, so it is not matching our newline here. We obviously can't uppercase this one, but I suppose the opposite would just be a newline character, \n.

So let's switch over to Python and see how we would do this. We import re, which is the regex module. We'll go through the functions in a bit more depth later, but for now I'm just going to use re.findall. The first argument is the pattern we're going to search with. In this case it could be \n, although we're not going to use that; we're going to use \d for any digits. Then let's pull our string in. Okay, and we return 1, 0, 0, 0, so these four characters here, which is exactly what we get in regex101. So that's cool.

Now, in the case of the full stop, what if we would like to match a literal full stop and just the full stop? To do that, we escape the metacharacter using a backslash, \., just like that. Okay, so that's it for metacharacters.

Let's move on to quantifiers. Quantifiers essentially allow us to match a specific number of characters. So far we've only been matching one character at a time: here we are matching four characters, but we're matching one character, four times. Quantifiers allow us to write our pattern and then add a quantifier to specify how many times to match it. So the first of those is the
one or more quantifier, which is just a plus sign. This will match the pattern one or more times. We also have the zero or more quantifier, the asterisk, which matches the pattern zero or more times.

Something else we can add in here: as you'll see, this is matching as many times as possible, but maybe we want to limit the number of times we're matching something. This is the difference between what are called greedy and lazy quantifiers. At the moment we have a greedy quantifier: it's saying one or more times, and it's going all the way up to four, which is as many characters as it can fit into its pattern. If we want it to not do that, and instead be lazy and pick up as few characters as possible that match the criteria, we can just add a question mark onto the end. Then we're back to matching just one character at a time, because it's one or more and we are limiting it to the minimum number of matches. So keep that in mind, and we'll quickly go over it again a little later.

We also have the once or none quantifier. Let's write a new test string here. We have a few words, and we'd like to match all of the words. What we can do is this, and here we're kind of matching all the words, but there's this one in the middle where we have a hyphen. This is something that happens quite often, and ideally we want to pull out good-hearted as a single word. We could do this, but then we're only matching that single good-hearted part. So instead we add a once or none quantifier, the question mark, after the hyphen. Okay, so now we're matching that word as well. Now, we also want to match the 'a'. You can see it's not matching here because we're expecting at least two word characters, since we have \w here and \w here. We can just add a zero or more quantifier onto the end, and now we're getting all of our words, including the hyphenated word.

We can also specify a specific quantity, which we do like this, with the number in curly brackets. So here we are getting three word characters at a
time. You can see here we're not specifying three characters that make up a word; we're just saying three word characters. So we're getting multiple matches within single words, which is fine, because we haven't specified otherwise. We'll go over how to define word boundaries later. We can also turn this into a range, so let's go three to five. Now we're matching a minimum of three characters and a maximum of five. You might have guessed this, but we can also just remove one of these numbers, for example to match up to five. If this doesn't work for you, make sure you are using the Python flavor, because this can change in other languages; if you're on PCRE, for example, this won't work. So change back to Python. We can also do three or more as well.

I think this is a good example for our lazy quantifier. Here we're matching between three and five characters at a time. If we add a lazy quantifier, it's always going to limit that as much as possible, so we go down to three. You can see that it's limiting how many characters it includes; it's being lazy rather than greedy.

Okay, so let's write out a new example here: 'a few unexpected words are to be expected'. This is a good example of where we can use capture groups. Anything contained within round brackets creates a capture group, and a capture group is simply a fancy way of saying: treat everything within these brackets as a single unit. You can see here I only put these dots in as a filler, but it actually matches, because those dots mean 'anything'. So it's three anythings in a row, and this matches basically anything. We have all these matches here, and it's treating each one as a unit: it's matching three anythings, then moving on to the next three anythings.

Now, what we want to do is match 'unexpected' and 'expected', so we want to match the word with or without its negative prefix. We can add 'expected' here, but here
we're only getting 'expected'; we're not getting the 'un' from 'unexpected'. So we just add this capture group. Now we're getting 'unexpected', but we're not getting 'expected', because we have specified that we want 'un' here in this capture group. So all we need to do is make it optional by adding a zero or one quantifier, and there we go: we are now capturing both 'unexpected' and 'expected'.

Let's have a quick go at this in Python and see what it looks like. Okay, we run this, and now we find that we are only seeing 'un', which is probably not what we're expecting. The reason this happens is that when the pattern contains a capture group, findall returns what the group captured rather than the whole match, which is exactly what we have here. What we can do is turn this capture group into a non-capturing group while still maintaining the zero or one behavior. To do that, we add a question mark followed by a colon just inside the opening bracket, and then we are returning the full matches again. So that's a little bit of strange behavior to watch out for.

Now, we can also add 'or' logic to our groups. Maybe we want to capture anything where 'expected' has a negative prefix, which can be either 'not expected' or 'unexpected', and we want both of these to match. To do this, we just add a pipe into our group and then add 'not' like so, and now we are matching both 'not expected' and 'unexpected'. Okay, so that's it for capture groups.

Let's move on to character sets. The syntax for character sets is kind of similar to the syntax for capture groups in that we use brackets, but this time they're square brackets. You can see these as kind of like a list: anything we put within them is treated as a character to match. But unlike capture groups, it's not treating them all as a unit. So if we put 'un' in here, it just matches either 'u' or 'n'. And we can put 'unexpected' in, and it's not going to match 'unexpected' as a unit; it's just going to match each one of the characters within those square brackets. So let's return to
our earlier example. Earlier on we were matching all of the digits in our string. What we could do is write out all of the digits like this, and we get the exact same effect. Obviously this is quite long, so what we can do instead is write 0-9 with a dash in the middle, and this means any character within the range of zero to nine. We can also add letters to this, so a to z, for example. You might also think we can add the hyphens in, right? But we are using hyphens to define our ranges, so in order to add a literal hyphen here we need to use a backslash to escape it. Now we are matching every character in the string, and if you want to match the full string as a whole, of course, we just add our quantifier.

Now let's move on to boundaries. I'm just going to write out a new string for this. Here I want to show you the start-of-string and end-of-string boundaries. Start of string uses the caret character. So here, if we put 'if' after a caret, it's only going to match the 'if' at the start; it's not going to match the 'if' here as well, and if we remove the caret, it does. So we add the caret to specify that we only want to match at the very start of our string. The equal and opposite of the start-of-string character is the end-of-string character, which is a dollar symbol. So let's rewrite this: we want to look for 'example', and with the dollar symbol we only match the final 'example' rather than both of them, as you can see there.

I'm going to go back to one of the earlier examples again now. Here I also want to show you the word boundary. The best way to identify a word boundary is not by using, for example, whitespace. Yes, that works in a lot of cases, but it doesn't work if we have a comma, full stop, hyphen, or anything like that. What we can do instead is use \b, and this identifies every single word boundary within our text, as you can see from the pink lines. Then we can use that to capture any of our words, and now
quite easily we've captured every single word, and we're pulling them out in a more efficient way than if we had tried to write \s, or if we'd gone with a grouping like this and added all these different characters. All we need to do is add a word boundary.

Okay, so now we'll move on to what I think are some of the more interesting and definitely more advanced methods in regex. The first of those is the lookahead and lookbehind assertions. We have this string here with two 'hello world's: one of them is preceded by a one and a colon, the other by a two and a colon. Now, what if we want to match 'hello world', but only if it is preceded by a one and a colon, and we don't want to include that one and colon within our match? If we did want that, we would just write this, but it returns the entire string, as I'll show you over here. Okay, so we're returning the full string. What if we only want to pull out this 'hello world'? Well, we could go for 'hello world', but then we're returning both. We don't want that; we only want the first one.

To do this, we use a lookbehind assertion. This means that we are looking behind our pattern, which means at anything preceding it, and we are asserting that this other pattern is there. We do this with this pattern here, (?<=...), and anything we place between the equals sign and the closing bracket is our assertion pattern. Okay, so now we are matching just this first 'hello world'. If we take this and put it into our code, we return just the first 'hello world'.

On the other hand, maybe we want to assert something that comes after our pattern. To do that, we use a lookahead assertion, which, as you probably guessed, is basically the same thing but on the other side. It uses a slightly different syntax, (?=...), but other than that there's really no difference. In our case we're going to look for this comma. So in between the equals sign
and the closing bracket here is where we put our pattern, and here again we're matching this first 'hello world'. I'm going to put this in Python, and of course we get the exact same thing.

On the other hand, maybe we don't want to assert that something is in front of or behind our pattern; we want to assert that something is not there. What we can do is make this a negative lookahead by replacing the equals sign with an exclamation mark. Now we are looking for the 'hello world' that is not followed by a comma, which is obviously this one. And again, with the lookbehind we just modify that as well: whereas the lookbehind looked like this, we again remove the equals sign and replace it with an exclamation mark, and then we get the second 'hello world'. Okay, so that's it for the assertions.

Let's move on to modifiers. We can see we have a few modifiers here, and these are essentially ways of modifying the behavior of our entire regular expression. Obviously Python doesn't have this little regex options menu on the side, so what we can do instead is use an inline modifier. Let me give you a good example quickly. We'll write out a string that includes a newline character in the middle, and we're going to match a character and then anything following that character. If you remember, this 'anything' metacharacter does not actually match anything; it matches anything except newlines. So in this case it is not matching here, because of the newline. Now, if we open this menu, we can see that we have this 'single line' option, where dot matches newline. We add that, and now it has changed the behavior of our regex, and the 'anything' metacharacter also matches newline characters. Now let's remove it from here, and we can add it inline like this, (?s). Here we're just adding the s within this global modifier group, and we can also add other modifiers if we want. All you need to do is add the letter
that represents that global modifier within those brackets. Now, if we take that over to Python and add this in, so here is our inline modifier, yep, it works. That's great. If we get rid of it, it doesn't match anymore, and that's what we would expect. In Python we can also add modifiers within the function call itself. What we do is add re-dot and then the capital letter of the modifier flag (for example, re.S), and there we go, it matches again. So there are a few different ways that you can do it in Python.

Now let's move over to conditionals. This is the last topic we're going to work through, and these are probably a little more complex, in my opinion, to read and understand, so I'm going to use a few different examples here. I'm just going to quickly make this up. Okay, so we have these three lines, and each of them has something in common. What I want to do is put a condition within a capture group, which will either match or not. I don't want to require this condition within my regex, because if it isn't there, I want to search for something else. To do that, we add this group here, and this is basically our if-else group. I'm just going to put 'i' here for now; I'll explain that in a moment. The way it reads is: if the condition is true, then we also want to search for this; if the condition is not true, then just search for this instead.

Our condition here is within a capturing group, and this token refers to the index of our capture group. In our case we only have one capture group, and you can see regex101 even highlights it as the first capturing group, so this actually needs to be 1. That essentially links the condition within this capturing group to this if-else statement. We can see that, for now, the condition is not true, because we haven't written 'condition' anywhere in our text. So it's going into the if-else and saying: okay, if it's true, which it isn't, search for this; but it isn't true, so we are actually searching for 'bye'. It is producing a match because
all we need to match is 'bye'. But in reality we do want to search for a condition, which is going to be 'hello'. Now we are matching two things. We're matching this line because 'hello' is true here, and if 'hello' is true, our if-else statement says we now need to match ' world', a space followed by 'world', which we do right here. And on this line, just like before, we find that 'hello' is not true; it doesn't say 'hello' here, so that is not what we want to look for. We want to look for our else pattern instead, and we do in fact have 'bye' here, so that again matches. So that's everything on the regular expressions themselves.

I just want to quickly go through the difference between re.match, re.search, and re.findall in Python. Let's remove these. I'm going to write a string very quickly, and we'll use this as our example. Now, if you remember, before we had the start-of-string character. re.match is essentially like putting that character in front of whatever pattern you give it. What I mean by that is, if we do re.match with 'hello', we get a match. We also need to call .group() on the result after we use re.match or re.search, which is something to be aware of. Okay, fine, we would expect that: there's a 'hello' here and a 'hello' here, so of course it's going to match 'hello'. But if we put 'world', we don't match anything. We get a NoneType, which means nothing has been matched, and the reason is that match effectively adds the start-of-string token. If we moved 'world' to the start here, this would work, but before we didn't have it there, so it didn't. So that's what re.match does. It also returns just one match, unlike findall, where you'll remember we were getting a list of matches.

Now, re.search doesn't specify that we only look at the start of the string; instead, re.search looks through the whole thing. So if we search for 'world', we do
get a response, because we're not specifying that it needs to be at the start. Okay, so that's great, but you'll also notice that we are only returning one thing here, and that will not change if we search for 'hello'; we're still just returning one item. What re.search does is search through the whole string, but only for the first instance. It gets to here, says 'okay, I found hello', and returns that match. It doesn't go any further to find anything else. That's where findall is a little bit different. With findall we can't use .group(); we just print out x, and it goes through and finds everything.

So that's it for this video. I hope it's been useful. We've been through a lot of regex, so don't worry if it kind of blew your mind a bit if you're new to this; it is quite a lot. Nonetheless, regex is super important, and I would definitely recommend getting familiar with it if it's new to you. It's an incredibly useful skill no matter what you specialize in; as long as you code, you're probably going to use regex. So that's the video on regex. I hope you enjoyed it, and as always, thank you for watching. Bye.
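
The difference between re.match, re.search, and re.findall described at the end of the transcript can be sketched roughly as follows. The example string is an assumption made up for illustration, since the string typed in the video is not shown in the transcript.

```python
import re

# Hypothetical example string (not the one from the video).
text = "hello world, hello again"

# re.match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"hello", text)
print(at_start.group())            # hello
print(re.match(r"world", text))    # None, because "world" is not at position 0

# re.search scans the whole string but returns only the first match.
first = re.search(r"world", text)
print(first.group(), first.start())   # world 6

# re.findall returns every non-overlapping match as a list of strings,
# so there is no match object and no .group() call.
every = re.findall(r"hello", text)
print(every)                       # ['hello', 'hello']
```

Because re.match and re.search return None when nothing matches, calling .group() directly on the result raises an AttributeError in that case, so checking for None first is usually safer.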

Original Description

A full tutorial covering everything you need to know about Regular Expressions, an essential for anyone learning to code, and even more so for anyone interested in Natural Language Processing. This video includes:
- metacharacters
- quantifiers
- capture groups
- using capture groups in Python
- character sets
- look-ahead and look-behind assertions
- negative look-ahead and look-behind assertions
- inline modifiers
- passing modifiers as function parameters in Python
- conditionals (if-else statements for RegEx)
- re.match
- re.search
- re.findall

We cover all of this in depth in this tutorial, including examples all the way through on RegEx101 (an interactive debugging/regex building tool) and also in Python.

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
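
As a rough sketch of the more advanced items named in the description above (look-ahead and look-behind assertions, modifiers, and conditionals), the following Python snippet may help. The strings and the helper function starts are invented for illustration and are not necessarily what appears in the video.

```python
import re

# Invented example string: two "hello world"s with different prefixes.
text = "1: hello world, 2: hello world"

def starts(pattern, string):
    """Hypothetical helper: start index of every match of pattern in string."""
    return [m.start() for m in re.finditer(pattern, string)]

print(starts(r"hello world", text))            # [3, 19]  both occurrences
print(starts(r"(?<=1: )hello world", text))    # [3]   lookbehind: preceded by "1: "
print(starts(r"hello world(?=,)", text))       # [3]   lookahead: followed by ","
print(starts(r"hello world(?!,)", text))       # [19]  negative lookahead
print(starts(r"(?<!1: )hello world", text))    # [19]  negative lookbehind

# Modifiers: by default "." does not match a newline.
multiline = "hello\nworld"
print(re.findall(r"o.w", multiline))              # []
print(re.findall(r"(?s)o.w", multiline))          # ['o\nw']  inline modifier
print(re.findall(r"o.w", multiline, re.DOTALL))   # ['o\nw']  flag argument (re.S is an alias)

# Conditional: if group 1 ("hello") matched, require " world"; otherwise require "bye".
conditional = r"(hello)?(?(1) world|bye)"
for line in ["hello world", "goodbye", "hello there"]:
    m = re.search(conditional, line)
    print(repr(line), "->", m.group() if m else None)
```

The last loop prints a match for the first two lines ("hello world" and "bye") and None for "hello there", where neither branch of the conditional can be satisfied.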

This video teaches the fundamentals of Regular Expressions and their application in Python, covering topics such as quantifiers, capture groups, and pattern matching, with hands-on coding examples and demonstrations.

Key Takeaways
  1. Add a quantifier to a pattern to specify how many times to match
  2. Use a lazy quantifier to limit matches to the minimum possible
  3. Create a capture group to treat everything within round brackets as a single unit
  4. Use character sets to match any one character from a list of characters
  5. Use boundaries to specify the start or end of a string, or the edge of a word
  6. Use lookahead and lookbehind assertions to match a pattern only when another pattern does (or does not) come before or after it
💡 Regular Expressions are a powerful tool for text processing and pattern matching, and understanding their basics is essential for effective coding and Natural Language Processing.
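
The first five takeaways above can be sketched in Python along these lines. The test strings are assumptions chosen to resemble the examples described in the transcript; they are not copied from the video.

```python
import re

# Quantifiers: greedy versus lazy.
s = "ab1000-cd"
print(re.findall(r"\d", s))     # ['1', '0', '0', '0']  one digit per match
print(re.findall(r"\d+", s))    # ['1000']  greedy: as many digits as possible
print(re.findall(r"\d+?", s))   # ['1', '0', '0', '0']  lazy: as few as possible

# Once-or-none (?) and zero-or-more (*) keep hyphenated words together.
print(re.findall(r"\w+-?\w*", "a good-hearted person"))
# ['a', 'good-hearted', 'person']

# Capture groups: findall returns the group contents, not the whole match.
sentence = "a few unexpected words are to be expected"
print(re.findall(r"(un)?expected", sentence))     # ['un', '']
# A non-capturing group (?:...) keeps the grouping but returns full matches.
print(re.findall(r"(?:un)?expected", sentence))   # ['unexpected', 'expected']

# Character set with ranges and an escaped literal hyphen.
print(re.findall(r"[0-9a-z\-]+", s))              # ['ab1000-cd']

# Boundaries: start of string, end of string, and word boundary.
print(re.findall(r"^if", "if it works, if not"))             # ['if']
print(re.findall(r"example$", "an example of an example"))   # ['example']
print(re.findall(r"\b\w+\b", "hello, world-wide web."))
# ['hello', 'world', 'wide', 'web']
```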
