Regular Expressions - Computerphile

Computerphile · Intermediate ·🔧 Backend Engineering ·6y ago

Key Takeaways

Professor Brailsford discusses Regular Expressions, a fundamental concept in Computer Science, and provides an in-depth explanation of its applications and uses, including pattern matching and text processing, with a focus on practical examples and real-world scenarios, utilizing tools such as regex editors and programming languages like Python and Java.

Full Transcript

today well today as ever we're going to listen to our subscribers and a lot of them keep saying well you haven't really done regular expressions head-on you keep on mentioning them as though we know all about them but we don't it would be nice to have a sort of regular expressions explained as it were so although i'm going to do my best to do regular expressions completely as a one-off no previous experience needed for those of you who look at it and think oh come on this relates to something else we will give you the link to go to the previous episodes and you can get more details about just how it links into other things to those of you who think i must be infinitely old and personal friends of all these people well i am of some of them but not all and today's hero is called stephen claney he's the man who in the mid-1950s invented regular expressions what claney wanted was a way that he didn't have to draw diagrams like this one but that he could abbreviate what the automaton was doing so i'm going to number these states zero one and two and i'm going to label this one over here as being the finish stage and this is going to be an incredibly simple string fragment a fragment of a string of characters that we're going to recognize this is the automaton way of depicting it at the start you are in state zero and we're going to say that in state zero as you are analyzing your input string we'll put some input strings down here that this thing will be able to recognize if what you see at the start of your string at the moment not been coped with is a letter a then you're going to state one if let's take the simplest one i want to recognize if in state one you've go to the a if the next character is c then that is acceptable so i write a c over that transition arc here and that gets me into state two usually the finished stage is sometimes distinguished by drawing a sort of double circle like that so you just glanced and say ah that's it so yeah what i've drawn there certainly recognizes the string ac frankly electronics engineers and so on were using these not in the late 50s in the early 50s they use them for creating and understanding what they called state machines and as i say clainey sort of like the idea like the pictorial notation but wanted to see if he could get it to be more compact so let's just finish the story by saying that if you come around on this loop and re-enter state one you can do so by recognizing a b in the string so we've already shown that it can recognize ac you just come here you start you get an a you see a c and you're finished but equally as we now see it would also accept abc start here cna if the next character in your string is a b fine you accept it but you come back into state one although it doesn't explicitly involve a stack this is like a sort of iterative recursive re-entry into the same state but the only way out of that state to the finish is fire accepting a c at the end of the string for number two you can keep going around this b loop as long as you want it will certainly accept a b b c a three b c and so on and effectively what klinger came up with as being a good way of talking about these things over the phone and not having to draw diagrams is to say in his regular expression notation that it accepts a b star c and this is the first bit of kleener invented notation that star which as we almost all of us now know means zero or more of there we are then the strings that that automaton accepts is a followed by zero or more b's finishing with a c and the nice thing about this regular expression notation is just look at how compact it is it is just so much easier to handle in a program than trying to do elementary computer graphics and literally draw yourself automata although for pictorially i think everybody liked this way of sketching out what your automaton would accept those of you some of you may have seen it if not follow the link out to the things we did about three years ago you'll find in there an automaton that accepts 25 pence in order to issue a parking permit it's all done with quite complicated but locally simple little transitions on either coins or characters or whatever so just to underline and re-emphasize why this regular expression notation of cleaners was seized upon because it was so compact is let's complete the triangle one side of the triangle is the automaton approach another side of trial we've now seen is the regular expression approach completely equivalent but if we look at the chomsky grammar approach done on the playlist would that be compact and nice not so much the rules on chomsky grammars in their purest form is that if you use a capital letter that is a so-called non-terminal symbol there'll be another left-hand side rule that develops it further whereas lowercase literally means what it says so this means well chomsky tended to call them all sentences a sentence in this limited language is a little letter a followed by anything that a b can be so i have to complete this by saying well what can a capital b become well again look at the automaton diagram to which it's completely equivalent what happens with the b is you can spit out a lowercase letter b or a b or you can have a much simpler rule that says a capital b can just become the letter c so there we are instead of one small line we've got three lines and if you're doing it the grammar way and you might say oh well can't you abbreviate that grammar a bit more it's it's awful having to do three lines like that it's very verbose uh yeah here's the uh here is the allowed shortening but it's not much is you could say all right i will allow myself the luxury of the or bar and put a c there so it's down to two lines now not three what it's saying is a b can be a lowercase b followed by recursively entering to b again or it can just be the letter c so we're down to two lines but it's still nowhere near as compact and nice as the regular expression is and what i must absolutely emphasized can't emphasize too much i should put double headed arrows all over the place here is that all of these things the automaton diagram the regular expression the grammar they are all completely equivalent to each other those of you who've been through mathematics courses will know your instructors go on and on and on about no i'm not asking you whether one is a subset of the other or one is contained within the other book there's a little bit extra no they are completely and totally equivalent there is not a single thing that one will accept that the other won't and vice versa so it's just a matter of convenience of the notation perhaps if you're a computer scientist for using in a programming context and i think the other thing that also had to be coped with is to say well all right you can do it as an automaton diagram you can do it as a regular expression you use a grammar regular expression is looking good because of the compactness of the notation but um are we sure that there aren't some snags in this process of saying they're equivalent well one snag that did occur and was recognized very early on in the late 1950s i mean all of this is that sometimes looking at it let's look at it from the automaton diagram point of view sometimes you get a situation where what you want is what's called non-deterministic in other words i'm happy here because there's only one exit from zero to get you into one it's labeled a how would it be if i took another arrow out of zero and said a could also lead you somewhere else so let's just draw up a little um diagram of what horrors that might be what happens and what would you do about it if you say well my regular expression let's call it e is actually either a b a b or i want to recognize a b b b now you might say well that's useless however you know how does that fit into anything it does illustrate though a very very important point which is that if you are trying to build an automaton a recognizer for this you've got two alternatives here you are at state zero and you want to go into possibly two completely different directions but they both begin a how do you choose for the moment i'm saying no cheating no you're not allowed to look ahead you just get given an a what do you do and worse still of course it doesn't end there it then goes on that whichever route you go you find that the next thing is you've got to recognize a b so the common factor if you like and it is like factorizing expressions in algebra you know x cubed minus one take out an x squared factorize it out at the front it's x squared x minus one that's sort of it's similar in principle to that now of course they do change a bit eventually because this one accepts an area here and then finishes up by accepting a b this one next one goes a b and another b and finally it gets down into the finish state as well okay well it looks trivial but what are you going to do here you are the programmer trying to implement this you look at the string and you say but there's two ways i can cope with this the most amazing thing about this is that two theoreticians in the late 50s won the turing acm award for saying it will always be possible to turn the non-deterministic one into a deterministic one now how would we do that on this case easy you start off with zero you say that first of all we'll accept you always do an a and then a b that's the common factor if you like for the start but you put your split point here that that continues with an a and that line continues with the b to get the two avenues that you see up there so we factorizing out the a and the b and the story was well it's all right for us humans who are really intelligent to see that you can do that but will it always be possible for these things to turn the non-deterministic one into a deterministic one where you force it through factorization first and the answer from dana scott and michael rabin in the late 1950s was it is always doable so long as you stick to simple finite state automata don't start monkeying around with extra ram or stacks just stick to them as they are you can always do it but it can get very very hairy so who was the first person who plunged headlong into this and said i'm gonna do it answer ken thompson our hero from unix bell labs and a whole lot by the late 60s he effectively was saying this regular expression stuff is great for pattern matching in editors i want to use it but this problem of non-determinism how do i cope with it and it is the most amazingly far-sighted piece of work you understand when you look at this why ken thompson is a legend he basically said i'm going i know it's not deterministic but i look at all possibilities and i'll start pre-compiling little bits of assembler code to cope with whichever one turns up on the day as it were and everyone was open-mouthed about this he also realized he said look yeah if you don't mind doing pre-processing you can actually you know michael and dana have shown us that it is possible yeah i know how to do that but on the other hand that could take some time because it's all very well doing a little toy example like this but non-determinism in a big real life automaton could be hell to disentangle what's an example of one of those then what like um well let me give you one example which sounds utterly innocuous but which i can give you sean a complete page that you can show to the fans about this brian kernighan wanted to get a regular expression recognizer for all of the key words and all of the constructs in his pick language which after all is a preprocessor for the unix t roth type setting language you just want to draw simple line diagrams so we've got primitives like circle line ellipse from two with dotted to show you the nature line all this kind of stuff it looks innocuous biggish but innocuous and uh he put it through one of these nfa to dfa let's stop it being non-deterministic we want fast recognition we want to build an engine don't care it has more states than the non-deterministic one but boy it's got to go off like a lamborghini right so be prepared to spend some pre-processing time you can put out his release note with pick which says the lex phase takes an e on 15 minutes on a vax 750 full stop be patient brian is always noted for his earnest hemingway terseness you know so that is what can come to haunt you is that deconvoluting tracing around everything to get it deterministic it is worthwhile if you are going to be putting huge pick scripts through this so you're hammering it over and over and over again you think i've got to get this really efficient otherwise we'll be here forever while we're using it on the other hand ken's usage of it in the unix editor ed took the pragmatic view this thing ed in the early days because stream editor lee mcmahon hadn't been invented yet just edie you can get away with very simple regular expressions because that's all that humans will use if you're feeding it stuff that's been prepared by another program that's when your pet schemes tend to die the death because stuff prepared by another program can exploit little wrinkles that cause you great difficulty but ken said no in ed i think i can get away with keeping it non-deterministic but preparing and look ahead it's one of the first examples of just in time compilation he took the attitude i'll look at which way the cookie crumbles try and work out which is the most likely and pre-compute pieces of fast assembler code that i can put in to actually execute these things and it was the most amazing flexible system it basically adapted to the input it is a good example of just in time compilation what ken says is i'll do it as i need it brian in pic couldn't say that because it's basically much more like a compiler and it's a compiler that's been driven by reams of externally provided input so two approaches to the same thing are you going to make it deterministic in all cases ahead of time are you going to cope with it as you go the man goes to town as we would say to town the man goes sounded to me 20 years ago when i first stumbled on this very much like yoda the jedi master for those of you coming into this cold and direct because you saw the word yoda and grepped over the entire universe

Original Description

Professor Brailsford on one of our most requested topics. Playlist of Videos the Prof mentioned: https://www.youtube.com/playlist?list=PLzH6n4zXuckpkgSrHX87sDCmEZSumytL3 EXTRA BITS: https://youtu.be/-U_ggjfLc1g https://www.facebook.com/computerphile https://twitter.com/computer_phile This video was filmed and edited by Sean Riley. Computer Science at the University of Nottingham: https://bit.ly/nottscomputer Computerphile is a sister project to Brady Haran's Numberphile. More at http://www.bradyharan.com
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Computerphile · Computerphile · 0 of 60

← Previous Next →
1 Follow the Cookie Trail - Computerphile
Follow the Cookie Trail - Computerphile
Computerphile
2 EXTRA BITS - Follow the Cookie Trail - Computerphile
EXTRA BITS - Follow the Cookie Trail - Computerphile
Computerphile
3 Musical Floppy Drives - Computerphile
Musical Floppy Drives - Computerphile
Computerphile
4 The Hair Algorithm - Computerphile
The Hair Algorithm - Computerphile
Computerphile
5 Getting Sorted & Big O Notation - Computerphile
Getting Sorted & Big O Notation - Computerphile
Computerphile
6 Quick Sort - Computerphile
Quick Sort - Computerphile
Computerphile
7 Hyper History and Cyber War - Computerphile
Hyper History and Cyber War - Computerphile
Computerphile
8 Entropy in Compression - Computerphile
Entropy in Compression - Computerphile
Computerphile
9 Original Elite on the BBC B - Computerphile
Original Elite on the BBC B - Computerphile
Computerphile
10 IP Addresses and the Internet - Computerphile
IP Addresses and the Internet - Computerphile
Computerphile
11 A Career in Video Games - Computerphile
A Career in Video Games - Computerphile
Computerphile
12 Error Detection and Flipping the Bits - Computerphile
Error Detection and Flipping the Bits - Computerphile
Computerphile
13 Programming BASIC and Sorting - Computerphile
Programming BASIC and Sorting - Computerphile
Computerphile
14 Birthplace of the World Wide Web - Computerphile
Birthplace of the World Wide Web - Computerphile
Computerphile
15 Punch Card Programming - Computerphile
Punch Card Programming - Computerphile
Computerphile
16 Programming Paradigms - Computerphile
Programming Paradigms - Computerphile
Computerphile
17 CERN Computing Centre (and mouse farm) - Computerphile
CERN Computing Centre (and mouse farm) - Computerphile
Computerphile
18 Error Correction - Computerphile
Error Correction - Computerphile
Computerphile
19 Home-Made Code - Computerphile
Home-Made Code - Computerphile
Computerphile
20 Security of Data on Disk - Computerphile
Security of Data on Disk - Computerphile
Computerphile
21 Gesture Controls - Computerphile
Gesture Controls - Computerphile
Computerphile
22 How Intelligent is Artificial Intelligence? - Computerphile
How Intelligent is Artificial Intelligence? - Computerphile
Computerphile
23 Encryption and Security Agencies - Computerphile
Encryption and Security Agencies - Computerphile
Computerphile
24 Virtual Machines Power the Cloud - Computerphile
Virtual Machines Power the Cloud - Computerphile
Computerphile
25 Hacking Websites with SQL Injection - Computerphile
Hacking Websites with SQL Injection - Computerphile
Computerphile
26 How Huffman Trees Work - Computerphile
How Huffman Trees Work - Computerphile
Computerphile
27 Cracking Websites with Cross Site Scripting - Computerphile
Cracking Websites with Cross Site Scripting - Computerphile
Computerphile
28 Cloud Computing (Cloudy with a Chance of Pizza) - Computerphile
Cloud Computing (Cloudy with a Chance of Pizza) - Computerphile
Computerphile
29 Texting Cabbage with a Recorder - Computerphile
Texting Cabbage with a Recorder - Computerphile
Computerphile
30 Hashing Algorithms and Security - Computerphile
Hashing Algorithms and Security - Computerphile
Computerphile
31 How YouTube Works - Computerphile
How YouTube Works - Computerphile
Computerphile
32 How NOT to Store Passwords! - Computerphile
How NOT to Store Passwords! - Computerphile
Computerphile
33 A New Golden Age of Video Games - Computerphile
A New Golden Age of Video Games - Computerphile
Computerphile
34 A Universe of Triangles - Computerphile
A Universe of Triangles - Computerphile
Computerphile
35 Cross Site Request Forgery - Computerphile
Cross Site Request Forgery - Computerphile
Computerphile
36 The True Power of the Matrix (Transformations in Graphics) - Computerphile
The True Power of the Matrix (Transformations in Graphics) - Computerphile
Computerphile
37 The Great 202 Jailbreak - Computerphile
The Great 202 Jailbreak - Computerphile
Computerphile
38 EXTRA BITS - Printing and Typesetting History - Computerphile
EXTRA BITS - Printing and Typesetting History - Computerphile
Computerphile
39 Triangles to Pixels - Computerphile
Triangles to Pixels - Computerphile
Computerphile
40 The Problem with Time & Timezones - Computerphile
The Problem with Time & Timezones - Computerphile
Computerphile
41 The Visibility Problem - Computerphile
The Visibility Problem - Computerphile
Computerphile
42 Lights and Shadows in Graphics - Computerphile
Lights and Shadows in Graphics - Computerphile
Computerphile
43 The Penguin Barcode - Computerphile
The Penguin Barcode - Computerphile
Computerphile
44 Typesetters in the '80s - Computerphile
Typesetters in the '80s - Computerphile
Computerphile
45 The Font Magicians - Computerphile
The Font Magicians - Computerphile
Computerphile
46 The Little Mac with the Big Bite - Computerphile
The Little Mac with the Big Bite - Computerphile
Computerphile
47 EXTRA BITS - More on the Original Mac at 30 - Computerphile
EXTRA BITS - More on the Original Mac at 30 - Computerphile
Computerphile
48 XP to Ubuntu with an 8yr old Hacktop - Computerphile
XP to Ubuntu with an 8yr old Hacktop - Computerphile
Computerphile
49 EXTRA BITS - Hacktop Real-Time Boot Comparison - Computerphile
EXTRA BITS - Hacktop Real-Time Boot Comparison - Computerphile
Computerphile
50 EXTRA BITS - Making a Bootable USB in Linux - Computerphile
EXTRA BITS - Making a Bootable USB in Linux - Computerphile
Computerphile
51 EXTRA BITS - Installing Ubuntu Permanently - Computerphile
EXTRA BITS - Installing Ubuntu Permanently - Computerphile
Computerphile
52 The Dawn of Desktop Publishing - Computerphile
The Dawn of Desktop Publishing - Computerphile
Computerphile
53 What is Bootstrapping? - Computerphile
What is Bootstrapping? - Computerphile
Computerphile
54 Reverse Polish Notation and The Stack - Computerphile
Reverse Polish Notation and The Stack - Computerphile
Computerphile
55 Home-Made Z80 Retro Computer - Computerphile
Home-Made Z80 Retro Computer - Computerphile
Computerphile
56 Should Everybody Learn to Code? - Computerphile
Should Everybody Learn to Code? - Computerphile
Computerphile
57 Programming in PostScript - Computerphile
Programming in PostScript - Computerphile
Computerphile
58 Heartbleed, Running the Code - Computerphile
Heartbleed, Running the Code - Computerphile
Computerphile
59 YouTube's Secret Algorithm - Computerphile
YouTube's Secret Algorithm - Computerphile
Computerphile
60 YouTube Search & Discovery - Computerphile
YouTube Search & Discovery - Computerphile
Computerphile

This video provides an introduction to Regular Expressions, a powerful tool for pattern matching and text processing, with practical examples and applications in Computer Science, including code optimization and data validation. The video is suitable for intermediate learners with a basic understanding of programming concepts. By the end of this video, learners will be able to apply Regular Expressions to real-world scenarios and understand its importance in Computer Science.

Key Takeaways
  1. Install a regex editor or IDE
  2. Write a simple regex pattern
  3. Test the regex pattern using sample data
  4. Apply regex to a real-world problem
  5. Optimize code using regex
  6. Validate data using regex
💡 Regular Expressions is a fundamental concept in Computer Science that can be used for pattern matching, text processing, and code optimization, and is essential for any programmer or developer to learn.

Related AI Lessons

Up next
This Cop Was Held Accountable For His Brutality! #police #lawyer
Hampton Law
Watch →