Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing
Key Takeaways
Builds Natural Language Processing foundations using tokenization with TensorFlow
Full Transcript
Hi, and welcome to episode 8 of Machine Learning Foundations. I'm Laurence Moroney from the Google AI team, and I'm here to be your guide through the basics of machine learning. Up to now you've learned how machine learning works and explored examples in computer vision by doing image classification, including understanding concepts such as convolutional neural networks for feature identification and image augmentation to avoid overfitting, making your networks that little bit smarter. We're now going to switch gears and take a look at natural language processing. In this video we'll look at how a computer can represent language, and that's words and sentences, in a numeric format that can then later be used to train neural networks. This process is called tokenization, so let's get started.

Consider this word. It's the English word "listen", and it consists of six letters. We're used to reading it based on the sounds and putting those sounds together to form a word, but how can a computer understand this word? Well, one way, as computers deal better with numbers than they do with letters, is to assign a number to each letter. A common coding format is called ASCII, where common letters and symbols are encoded into the values from 0 to 255. It's useful in that only one byte is needed to store the value for a letter, but it has been superseded by later encodings in order to give access to characters and letters beyond 255, in particular international characters. For the purposes of illustration we can stick with ASCII, where, for example, the letter L is 76, I is 73, and so on. So we now have the word "listen" encoded into six bytes, one for each letter.

Now, this is a perfectly valid encoding, and often when you use neural networks you'll see character encoding or subword encoding and stuff like that. They lead to things being a little bit more complicated, and in these tutorials I'm going to do word-based encoding and not the letter-based encoding that we just saw. Now, why would I do this? One reason is that if we're taking a word as a set of numbers, unless we take the sequence of those numbers into account, we can have two words, sometimes with opposite-ish meanings, like this, and they can have the same letters. Thus, if we want to use character-based encoding, a computer can't tell the difference between these two words unless we have a sequence model, and that's a little bit more complicated than we need to look into right now.

So let's consider a different encoding, and that's a word-based one. That way each of these words can be represented by a single number, and each number will be different. There's also a nice hidden advantage to this, which we'll see in a moment. Consider this sentence: "I love my dog". It's a pretty straightforward one. If I encode based on words, I can come up with an arbitrary encoding: say the word "I" is number 1, and then "love", "my", and "dog" become 2, 3, and 4 respectively. Now, if I were to encode another sentence, for example "I love my cats", the words "I love my" already have numbers, so I can just use 1, 2, and 3 again for them, and I can create a new number for "cats", which I'll say is number 5. So now my sentences are 1 2 3 4 and 1 2 3 5. What's interesting here is that now the words are gone and just the tokens for the words are used, we can begin to tell that there's a similarity between the sentences. So maybe we're beginning to get a glimpse of what it might look like to have sentences turned into numbers yet maintain some kind of meaning.

The process I just outlined is called tokenization, and it's an inherent part of doing natural language processing, or NLP. TensorFlow gives you APIs that help you to achieve this very simply. We'll take a look at them next.

Here's all the code that you would need to tokenize the sentences as I showed earlier. We can break it down and go through it line by line. The tokenizer tools are part of the TensorFlow Keras libraries, and they're in the preprocessing namespace, so make sure you import these. I'm going to hard-code the sentences into an array. Now, while this is a super simple corpus, there's just two sentences and five unique words, this design pattern can work for much bigger sets of data. You'll soon be working with tens of thousands of sentences with thousands of unique words, and it's all pretty much the same code, so don't worry right now if this looks a little bit too simplistic.

You can then create a tokenizer, with a lowercase t, by simply creating an instance of the Tokenizer, uppercase T, and initializing that with parameters. One of these is the num_words parameter, which specifies the maximum number of words that you want to care about. There's only five unique words here, so it doesn't really make a difference, but with larger sets of text it can. You'll commonly encounter bodies of text with many thousands of unique words in them, and lots of these words may be used only once or twice. By specifying the number of words that you care about in your tokenizer, you get an easy way to filter those out. The tokenizer is smart enough to assign tokens to words based on how commonly used they are in the corpus, so the most common word will be at index 1, the next most common word will be at index 2, et cetera.

To get the tokenizer to do its job, you can call fit_on_texts and pass it your corpus of text. In this case it's our simple array of sentences. To see the word index that the tokenizer created, you can just get the word_index property. This will give you a set of name-value pairs where the name is the word and the value is the token for that word, and then you can just print this to inspect it. When you print it out it won't necessarily be in any order, but keep an eye on the values. Like I said earlier, the most common words will have the lowest index, and in this set "I love my" appears twice while "dog" and "cat" both appear once, so "I", "love", and "my" are the lower-indexed words, 1, 2, and 3, and "dog" and "cat" are the higher-indexed ones, 4 and 5.

So what if we expand our sentences and add some more content, like maybe "You love my dog!" with an exclamation mark? Note that exclamation. The default behavior of the tokenizer is to strip punctuation out like this. It can be overridden, but we'll keep it in for now. It also makes all of the words lowercase, so my capital-D "Dog" will be treated in the same way as a lowercase-d "dog". The tokens will now look like this. Notice that they've moved around a little: "love" is now the number 1 token because it's the most used word, and it's similar with "my". Also notice that "dog" lost its exclamation, and there's only one token in here for "dog", and it represents both usages of the word despite the exclamation being on the second one. And we've added a new word, "you", because that was first used in the new sentence that was added to the corpus.

So I hope you found that pretty straightforward despite the underlying power in the tokenizer. Next up, I'm going to step through a Colab with the code for all of this, and then you can try it out for yourself. OK, here's the URL of that Colab. Pause the video to give it a try for yourself.

This is the code that we were looking at in the video. Here we can see we're going to import the Tokenizer, and I have a number of sentences here: "i love my dog" with a lowercase i, "I, love my cat" with an uppercase I and a comma, and things like "You love my dog!" with an exclamation. We'll then use the tokenizer, fit it on the text of these sentences, and take a look at the word index. If we take a look at the word index, we'll see that the word "love" is indexed as number 1. It became token number 1, and that's because "love" was used the most. It was also "my"; it was used the most as well. There are three of those, just like "love", and you can see them there at the top of the index. So, for example, if I were to add another sentence here, like "Hello hello hello hello I am in a place called Vertigo", and then I were to run it on this, we would see that "hello" is now the top one, because there are four of them, whereas there are three "love"s and three "my"s, that kind of thing. Oh, "I" also now shows up, because there are three "I"s. So this is basically how the tokenizer works. Hopefully this will be useful for you. Have a play with this code, try some examples yourself, and see what you can come up with.

Great, I hope that was easy for you. In this video you got your first taste of NLP using tokenization, where you were able to take sentences and have the words encoded into tokens. The next step on your journey is to replace your sentences with sequences of tokens. You'll see that in the next video, so don't forget to hit that subscribe button, and I'll see you there.
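The word index described in the transcript can be sketched in plain Python. This is a minimal stand-in for illustration, not the Keras Tokenizer itself: the helper names `split_words` and `build_word_index` are hypothetical, and the punctuation handling (every character in `string.punctuation` becomes a space) is a simplifying assumption rather than the library's exact filter list. It only aims to reproduce the behavior the video describes: lowercase everything, drop punctuation, and give the most frequent word token 1.

```python
from collections import Counter
import string

# Assumption: treat every ASCII punctuation character as a separator.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def split_words(sentence):
    """Lowercase a sentence, strip punctuation, and split it on whitespace."""
    return sentence.lower().translate(_PUNCT_TO_SPACE).split()


def build_word_index(sentences):
    """Return {word: token}, where token 1 is the most frequent word."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


# The two-sentence corpus: every word gets a token, and each sentence
# becomes a list of tokens.
two = ["I love my dog", "I love my cat"]
index_two = build_word_index(two)
sequences = [[index_two[w] for w in split_words(s)] for s in two]
print(index_two)   # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(sequences)   # [[1, 2, 3, 4], [1, 2, 3, 5]]

# Adding a third sentence reorders the index: "love" and "my" now occur
# three times each, and "dog!" is counted together with "dog".
three = ["I love my dog", "I, love my cat", "You love my dog!"]
word_index = build_word_index(three)
print(word_index)  # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Ranking by frequency is what makes a vocabulary cap such as the `num_words` parameter useful: keeping only the lowest-numbered tokens keeps the most common words and drops the rare ones.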
Original Description
Machine Learning Foundations is a free training course where you’ll learn the fundamentals of building machine learned models using TensorFlow.
In Episode 8 we’ll switch gears from computer vision and take a look at Natural Language Processing, beginning with tokenization--how a computer can represent language in a numeric format that can be used in training neural networks.
Tokenization example → https://goo.gle/2uO6Gee
TensorFlow is Google’s end-to-end open source machine learning platform. For more videos about TensorFlow, subscribe to the TF YouTube channel → https://goo.gle/TensorFlow
Machine Learning Foundations playlist → https://goo.gle/ml-foundations
Subscribe to Google Developers → https://goo.gle/developers