Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing

Google for Developers · Beginner · ML Fundamentals · 6y ago

Key Takeaways

Builds Natural Language Processing foundations using tokenization with TensorFlow

Full Transcript

Hi, and welcome to episode 8 of Machine Learning Foundations. I'm Laurence Moroney from the Google AI team, and I'm here to be your guide through the basics of machine learning. Up to now, you've learned how machine learning works and explored examples in computer vision by doing image classification, including understanding concepts such as convolutional neural networks for feature identification and image augmentation to avoid overfitting, making your networks that little bit smarter. We're now going to switch gears and take a look at natural language processing. In this video, we'll take a look at how a computer can represent language, and that's words and sentences, in a numeric format that can then later be used to train neural networks. This process is called tokenization, so let's get started.

Consider this word. It's the English word "listen", and it consists of six letters. We're used to reading it based on the sounds and putting those sounds together to form a word. But how can a computer understand this word? Well, one way, as computers deal better with numbers than they do with letters, is to assign a number to each letter. A common coding format is called ASCII, where common letters and symbols are encoded into the values from 0 to 255. It's useful in that only one byte is needed to store the value for a letter, but it has been superseded by later encodings in order to give access to characters and letters beyond 255, in particular international characters. But for the purposes of illustration we can stick with ASCII, where, for example, the letter L is 76, I is 73, and so on. So we now have the word "listen" encoded into six bytes, one for each letter.

Now, this is a perfectly valid encoding, and often when you use neural networks you'll see character encoding or subword encoding and stuff like that. They lead to things being a little bit more complicated, and in these tutorials I'm going to do word-based encoding and not the letter-based encoding that we just saw. Now, why would I do this? One reason
is that if we're taking a word as a set of numbers, unless we take the sequence of those numbers into account, we can have two words, sometimes with opposite-ish meanings like this, and they can have the same letters. Thus, if we want to use character-based encoding, a computer can't tell the difference between these two words unless we have a sequence model, and that's a little bit more complicated than we need to look into right now. So let's consider a different encoding, and that's a word-based one. That way each of these words can be represented by a single number, and each number will be different. There's also a nice hidden advantage to this, which we'll see in a moment.

So consider this sentence: "I love my dog". It's a pretty straightforward one. If I encode based on words, I can come up with an arbitrary encoding. Say the word "I" is number 1, and then "love", "my", and "dog" become 2, 3, and 4 respectively. Now, if I were to encode another sentence, for example "I love my cat", the words "I love my" already have numbers, so I can just use 1, 2, and 3 again for them, and I can create a new number for "cat", which I'll say is number 5. So now my sentences are 1 2 3 4 and 1 2 3 5. What's interesting here is that now the words are gone and just the tokens for the words are used, we can begin to tell that there's a similarity between the sentences. So maybe we're beginning to get a glimpse at what it might look like to have sentences turned into numbers yet maintain some kind of meaning.

The process I just outlined is called tokenization, and it's an inherent part of doing natural language processing, or NLP. TensorFlow gives you APIs that help you to achieve this very simply. We'll take a look at them next.

Here's all the code that you would need to tokenize the sentences, as I showed earlier. We can break it down and go through it line by line. The tokenizer tools are part of the TensorFlow Keras libraries, and they're in the preprocessing namespace, so make sure you import these. I'm going to hard-code the sentences into
an array. Now, while this is a super simple corpus, there's just two sentences and five unique words, this design pattern can work for much bigger sets of data. You'll soon be working with tens of thousands of sentences with thousands of unique words, and it's all pretty much the same code, so don't worry right now if this looks a little bit too simplistic.

You can then create a tokenizer, with a lowercase t, by simply creating an instance of the Tokenizer, uppercase T, and initializing that with parameters. One of these is the num_words parameter, which specifies the maximum number of words that you want to care about. There's only five unique words here, so it doesn't really make a difference, but with larger sets of text it can. You'll commonly encounter bodies of text with many thousands of unique words in them, and lots of these words may be only used once or twice. By specifying the number of words that you care about in your tokenizer, you get an easy way to filter those out. The tokenizer is smart enough to assign tokens to words based on how commonly used they are in the corpus, so the most common word will be at index 1, the next most common word will be at index 2, etc.

To get the tokenizer to do its job, you can call its fit_on_texts method and pass it your corpus of text. In this case it's our simple array of sentences. To see the word index that the tokenizer created, you can just get the word_index property. This will give you a set of name-value pairs where the name is the word and the value is the token for that word, and then you can just print this to inspect it. When you print it out, it won't necessarily be in any order, but keep an eye on the values. Like I said earlier, the most common words will have the lowest index, and in this set "I love my" appears twice while "dog" and "cat" both appear once. So "I", "love", and "my" are the lower-indexed words, 1, 2, and 3, and "dog" and "cat" are the higher-indexed ones, 4 and 5.

So what if we expand our sentences and then add some more content, like maybe "You love my dog"
with an exclamation mark. And note that exclamation. The default behavior of the tokenizer is to strip punctuation out like this. It can be overridden, but we'll keep it for now. It also makes all of the words become lowercase, so my capital-D "Dog" will be treated in the same way as a lowercase-d "dog". The tokens will now look like this. Notice that they've moved around a little: "love" is now the number 1 token because it's the most used word, and it's similar with "my". Also notice that "dog" lost its exclamation, and there's only one token in here for "dog", and it represents both usages of the word despite the exclamation being on the second one. And we've added a new word, "you", because that was first used in the new sentence that was added to the corpus. So I hope you found that pretty straightforward despite the underlying power in the tokenizer.

Next up, I'm gonna step through a Colab with the code for all of this, and then you can try it out for yourself. OK, here's the URL of that Colab. Pause the video to give it a try for yourself.

This is the code that we were looking at in the video. Here we can see we're going to import the Tokenizer, and I have a number of sentences here: "i love my dog" with a lowercase i, "I, love my cat" with an uppercase I and a comma, and things like "You love my dog!" with an exclamation. We'll then use the tokenizer, and we'll fit it on the text of these sentences and take a look at the word index. So here, if we take a look at the word index, we'll see that the word "love" is indexed as number 1. It became token number 1, and that's because "love" was used the most. It was also "my"; it was used the most as well. There are three of those, just like "love", and you can see them there at the top of the index. So, for example, if I were to add another sentence here, like "Hello, hello, hello, hello, I am in a place called Vertigo", and then I were to run it on this, then we'd see "hello" is now the top one, because there are four of them, where there are three "love"s and three "my"s, that kind of thing. "I" also now
shows up because there are three "I"s. So this is basically how the tokenizer works. Hopefully this will be useful for you. Have a play with this code, try some examples yourself, and see what you can come up with.

Great, I hope that was easy for you. In this video you got your first taste of NLP using tokenization, where you were able to take sentences and have the words encoded into tokens. The next step on your journey is to replace your sentences with sequences of tokens. You'll see that in the next video, so don't forget to hit that subscribe button, and I'll see you there. [Music]
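
The word-index behavior described in the transcript can be approximated in plain Python. What follows is a minimal sketch and not the Keras Tokenizer itself: build_word_index is a hypothetical helper, and the regular expression used to drop punctuation is a simplifying assumption that only handles unaccented letters, digits, and apostrophes. It lowercases each sentence, strips punctuation, counts words, and numbers them from 1 with the most frequent word first.

```python
import re
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer token, most frequent word first.

    Hypothetical helper that mimics the behavior described in the
    transcript; it is not the TensorFlow/Keras implementation.
    """
    counts = Counter()
    for sentence in sentences:
        # Lowercase, then keep only runs of letters, digits, and apostrophes.
        # This drops punctuation such as "!" and "," (a simplification).
        counts.update(re.findall(r"[a-z0-9']+", sentence.lower()))
    # most_common() sorts by count, highest first; words with equal counts
    # stay in the order they were first seen. Tokens start at 1.
    return {word: token
            for token, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = ["I love my dog", "I, love my cat", "You love my dog!"]
print(build_word_index(sentences))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

With only the first two sentences, "i", "love", and "my" tie at two uses each and keep tokens 1, 2, and 3, which matches the first word index in the transcript. The sketch does not model a vocabulary limit such as the num_words parameter mentioned above.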

Original Description

Machine Learning Foundations is a free training course where you’ll learn the fundamentals of building machine learned models using TensorFlow. In Episode 8 we’ll switch gears from computer vision and take a look at Natural Language Processing, beginning with tokenization--how a computer can represent language in a numeric format that can be used in training neural networks.

Tokenization example → https://goo.gle/2uO6Gee

TensorFlow is Google’s end-to-end open source machine learning platform.

For more videos about TensorFlow, subscribe to the TF YouTube channel → https://goo.gle/TensorFlow
Machine Learning Foundations playlist → https://goo.gle/ml-foundations
Subscribe to Google Developers → https://goo.gle/developers
