BERT

Data Skeptic · Beginner ·🧠 Large Language Models ·6y ago

Key Takeaways

The video gives a non-technical overview of BERT, a pre-trained language model that turns text into a fixed-length numeric vector. Those vectors can serve as features for tasks such as document similarity, sentence disambiguation, and named entity recognition, and transfer learning lets a project start from the pre-trained model instead of training from scratch.

Full Transcript

[Music]

Kyle: All right, Linda, our topic for today is BERT. That's B-E-R-T, just like the name, which stands for Bidirectional Encoder Representations from Transformers. So I guess it's actually BERFT, but they went with BERT. Now, that's a very technical name, and I don't want to even get into all the technicalities with you on this show, because we're gonna do a couple of episodes in a row all about BERT to wind down our natural language processing series. So what I'd like to do with you is give you a sense of why BERT is so important, why it's a big milestone, and why it's also really, really cool. All right, let's jump in.

So BERT is a neural network that takes as input any arbitrary length of text. It could be one sentence, it could be a paragraph. That's really nice, because text doesn't fit into a form very well, right? It's not like every sentence is a certain number of characters. Now, you have that on Twitter, right, where there's an upper limit, but in general text is of arbitrary length, and machine learning is not good at handling arbitrarily long things in general. So BERT is built on kind of a sequential model, but what's nice about it is its output is a fixed-length vector.

Linda: You know, I remember you brought it up in the show, but you know what, remind me.

Kyle: So depending on which version of it you choose, you get like either 768 or 128 length of a vector, which is just some numeric representation of that text. And the very surprising thing, or somewhat surprising thing, is that those numbers, which don't have any obvious meaning, are a really good way to do automated feature engineering. They kind of prepare the text into a numeric format so that then traditional machine learning techniques can learn it very quickly.

Linda: So you're saying a fixed vector is something interpreted as numbers?

Kyle: It is numbers. It's a list of numbers, [unclear] numbers.

Linda: And it allows this machine to be able to interpret it easily?

Kyle: It's almost like a translation step into some secret machine language.

Linda: Yeah, okay. So it's just another way to encode something for the machine.

Kyle: Yes, precisely, in a way that's very amenable to learning. It does that using a trick called masking. Now, we've talked about this before. You might remember we talked about word2vec, where I'd say, you know, there was a sentence with a missing word and the machine would learn to predict it based on the context. The machine can learn it. Like the sentence "my blank was late this week."

Linda: My blank. My train. My check was late this week. My bus. My dinner was late this week.

Kyle: Wow, you're all about, like, food and paycheck. So okay, word2vec would do a good job guessing what fits in that blank. But now let me give you a place where word2vec wouldn't necessarily gain any more advantage. What if the preceding parts of that were "I need to call my boss, my blank was late this week"?

Linda: Well, it's probably a check.

Kyle: Yeah, more likely a check, but it could be "my report was late this week," "my vendor was late this week." You know, it could be lots of stuff like that. But yeah, check seems to fit very nicely.

Linda: I mean, if you guys send packages, maybe your package came late.

Kyle: Oh yeah, good one. "My delivery was late this week." "I need to call my boss, my delivery was late this week." Yeah. So these are all very likely candidates to fit in that blank, and you and I know that from our experience with language. BERT learns kind of the same way. It looks at a massive, massive training corpus and says, what kind of patterns do I see here? And from that it's able to kind of learn the context and what surrounding words do to help inform the middle word. That's the same idea as word2vec, which we've had for a while, but what BERT does a little better is, through this transformer process, encode information and keep it around a little bit longer, almost as though it has a memory. So let me continue with our example here. I'll give you two different sentences and I'm gonna ask you for candidate words at the blank, okay?

Linda: All right, I'm ready.

Kyle: "I am really, really worried about losing my job. I need to call my boss, my blank was late this week."

Linda: Well, at that point I'm probably thinking it's a deliverable. Probably a report or something, whatever they're on the hook for.

Kyle: Okay, here's another sentence. "My company is going through some weird financial things right now. I need to call my boss, my blank was late this week."

Linda: Oh, well, in that case it's probably a check.

Kyle: Yeah, right. So here we are two sentences before the blank, and there's information there you need to correctly figure out what goes in the blank.

Linda: Well, you know, it helps. It definitely gives us context.

Kyle: Yeah, and that context seems to be required to solve problems like this. So BERT is leveraging a lot of the best ideas in deep learning and frameworks and techniques and that sort of thing to build up, based on a massive training set, a model that translates any text you have into an embedding. So it's a numeric representation that contains the concept and what sort of core idea is represented there, and that can then be translated back into words.

Thanks to this week's sponsor, Brilliant.org. Brilliant is a problem-solving website and app with a hands-on approach and over 50 interactive courses. These courses are a mix of storytelling, code writing, interactive challenges, and problems to solve. I'm gonna highlight a few for different types of listeners. If you don't already have these skills, sign up immediately for Computer Science Essentials by visiting brilliant.org/dataskeptic. There you can dive into big ideas in algorithm design. That course is designed for anyone learning computer science for the first time and programmers looking to deepen their understanding of algorithms. Now, if you'd rather just have fun, but in a challenging and intellectual way, I recommend the Puzzle Science course. It's brand new, very neat. You'll develop a solid foundation in physics while playing with puzzles. And lastly, if neither of those appeal to you, surely you'd enjoy Beautiful Geometry, where you can learn about things like the mathematics of origami. Effective learning is about problem solving, and Brilliant will help you learn and get practice. You'll come away better at problem solving. So check out brilliant.org/dataskeptic. Once more, that's brilliant.org/dataskeptic.

So this all sounds like a neat idea of, you know, hey, maybe it works great, maybe this could understand language or learn patterns in language. But you've got to do something empirical, right? Decide, like, is this actually a good framework to solve these sorts of natural language problems? So what they did with BERT was they first trained it as it was intended. They got a really huge corpus and then just trained this unsupervised embedding model that's essentially trying to predict masked words by looking at what comes before and after, and building up a layer of transformers that can hopefully capture some of the knowledge that needs to be there to represent the actual information or ideas that are there. That's what the embedding is supposed to emulate.

Linda: The embedding is supposed to emulate ideas?

Kyle: Yeah, like a numeric version of an idea.

Linda: And the embedding happens on the BERT side, or where?

Kyle: Yeah, BERT is a model. I mean, it's an algorithm, it's all these things, but what you use it for in, like, an industry system or whatever is to take in raw text and get as output a fixed-length vector that's just numbers.

Linda: So what are those numbers useful for?

Kyle: So they can be used in a variety of different ways. One way you could use them is for similarity. That vector can be thought of as a point in n-dimensional space. So just like, you know, we picture 2D things, right, because we can draw it. There's x and y and there's points on the plot. Can you picture that?

Linda: There's an x and a y, and then there's points. Okay.

Kyle: Typical graphs and stuff, right? Now, maybe you can picture 3D plots too, but I know you can't picture 4D and above, and this is in like a hundred and twenty-four dimensional space, probably. So you can't picture it, but all the mathematics works the same. So think of it in like a two-dimensional space. I can take any piece of text and then convert it into two numbers. Now, actually two numbers is not enough to convey all the information, but we're gonna simplify it here. So you give me a sentence like "I am so mad at this company," and I convert that into the two numbers. Let's say it's 1.1 x and 1.3 y. And then you could take another sentence like "this is my favorite company ever," and that would convert into two numbers as well, and they would probably be kind of far apart, because those ideas are different. Whereas if you converted another sentence like "I am so disappointed in this corporation," none of the words were the same as my first example, but essentially I said the same thing, right? So you would hope that it assigns, relatively speaking, the same numeric values to both of those sentences, right?

Linda: Okay, I'm following.

Kyle: To the degree that happens, you get to take advantage of it. So you could, say, cluster documents and say these are all similar. But actually you can use this BERT stuff on an extremely wide array of tasks. So there's a classic set of problems in natural language processing, like spell checking (well, that's an easy one), but sentence disambiguation and named entity recognition and stuff like that, that you might want to do against text. Most of the time, historically, people have built a specialized algorithm to try and solve those problems that really tackles that exact use case. So for example, if you want to build a chatbot that does question answering, you would only work on question answering. You wouldn't try and make the bot so that it could also do part-of-speech tagging. You know what I mean?

Linda: Okay, I'm following. You got me.

Kyle: All right. Yeah, you build specialized tools. But with BERT, if you just take that vector, that numeric representation, and you use it as features in some machine learning exercise, BERT seems to be beating all the best known algorithms on all those generic tasks, this wide array of tasks in fact, but it's not specialized to any of them. So it's this much more general-case tool than we've had before in natural language processing. Pretty awesome, huh?

Linda: So I'm not sure. Why is it awesome?

Kyle: Because if you want to do a project in natural language processing, like make a document classifier or build a special filter for your email or something like that, to do it with machine learning you need a huge amount of training data and a lot of compute power to train a model. Now, your own email, even though it might feel like you get a lot of email or something like that in your life, probably in the grand scheme of things it's not that much. It could probably fit on a small disk. So you don't have a lot of examples for training, and you can't get more, right, unless you just want to write emails all day. So for whatever reason, a lot of problems start from having not that many examples, and that makes starting from scratch on machine learning very hard. BERT gives you like a rocket booster head start by saying, you know, here is this unsupervised model that we've trained on general language. I think it's on the Wikipedia corpus, or maybe, and don't quote me on that, but it's on something large like that. And there are other versions of this that are coming out trained on the Reddit corpus and stuff like that. But it's trained on this wide, broad example of text so that we can apply transfer learning, which we talked about a few weeks ago, where essentially you start from a system that has a good ability to detect the context and meaning of certain words, and then all you do is refine it for your use case. So you don't have the burden of starting from scratch and needing a hundred thousand documents. You can get by with maybe a hundred documents because you got this head start.

Linda: Okay, so you're saying it's kind of like a shortcut.

Kyle: Very much so. It's like the whistle in Mario 3.

Linda: I don't remember that. What does the whistle do again?

Kyle: Jeez, Linda. It calls the tornado.

Linda: And why is that tornado good?

Kyle: Because it takes you to the warp zone. BERT is also a little creepy in how good of a job it can do at predicting next words and, you know, kind of writing things. In short order (and we'll announce this on the show when it's ready) I'm gonna have a demo of that that you can use on our site. So it's not ready yet, but we'll talk about that in the future when that's checked out. So you can actually have a conversation with BERT. I don't think you will find that it passes the Turing test, but it's very interesting and almost a little bit creepy.

Linda: All right, well, I look forward to being creeped out.

Kyle: Well then, have I brought you to a point where you appreciate the milestone which BERT represents?

Linda: Well, I think you definitely shed a little bit more light than I had before, so thank you. All right, it's like the whistle in Mario.

Kyle: Well, that's, yeah, a little bit blunt, but just in terms of how it can accelerate your natural language processing project. But yeah, it's a pre-training you can extend. It takes up a heck of a lot of memory to use, but it's well worth it.

Linda: Awesome.

Kyle: Anyway, thanks as always for joining me, Linda.

Linda: Thank you, Kyle.

[Music]
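As a rough illustration of the masking idea from the episode, the sketch below builds one fill-in-the-blank training example from Kyle's sentence. It is a toy under stated assumptions: `make_masked_example` is a hypothetical helper, words are split on spaces, and the position to hide is chosen by hand. BERT itself works on subword tokens and picks the positions to mask at random during pre-training.

```python
MASK = "[MASK]"

def make_masked_example(sentence, position):
    """Hide the word at `position` and return (masked_sentence, hidden_word)."""
    words = sentence.split()
    hidden = words[position]
    masked = words[:position] + [MASK] + words[position + 1:]
    return " ".join(masked), hidden

# Sentences taken from the episode; "check" is the word to be predicted.
context = "my company is going through some weird financial things right now"
target = "I need to call my boss my check was late this week"

# Word index 7 of the target sentence is "check".
masked, answer = make_masked_example(target, 7)

print(context + " " + masked)  # model input: earlier context plus the blank
print("label:", answer)        # the word the model is trained to recover
```

A model trained this way sees the masked text as input and is scored on whether it recovers the label. The earlier sentence about the company's finances is part of that input, which is how context from two sentences back can influence the guess.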

Original Description

Kyle provides a non-technical overview of why Bidirectional Encoder Representations from Transformers (BERT) is a powerful tool for natural language processing projects.

This video provides a non-technical overview of BERT, its applications, and its potential in natural language processing projects. BERT is a powerful tool that can be used for various tasks such as spell checking, sentence disambiguation, and named entity recognition. The video discusses how BERT uses a transformer process to encode information and keep it around longer, almost like having a memory, and how it can be fine-tuned for specific tasks.
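One simple reading of "use that vector as features" from the episode is to feed the fixed-length vectors into a small, ordinary classifier. The sketch below does that with a nearest-centroid rule. Everything in it is an assumption for illustration: the 2-D vectors are invented stand-ins for real embeddings, the labels `complaint` and `praise` are made up, and `centroids` and `classify` are hypothetical helpers. It shows the feature-based route only, not fine-tuning of the model itself.

```python
import math

# Invented 2-D vectors standing in for text embeddings, with made-up labels.
train = [
    ((1.1, 1.3), "complaint"),
    ((1.2, 1.2), "complaint"),
    ((-0.9, -1.0), "praise"),
    ((-1.1, -0.8), "praise"),
]

def centroids(examples):
    """Average the vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(x / counts[lab] for x in acc) for lab, acc in sums.items()}

def classify(vec, cents):
    """Return the label whose centroid is closest to `vec`."""
    return min(cents, key=lambda lab: math.dist(vec, cents[lab]))

cents = centroids(train)
print(classify((1.0, 1.0), cents))
```

The point of the sketch is the small training set: with only four labeled examples the classifier has something to work with, because the hard part (turning text into meaningful numbers) is assumed to have been done by the pre-trained model.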

Key Takeaways
  1. Understand the basics of BERT
  2. Apply BERT in NLP tasks
  3. Fine-tune BERT models
  4. Use BERT for similarity tasks
  5. Use BERT for spell checking, sentence disambiguation, and named entity recognition
💡 BERT's ability to generate a fixed-length vector from raw text makes it a powerful tool for various NLP tasks, and its pre-trained model gives a head start in machine learning by applying transfer learning.
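The similarity use case can be sketched with the two-number simplification Kyle uses in the episode. Only the point (1.1, 1.3) comes from the transcript. The other two vectors are invented so that the paraphrase lands nearby and the opposite sentiment lands far away, and `nearest` is a hypothetical helper. Real BERT vectors have hundreds of dimensions, but the distance arithmetic is the same.

```python
import math

# Toy 2-D "embeddings". Only (1.1, 1.3) is from the episode; the rest are invented.
embeddings = {
    "I am so mad at this company": (1.1, 1.3),
    "I am so disappointed in this corporation": (1.2, 1.2),
    "this is my favorite company ever": (-0.9, -1.0),
}

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.dist(a, b)

def nearest(query):
    """Return the other sentence whose vector is closest to the query's vector."""
    others = [s for s in embeddings if s != query]
    return min(others, key=lambda s: distance(embeddings[query], embeddings[s]))

print(nearest("I am so mad at this company"))
```

With these made-up values the closest sentence to the first one is the "disappointed" paraphrase, even though the two share almost no words, which is the behavior the episode says you would hope for from a good embedding.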
