BERT
Key Takeaways
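The video gives a non-technical overview of BERT, a pretrained model that turns text into a fixed-length numeric vector (an embedding). It explains how BERT learns by predicting masked words from their surrounding context, and how the resulting embeddings support tasks such as document similarity and clustering, sentence disambiguation, named entity recognition, and question answering. Because the model is pretrained on a large general corpus, transfer learning lets a project get by with far less task-specific training data.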
Full Transcript
[Music]

All right, Linda, our topic for today is BERT. That's B-E-R-T, just like the name, which stands for Bidirectional Encoder Representations from Transformers. So I guess it's actually "BERFT," but they went with BERT. Now, that's a very technical name, and I don't want to even get into all the technicalities with you on this show, because we're gonna do a couple of episodes in a row all about BERT to wind down our natural language processing series. So what I'd like to do with you is give you a sense of why BERT is so important, why it's a big milestone, and why it's also really, really cool.

All right, let's jump in. So BERT is a neural network that takes as input any arbitrary length of text. It could be one sentence, could be a paragraph. That's really nice, because text doesn't fit into a form very well, right? It's not like every sentence is a certain number of characters. Now, you have that on Twitter, right, where there's an upper limit, but in general text is of arbitrary length, and machine learning is not good at handling arbitrarily long things in general. So BERT is built on kind of a sequential model, but what's nice about it is its output is a fixed-length vector.

You know, I remember you brought it up on the show, but, you know what, remind me.

So depending on which version of it you choose, you get, like, either a 768- or 1,024-length vector, which is just some numeric representation of that text. And the very surprising thing, or somewhat surprising thing, is that those numbers, which don't have any obvious meaning, are a really good way to do automated feature engineering. They kind of prepare the text into a numeric format so that traditional machine learning techniques can then learn from it very quickly.

So you're saying a fixed vector is something interpreted as numbers?

It is numbers. It's a list of numbers, rational numbers.

And it allows this machine to be able to interpret it easily? It's almost like a translation step into some secret machine language?

Yeah.

OK, so it's just another way to encode something for the machine.

Yes, precisely, in a way that's very amenable to learning. It does that using a trick called masking. Now, we've talked about this before. You might remember we talked about word2vec, where I'd say, you know, there was a sentence with a missing word, and the machine would learn to predict it based on the context. The machine can learn it, like the sentence "My blank was late this week."

My blank... my train?

My check was late this week.

My bus. My dinner was late this week.

Wow, you're all about, like, food and paycheck. So, OK, word2vec would do a good job guessing what fits in that blank. But now let me give you a place where word2vec wouldn't necessarily gain any more advantage. What if the preceding part of that was "I need to call my boss. My blank was late this week."

Well, it's probably a check.

Yeah, more likely a check, but it could be "my report was late this week," "my vendor was late this week." You know, it could be lots of stuff like that. But yeah, "check" seems to fit very nicely.

I mean, if you guys send packages, maybe your package came late.

Oh yeah, good one. "My delivery was late this week." "I need to call my boss. My delivery was late this week." Yeah, so these are all very likely candidates to fit in that blank, and you and I know that from our experience with language. BERT learns kind of the same way. It looks at a massive, massive training corpus and says, "What kind of patterns do I see here?" And from that it's able to kind of learn the context and what surrounding words do to help inform the middle word. That's the same idea as word2vec, which we've had for a while, but what BERT does a little better is, through this transformer process, encode information and keep it around a little bit longer, almost as though it has a memory. So let me continue with our example here. I'll give you two different sentences, and I'm gonna ask you for candidate words at the blank, OK?

All right, I'm ready.

"I am really, really worried about losing my job. I need to call my boss. My blank was late this week."

Well, at that point I'm probably thinking it's a deliverable, probably a report or something, whatever they're on the hook for.

OK, here's another sentence. "My company is going through some weird financial things right now. I need to call my boss. My blank was late this week."

Oh, well, in that case it's probably a check.

Yeah, right. So here we are, two sentences before the blank, and there's information there you need to correctly figure out what goes in the blank.

Well, you know, it helps. It definitely gives us context.

Yeah, and that context seems to be required to solve problems like this. So BERT is leveraging a lot of the best ideas in deep learning, and frameworks and techniques and that sort of thing, to build up, based on a massive training set, a model that translates any text you have into an embedding. It's a numeric representation that contains the concept and what sort of core idea is represented there, and that can then be translated back into words.

Thanks to this week's sponsor, Brilliant.org. Brilliant is a problem-solving website and app with a hands-on approach and over 50 interactive courses. These courses are a mix of storytelling, code writing, interactive challenges, and problems to solve. I'm gonna highlight a few for different types of listeners. If you don't already have these skills, sign up immediately for Computer Science Essentials by visiting brilliant.org/dataskeptic. There you can dive into big ideas in algorithm design. That course is designed for anyone learning computer science for the first time and for programmers looking to deepen their understanding of algorithms. Now, if you'd rather just have fun, but in a challenging and intellectual way, I recommend the Puzzle Science course. It's brand new and very neat. You'll develop a solid foundation in physics while playing with puzzles. And lastly, if neither of those appeals to you, surely you'd enjoy Beautiful Geometry, where you can learn about things like the mathematics of origami.
Effective learning is about problem solving, and Brilliant will help you learn and get practice. You'll come away better at problem solving. So check out brilliant.org/dataskeptic. Once more, that's brilliant.org/dataskeptic.

So this all sounds like a neat idea of, you know, hey, maybe it'd work great, maybe this could understand language, or learn patterns in language. But you've gotta do something empirical, right? Decide, like, is this actually a good framework to solve these sorts of natural language problems? So what they did with BERT was they first trained it as it was intended. They got a really huge corpus and then just trained this unsupervised embedding model that's essentially trying to predict masked words by looking at what comes before and after them, building up a layer of transformers that can hopefully capture some of the knowledge that needs to be there to represent the actual information or ideas that are there. That's what the embedding is supposed to emulate.

The embedding is supposed to emulate ideas?

Yeah, like a numeric version of an idea.

And the embedding happens on the BERT side, or where?

Yeah, BERT is a model. I mean, it's an algorithm, it's all these things, but what you use it for in, like, an industry system or whatever is to take in raw text and get as output a fixed-length vector that's just numbers.

So what are those numbers useful for?

So they can be used in a variety of different ways. One way you could use them is for similarity. That vector can be thought of as a point in n-dimensional space. So just like, you know, we picture 2D things, right, because we can draw it. There's x and y, and there's points on the plot. Can you picture that? There's an X, a Y, and then there's points.

OK.

Typical graphs and stuff, right? Now, maybe you can picture 3D plots too, but I know you can't picture 4D and above, and this is in, like, a 1,024-dimensional space probably. So you can't picture it, but all the mathematics works the same. So think of it in, like, a two-dimensional space. I can take any piece of text and then convert it into two numbers. Now, actually, two numbers is not enough to convey all the information, but we're gonna simplify it here. So you give me a sentence like "I am so mad at this company," and I convert that into the two numbers. Let's say it's 1.1 X and 1.3 Y. And then you could take another sentence like "This is my favorite company ever," and that would convert into two numbers as well, and they would probably be kind of far apart, because those ideas are different. Whereas if you converted another sentence like "I am so disappointed in this corporation," none of the words were the same as my first example, but essentially I said the same thing, right? So you would hope that it assigns, relatively speaking, the same numeric values to both of those sentences, right?

OK, I'm following.

To the degree that that happens, you get to take advantage of it. So you could, say, cluster documents and say these are all similar. But actually you can use this BERT stuff on an extremely wide array of tasks. There's a classic set of problems in natural language processing, like spell checking (well, that's an easy one), but also sentence disambiguation and named entity recognition and stuff like that, that you might want to do against text. Most of the time, historically, people have built a specialized algorithm to try and solve those problems that really tackles that exact use case. So, for example, if you want to build a chatbot that does question answering, you would only work on question answering. You wouldn't try and make the bot so that it could also do part-of-speech tagging. You know what I mean?

OK, I'm following. You got me.

All right, yeah, you build specialized tools. But with BERT, if you just take that vector, that numeric representation, and you use it as features in some machine learning exercise, BERT seems to be beating all the best-known algorithms on all those generic tasks, this wide array of tasks in fact, but it's not specialized to any of them. So it's this much more general-case tool than we've had before in natural language processing. Pretty awesome, huh?

So, I'm not sure. Why is it awesome?

Because if you want to do a project in natural language processing, like make a document classifier or build a special filter for your email or something like that, to do it with machine learning you need a huge amount of training data and a lot of compute power to train a model. Now, your own email, even though it might feel like you get a lot of email in your life, probably in the grand scheme of things it's not that much. It could probably fit on a small disk. So you don't have a lot of examples for training, and you can't get more, right, unless you just want to write emails all day. So, for whatever reason, a lot of problems start from having not that many examples, and that makes starting from scratch on machine learning very hard. BERT gives you, like, a rocket-booster head start by saying, you know, here is this unsupervised model that we've trained on general language. I think it's on the Wikipedia corpus, or maybe... don't quote me on that, but it's on something large like that. And there are other versions of this coming out that are trained on the Reddit corpus and stuff like that. But it's trained on this wide, broad example of text so that we can apply transfer learning, which we talked about a few weeks ago, where essentially you start from a system that has a good ability to detect the context and meaning of certain words, and then all you do is refine it for your use case. So you don't have the burden of starting from scratch and needing a hundred thousand documents. You can get by with maybe a hundred documents, because you've got this head start.

OK, so you're saying it's kind of like a shortcut.

Very much so. It's like the whistle in Mario 3.

I don't remember that. What does the whistle do again?

Jeez, Linda. It calls the tornado.

And why is that tornado good?

Because it takes you to the warp zone. BERT is also a little creepy in how good of a job it can do at predicting next words and, you know, kind of writing things. In short order, and we'll announce this on the show when it's ready, I'm gonna have a demo of that that you can use on our site. So it's not ready yet, but we'll talk about that in the future when that's checked out. So you can actually have a conversation with BERT. I don't think you will find that it passes the Turing test, but it's very interesting and almost a little bit creepy.

All right, well, I look forward to being creeped out.

Well, Linda, have I brought you to a point where you appreciate the milestone which BERT represents?

Well, I think you definitely shed a little bit more light than I had before, so thank you. All right, it's like the whistle in Mario.

Well, that's, yeah, a little bit blunt, but just in terms of how it can accelerate your natural language processing project. It's pretraining you can extend. It takes up a heck of a lot of memory to use, but it's well worth it.

Awesome.

Anyway, thanks as always for joining me, Linda.

Thank you, Kyle.

[Music]
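The similarity idea from the conversation can be sketched in a few lines of Python. This is only an illustration and does not run BERT: the two-number vectors stand in for real embeddings, only the pair (1.1, 1.3) comes from the episode, and the other two vectors, along with the helper names `distance` and `most_similar`, are made up for this example. A real model would return hundreds of numbers per text, but the distance calculation works the same way.

```python
import math

# Hypothetical 2-D "embeddings". Only (1.1, 1.3) is taken from the episode;
# the other two vectors are invented for illustration. A real BERT model
# would output vectors with hundreds of dimensions.
embeddings = {
    "I am so mad at this company": (1.1, 1.3),
    "I am so disappointed in this corporation": (1.0, 1.4),
    "This is my favorite company ever": (-1.2, -0.9),
}


def distance(a, b):
    """Euclidean distance between two points in n-dimensional space."""
    return math.dist(a, b)


def most_similar(query, table):
    """Return the other sentence whose vector is closest to the query's."""
    others = [sentence for sentence in table if sentence != query]
    return min(others, key=lambda s: distance(table[query], table[s]))


if __name__ == "__main__":
    query = "I am so mad at this company"
    # Print each sentence with its distance from the query sentence.
    for sentence, vector in embeddings.items():
        print(f"{distance(embeddings[query], vector):.2f}  {sentence}")
    print("Closest:", most_similar(query, embeddings))
```

With these invented vectors, the "disappointed" sentence lands nearest to the "mad" sentence even though the two share almost no words, which is the behavior the episode hopes a good embedding will show. Clustering documents amounts to grouping points that sit close together in this way.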
Original Description
Kyle provides a non-technical overview of why Bidirectional Encoder Representations from Transformers (BERT) is a powerful tool for natural language processing projects.