BERT

Data Skeptic · Beginner · 🧠 Large Language Models · 6y ago

Skills: LLM Foundations 90% · Fine-tuning LLMs 70% · Unsupervised Learning 60%

Key Takeaways

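The video discusses BERT, a pre-trained model that turns raw text into a fixed-length numeric vector, and why it is a milestone for natural language processing projects. It covers how BERT learns by predicting masked words from their surrounding context, and how transfer learning lets you refine the pre-trained model for tasks such as document similarity, sentence disambiguation, and named entity recognition with far fewer training examples.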

Full Transcript

[Music]

Kyle: All right, Linda, our topic for today is BERT. That's B-E-R-T, just like the name, which stands for Bidirectional Encoder Representations from Transformers. So I guess it's actually BERFT, but they went with BERT. Now, that's a very technical name, and I don't even want to get into all the technicalities with you on this show, because we're going to do a couple of episodes in a row all about BERT to wind down our natural language processing series. So what I'd like to do with you is give you a sense of why BERT is so important, why it's a big milestone, and why it's also really, really cool.

Linda: All right, let's jump in.

Kyle: So BERT is a neural network that takes as input any arbitrary length of text. It could be one sentence, it could be a paragraph. That's really nice, because text doesn't fit into a form very well, right? It's not like every sentence is a certain number of characters. Now, you have that on Twitter, where there's an upper limit, but in general text is of arbitrary length, and machine learning is not good at handling arbitrarily long things in general. So BERT is built on kind of a sequential model, but what's nice about it is its output is a fixed-length vector.

Linda: You know, I remember you brought that up in the show, but you know what, remind me.

Kyle: So depending on which version of it you choose, you get, like, either a 768- or 128-length vector, which is just some numeric representation of that text. And the very surprising thing, or somewhat surprising thing, is that those numbers, which don't have any obvious meaning, are a really good way to do automated feature engineering. They kind of prepare the text into a numeric format so that traditional machine learning techniques can then learn it very quickly.

Linda: So you're saying a fixed vector is something interpreted as numbers?

Kyle: It is numbers. It's a list of numbers, [unclear] numbers.

Linda: And it allows this machine to be able to interpret it easily?

Kyle: It's almost like a translation step into some secret machine language.

Linda: OK, so it's just another way to encode something for the machine.

Kyle: Yes, precisely, in a way that's very amenable to learning. It does that using a trick called masking. Now, we've talked about this before. You might remember we talked about word2vec, where I'd say there was a sentence with a missing word, and the machine would learn to predict it based on the context. The machine can learn it, like the sentence "My blank was late this week."

Linda: My blank... my train, my check was late this week. My bus, my dinner was late this week.

Kyle: Wow, you're all about, like, food and paychecks. OK, so word2vec would do a good job guessing what fits in that blank. But now let me give you a place where word2vec wouldn't necessarily gain any more advantage. What if the preceding part of that was "I need to call my boss. My blank was late this week"?

Linda: Well, it's probably a check.

Kyle: Yeah, more likely a check, but it could be "my report was late this week" or "my vendor was late this week." It could be lots of stuff like that. But yeah, check seems to fit very nicely.

Linda: I mean, if you guys send packages, maybe your package came late.

Kyle: Oh yeah, good one. "My delivery was late this week." "I need to call my boss. My delivery was late this week." So these are all very likely candidates to fit in that blank, and you and I know that from our experience with language. BERT learns kind of the same way. It looks at a massive, massive training corpus and asks, "What kind of patterns do I see here?" From that it's able to learn the context and what surrounding words do to help inform the middle word. That's the same idea as word2vec, which we've had for a while, but what BERT does a little better is, through this transformer process, encode information and keep it around a little bit longer, almost as though it has a memory. So let me continue with our example here. I'll give you two different sentences and ask you for candidate words at the blank.

Linda: OK, all right, I'm ready.

Kyle: "I am really, really worried about losing my job. I need to call my boss. My blank was late this week."

Linda: Well, at that point I'm probably thinking it's a deliverable. Probably a report or something, whatever they're on the hook for.

Kyle: OK, here's another sentence. "My company is going through some weird financial things right now. I need to call my boss. My blank was late this week."

Linda: Oh, well, in that case it's probably a check.

Kyle: Yeah, right. So here we are, two sentences before the blank, and there's information there you need in order to correctly figure out what goes in the blank.

Linda: Well, you know, it helps. It definitely gives us context.

Kyle: Yeah, and that context seems to be required to solve problems like this. So BERT is leveraging a lot of the best ideas in deep learning, frameworks, techniques, and that sort of thing to build up, based on a massive training set, a model that translates any text you have into an embedding. It's a numeric representation that contains the concept, the sort of core idea that is represented there, and that can then be translated back into words.

Kyle: Thanks to this week's sponsor, Brilliant.org. Brilliant is a problem-solving website and app with a hands-on approach and over 50 interactive courses. These courses are a mix of storytelling, code writing, interactive challenges, and problems to solve. I'm going to highlight a few for different types of listeners. If you don't already have these skills, sign up immediately for Computer Science Essentials by visiting brilliant.org/dataskeptic. There you can dive into big ideas in algorithm design. That course is designed for anyone learning computer science for the first time and for programmers looking to deepen their understanding of algorithms. Now, if you'd rather just have fun, but in a challenging and intellectual way, I recommend the Puzzle Science course. It's brand new and very neat. You'll develop a solid foundation in physics while playing with puzzles. And lastly, if neither of those appeals to you, surely you'd enjoy Beautiful Geometry, where you can learn about things like the mathematics of origami. Effective learning is about problem solving, and Brilliant will help you learn and get practice. You'll come away better at problem solving. So check out brilliant.org/dataskeptic. Once more, that's brilliant.org/dataskeptic.

Kyle: So this all sounds like a neat idea. Hey, maybe it works great, maybe this could understand language or learn patterns in language. But you've got to do something empirical, right? You have to decide whether this is actually a good framework for solving these sorts of natural language problems. So what they did with BERT was first train it as it was intended. They got a really huge corpus and then just trained this unsupervised embedding model that's essentially trying to predict masked words by looking at what comes before and after them, building up a layer of transformers that can hopefully capture some of the knowledge that needs to be there to represent the actual information or ideas. That's what the embedding is supposed to emulate.

Linda: The embedding is supposed to emulate ideas?

Kyle: Yeah, like a numeric version of an idea.

Linda: And the embedding happens on the BERT side, or where?

Kyle: Yeah, BERT is a model. I mean, it's an algorithm, it's all these things, but what you use it for in, like, an industry system is to take in raw text and get as output a fixed-length vector that's just numbers.

Linda: So what are those numbers useful for?

Kyle: They can be used in a variety of different ways. One way you could use them is for similarity. That vector can be thought of as a point in n-dimensional space. So, just like we picture 2D things, because we can draw them, there's x and y and there are points on the plot. Can you picture that?

Linda: There's an x, a y, and then there are points. OK, typical graphs and stuff.

Kyle: Right. Now, maybe you can picture 3D plots too, but I know you can't picture 4D and above, and this is in, like, a hundred-and-twenty-four-dimensional space, probably. So you can't picture it, but all the mathematics works the same. So think of it in a two-dimensional space. I can take any piece of text and convert it into two numbers. Now, actually, two numbers is not enough to convey all the information, but we're going to simplify it here. So you give me a sentence like "I am so mad at this company," and I convert that into two numbers, let's say 1.1 for x and 1.3 for y. Then you could take another sentence like "This is my favorite company ever," and that would convert into two numbers as well, and they would probably be kind of far apart, because those ideas are different. Whereas if you converted another sentence like "I am so disappointed in this corporation," none of the words were the same as my first example, but essentially I said the same thing, right? So you would hope that it assigns, relatively speaking, the same numeric values to both of those sentences.

Linda: Right, OK, I'm following.

Kyle: To the degree to which that happens, you get to take advantage of it. So you could, say, cluster documents and say these are all similar. But actually you can use this BERT stuff on an extremely wide array of tasks. There's a classic set of problems in natural language processing, like spell checking (well, that's an easy one), but also sentence disambiguation and named entity recognition and stuff like that, that you might want to do against text. Most of the time, historically, people have built a specialized algorithm that tackles that exact use case. So, for example, if you want to build a chatbot that does question answering, you would only work on question answering. You wouldn't try to make the bot so that it could also do part-of-speech tagging. You know what I mean?

Linda: OK, I'm following. You got me.

Kyle: All right. Yeah, you build specialized tools. But with BERT, if you just take that vector, that numeric representation, and use it as features in some machine learning exercise, BERT seems to be beating all the best-known algorithms on all those generic tasks, this wide array of tasks, in fact. But it's not specialized to any of them. So it's a much more general-case tool than we've had before in natural language processing. Pretty awesome, huh?

Linda: So I'm not sure, why is it awesome?

Kyle: Because if you want to do a project in natural language processing, like make a document classifier or build a special filter for your email, to do it with machine learning you need a huge amount of training data and a lot of compute power to train a model. Now, your own email, even though it might feel like you get a lot of email in your life, in the grand scheme of things is probably not that much. It could probably fit on a small disk. So you don't have a lot of examples for training, and you can't get more unless you just want to write emails all day. For whatever reason, a lot of problems start from having not that many examples, and that makes starting from scratch on machine learning very hard. BERT gives you, like, a rocket-booster head start by saying, "Here is this unsupervised model that we've trained on general language." I think it's on the Wikipedia corpus, maybe, and don't quote me on that, but it's on something large like that. There are other versions of this coming out trained on the Reddit corpus and stuff like that, but it's trained on this wide, broad sample of text so that we can apply transfer learning, which we talked about a few weeks ago. Essentially, you start from a system that has a good ability to detect the context and meaning of certain words, and then all you do is refine it for your use case. So you don't have the burden of starting from scratch and needing a hundred thousand documents. You can get by with maybe a hundred documents, because you've got this head start.

Linda: OK, so you're saying it's kind of like a shortcut.

Kyle: Very much so. It's like the whistle in Mario 3.

Linda: I don't remember that. What does the whistle do again?

Kyle: Jeez, Linda. It calls the tornado.

Linda: And why is that tornado good?

Kyle: Because it takes you to the warp zone. BERT is also a little creepy in how good a job it can do at predicting next words and, you know, kind of writing things. In short order, and we'll announce this on the show when it's ready, I'm going to have a demo of that which you can use on our site. It's not ready yet, but we'll talk about that in the future when it's checked out. So you can actually have a conversation with BERT. I don't think you will find that it passes the Turing test, but it's very interesting and almost a little bit creepy.

Linda: All right, well, I look forward to being creeped out.

Kyle: Well, then, have I brought you to a point where you appreciate the milestone which BERT represents?

Linda: Well, I think you definitely shed a little bit more light than I had before, so thank you. All right, it's like the whistle in Mario.

Kyle: Well, that's, yeah, a little bit blunt, but that's just in terms of how it can accelerate your natural language processing project. It's a pre-training you can extend. It takes up a heck of a lot of memory to use, but it's well worth it. Awesome. Anyway, thanks as always for joining me, Linda.

Linda: Thank you, Kyle.

[Music]
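
The fill-in-the-blank exchange above can be imitated with a very small toy. The sketch below is not BERT and does not use a transformer; it is a count-based stand-in, assuming a made-up six-row "corpus" (`CORPUS`) and a hypothetical helper `predict_masked`, meant only to show how earlier sentences can change the best guess for a masked word.

```python
from collections import Counter

# Hypothetical miniature "training corpus" (an assumption for illustration).
# Each row pairs an earlier context sentence with the word that filled the
# blank in "my ___ was late this week".
CORPUS = [
    ("worried about losing my job", "report"),
    ("worried about my job review", "report"),
    ("worried about losing my job", "deliverable"),
    ("company has weird financial trouble", "check"),
    ("company financial things look weird", "check"),
    ("company payroll is behind", "paycheck"),
]


def predict_masked(context: str) -> str:
    """Return the filler word whose past contexts share the most words
    with the given context. A crude proxy for learned context, not BERT."""
    words = set(context.lower().split())
    scores = Counter()
    for seen_context, filler in CORPUS:
        # Count how many context words overlap with this training row.
        scores[filler] += len(words & set(seen_context.split()))
    return scores.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict_masked("I am really worried about losing my job"))
    print(predict_masked("my company is going through some weird financial things"))
```

With this toy data the job-worry context favors a deliverable-style word and the company-finances context favors a payment-style word, which mirrors the two guesses in the conversation. A real masked language model learns such associations from a very large corpus rather than from word-overlap counts.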

Original Description

Kyle provides a non-technical overview of why Bidirectional Encoder Representations from Transformers (BERT) is a powerful tool for natural language processing projects.

This video provides a non-technical overview of BERT, its applications, and its potential in natural language processing projects. BERT is a powerful tool that can be used for various tasks such as spell checking, sentence disambiguation, and named entity recognition. The video discusses how BERT uses a transformer process to encode information and keep it around longer, almost like having a memory, and how it can be fine-tuned for specific tasks.

Key Takeaways

Understand the basics of BERT
Apply BERT in NLP tasks
Fine-tune BERT models
Use BERT for similarity tasks
Use BERT for spell checking, sentence disambiguation, and named entity recognition
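
The similarity takeaway can be made concrete with the two-dimensional simplification used in the episode. In this minimal sketch, only the point (1.1, 1.3) comes from the transcript; the other two coordinates, the `embeddings` dictionary, and the helpers `distance` and `nearest` are assumptions chosen for illustration. Real BERT vectors have hundreds of dimensions, but the distance arithmetic is the same.

```python
import math

# Toy 2-D "embeddings". Only the first point is taken from the episode;
# the other two are assumed values picked to illustrate the idea.
embeddings = {
    "I am so mad at this company": (1.1, 1.3),
    "I am so disappointed in this corporation": (1.0, 1.4),  # assumed
    "This is my favorite company ever": (-1.2, -0.9),        # assumed
}


def distance(a: str, b: str) -> float:
    """Euclidean distance between the vectors of two known sentences."""
    return math.dist(embeddings[a], embeddings[b])


def nearest(query: str) -> str:
    """Return the other sentence whose vector lies closest to the query's."""
    others = [s for s in embeddings if s != query]
    return min(others, key=lambda s: distance(query, s))


if __name__ == "__main__":
    query = "I am so mad at this company"
    for sentence in embeddings:
        if sentence != query:
            print(f"{distance(query, sentence):.2f}  {sentence}")
    print("nearest:", nearest(query))
```

Euclidean distance is used here because the episode talks about points being "far apart"; cosine similarity is another common choice for comparing embeddings.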

💡 BERT's ability to generate a fixed-length vector from raw text makes it a powerful tool for various NLP tasks, and its pre-trained model gives a head start in machine learning by applying transfer learning.
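
One way to picture the head start is to treat the fixed-length vectors as ready-made features for a very simple classifier trained on a handful of labeled examples. The sketch below assumes the embeddings have already been computed elsewhere; the three-dimensional vectors in `TRAIN`, the labels, and the helpers `fit_centroids` and `classify` are all hypothetical. For scale, the published BERT-base and BERT-large checkpoints produce 768- and 1,024-dimensional vectors respectively.

```python
from statistics import mean

# Hypothetical pre-computed embeddings (3 numbers each, for readability)
# paired with labels. In practice these would come from a pre-trained model.
TRAIN = [
    ((0.9, 0.1, 0.2), "complaint"),
    ((0.8, 0.2, 0.1), "complaint"),
    ((0.1, 0.9, 0.3), "praise"),
    ((0.2, 0.8, 0.2), "praise"),
]


def fit_centroids(examples):
    """Average the vectors of each label into a single centroid per label."""
    by_label = {}
    for vector, label in examples:
        by_label.setdefault(label, []).append(vector)
    return {
        label: tuple(mean(dim) for dim in zip(*vectors))
        for label, vectors in by_label.items()
    }


def classify(vector, centroids):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


if __name__ == "__main__":
    centroids = fit_centroids(TRAIN)
    print(classify((0.85, 0.15, 0.15), centroids))
    print(classify((0.15, 0.90, 0.25), centroids))
```

A nearest-centroid rule is only one possible downstream model, chosen here because it needs very few examples; logistic regression or fine-tuning the pre-trained network itself are common alternatives.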
