GPT - Explained!
Key Takeaways
The video explains the fundamentals of GPT, GPT-2, GPT-3, and ChatGPT, covering topics such as transfer learning, fine-tuning, and meta-learning, with a focus on language modeling and self-supervised learning. It highlights the differences between GPT models, including zero-shot learning, one-shot learning, and few-shot learning, and discusses the advantages and disadvantages of fine-tuning and meta-learning.
Full Transcript
Hello everyone, welcome to another episode of CodeEmporium, where we're going to talk about GPT. I've structured this video as a flow from Transformer neural networks to GPT-3 and then eventually ChatGPT. I'm hoping it'll help you grasp the overall landscape of language modeling. So let's get to it, and for more videos like this, consider subscribing.

Transformers are sequence-to-sequence architectures: they convert one sequence to another. Sequences have a defined ordering. Sentences, for example, are a sequence of words, and so Transformers can be used to solve natural language problems such as text translation. To train these architectures, however, we need a ton of labeled data on that specific task. That makes the task difficult for Transformers, or any other model, to learn. So how would we make it easier for models to learn with less data? Think about it. [Music] Correct, the answer is transfer learning. What a smarty.

So let's combine the Transformer neural network with transfer learning. Transformers have two parts, an encoder and a decoder. Each of them is able to learn a good representation of language, so good that we can create language models from each part. Stack the encoders and you get Bidirectional Encoder Representations from Transformers, that's BERT. Stack the decoder units and you get Generative Pre-trained Transformers, or GPT. Each of these architectures has created its own line of research. In this video I'll be focusing on GPT, but for more information on BERT I have other videos that you can check out.

Now, before we nosedive into GPT, let's talk about transfer learning. Training a model from scratch requires a lot of data because the parameters are randomly initialized. But what if the parameters just happened to be initialized to values that are close to the values we need? In that case, we don't need much data to get to where we need to be. So here's the situation: we have some model with randomly initialized parameters. It's then trained on some first task, and its parameter values are updated by that training. Now this model has some sort of knowledge, so to speak, and we can use this base knowledge to further train with data from another task. This is akin to transferring knowledge from one task to another, hence the name transfer learning. This is the exact idea GPT and BERT use.

In this context, GPT training is divided into two parts. We have pre-training, where we train the GPT architecture to understand what language is, and then fine-tuning, where we use transfer learning to further train the GPT architecture to perform well on specific language tasks. Let's talk a little bit about each.

GPT is pre-trained on the task of language modeling. This is essentially a task where the model is given partial sentences and is made to predict the word that will come next. Why language modeling? It is chosen because it acts as a good base for understanding the fundamentals of language, and the resulting model can be easily fine-tuned. Language modeling is often referred to as a self-supervised task, as the sentences themselves form both the inputs and the output labels. In some papers you might see this called unsupervised learning.

The GPT fine-tuning task depends on what we want the model to do. This could be text translation, question answering, or text summarization, among many others. These are typically supervised tasks for which we provide training data with inputs and labels. This approach works because we end up with a good model that requires less data than we would have needed had we trained the model from scratch.

However, there are some issues with this fine-tuning approach. First, too much data is still required: for every single task we want to accomplish in NLP, we still need to collect a dataset of hundreds of thousands of examples. This limits what we can do with language models. Another issue is overfitting. These models are huge, and the pre-training dataset is broad but the fine-tuning dataset is narrow, which may lead to parameter changes that harm performance. We would need to make sure the distribution of our fine-tuning dataset is a good representation of what we see in the wild. Another issue is that humans learn from just a few examples, whereas fine-tuning requires thousands to hundreds of thousands of examples.

Broadly, the direction we want to take the fields of deep learning and natural language processing is along the lines of human intelligence. Humans learn with just a few examples, not a hundred thousand, to be good at a task. And if we build some system, we want it to be able to context switch very fluidly. For example, we want it to interleave between producing text and computing some small math operations in between, because language sometimes works that way, and we might need to make calculations on the fly while we are talking mid-sentence.

One potential solution to address these concerns is meta-learning. This approach was introduced in the next version of GPT, that is, GPT-2.
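The self-supervised setup described above, where a sentence supplies both the inputs and the labels, can be sketched roughly as follows. This is only an illustration and is not shown in the video: the function name `next_word_pairs` is hypothetical, and splitting on whitespace is a simplifying assumption, since real GPT models operate on subword tokens rather than whole words.

```python
def next_word_pairs(sentence):
    """Turn one sentence into (context, next_word) training pairs.

    Simplifying assumption: words are separated by whitespace.
    Real GPT models use subword tokens instead.
    """
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])  # the partial sentence seen so far
        target = words[i]              # the word the model should predict
        pairs.append((context, target))
    return pairs


pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

No human labeling is involved: every label is just the next word of the same sentence, which is why raw text alone is enough for pre-training.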
GPT-2 is similar to the original GPT model in the sense that it still has the same pre-training phase with language modeling, but instead of the fine-tuning approach we use something called zero-shot learning. Zero-shot learning means that we don't make any parameter updates once the model has been pre-trained. Instead, at inference time, we pass in the input as we usually would, but also pass in a prompt that says what should be done with the input. The issue with this approach is that zero-shot learning is very hard for the model, so we need to scale the architecture up to capture as many patterns in the language as we possibly can during pre-training. GPT-2 was trained with 1.5 billion parameters for this reason. The approach, though, did not perform as well as fine-tuning on a number of benchmarks. However, scaling the architecture did still help performance.

Continuing this line of thought, what would happen if we used the same strategy of meta-learning but scaled the architecture even more? This is what led to the third generation of GPT models. GPT-3 is a large language model with 175 billion parameters. Like its GPT and GPT-2 predecessors, it was pre-trained with the language modeling objective, and it is then applied to tasks with the meta-learning approach, without parameter updates. But instead of just zero-shot learning as in GPT-2, it can use any of several meta-learning techniques: zero-shot learning, one-shot learning, or few-shot learning. So let's talk about each.

Zero-shot learning, as mentioned before, is where we just feed a prompt along with our input. With this, there is less of a chance of picking up strange correlations compared to fine-tuning, and our model is more robust. The disadvantage, though, is that it's really difficult even for humans to start without a single example, so this strategy is considered unfairly hard for the model.

Then we have one-shot learning. Along with what we feed for zero-shot learning, we also feed one example of what we want. All of this is passed together as a single input into what we call the model's context window.

And then we have few-shot learning. It is exactly like one-shot learning, but instead of just one complete example we feed multiple examples. This typically ranges from 10 to 100 examples, or however many fit in the model's context window. Overall, GPT-3 has been pretty good, sometimes even outperforming its fine-tuned counterparts on certain tasks.

In conclusion, I just want to say that fine-tuning and meta-learning have their own advantages and disadvantages. Meta-learning has not clearly supplanted fine-tuning. After all, the first version of ChatGPT, released at the end of November 2022, actually has a fine-tuned GPT model at its core, which shows there is still promise in that direction. Overall, there's always something evolving in the field, so it's exciting to follow along. That's where I'm going to end the video. Thanks so much for watching. Check the description for some fun resources and videos that probably have my face on them, and I will see you all in the next one. Bye.
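To make the difference between zero-shot, one-shot, and few-shot prompting more concrete, here is a minimal sketch of how the three prompts might be assembled as plain text. It is not taken from the video: the helper `build_prompt`, the `=>` separator, and the translation examples are all illustrative assumptions, and no model is actually called.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a text prompt for in-context learning.

    Zero examples gives a zero-shot prompt, one gives one-shot,
    and several give few-shot. The "=>" layout is an assumption
    made for illustration only.
    """
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")  # a solved demonstration
    lines.append(f"{query} =>")                # the model would complete this line
    return "\n".join(lines)


instruction = "Translate English to French:"
demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("cheese", "fromage"),
]

zero_shot = build_prompt(instruction, "good morning")
one_shot = build_prompt(instruction, "good morning", demos[:1])
few_shot = build_prompt(instruction, "good morning", demos)

print(few_shot)
```

In all three cases the model's parameters stay fixed. The only thing that changes is how many worked examples are placed in the context window ahead of the query.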
Original Description
Let's talk about GPT, GPT-2, GPT-3 and ChatGPT in 10 minutes
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
RESOURCES
[1 🔎] GPT-3 main paper: https://arxiv.org/pdf/2005.14165.pdf
[2 🔎] GPT-2 main paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[3 🔎] GPT original paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
[4 🔎] A very nice, intuitive explanation of the GPT-3 architecture: https://dugas.ch/artificial_curiosity/GPT_architecture.html
PLAYLISTS FROM MY CHANNEL
⭕ ChatGPT Playlist of all other videos: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Transformer Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD