Convolution in NLP

CodeEmporium · Advanced · 🧬 Deep Learning · 2y ago

Skills: LLM Foundations 80% · LLM Engineering 60%

Key Takeaways

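The video discusses how convolution is applied in Natural Language Processing (NLP), from time delay neural networks for phoneme detection in 1989 to dynamic convolutional neural networks for sentences. It also explains why the limits of convolution on long-range dependencies led to LSTMs and to the attention-based Transformers behind Large Language Models (LLMs).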

Full Transcript

Hello everyone. In this video we're going to talk about how convolution is used in natural language processing, and yes, it's used in more than just computer vision. So let's get to it.

Convolution is the application of one function onto another function in order to get a third function, and the operation is an element-wise product followed by a sum. Let's look at an actual example. Say we have some function that produces a list A and a second function that produces a list B, and we want to perform the convolution operation. We have A given over here in red, and we slide the elements of B across it one at a time, taking the sum of products at each position. In the first position only 1 times 5 overlaps, since everything else is padded with zero, so the result is 5. Next we have 6 times 1 plus 5 times 2, which is 16. Then 7 times 1 plus 6 times 2 plus 5 times 3, which is 34. We keep sliding until the last element of the first function aligns with the last element of the second function, which in this case is 4 times 8, or 32.
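A minimal sketch of the wide convolution worked through above, in plain Python. The lists `a = [1, 2, 3, 4]` and `b = [5, 6, 7, 8]` are not stated outright in the transcript; they are inferred from the spoken arithmetic (1×5, 6×1 + 5×2, and so on up to 4×8), and the function name `wide_convolution` is made up for illustration.

```python
def wide_convolution(a, b):
    """Full ("wide") 1D convolution of two lists.

    Every element of a meets every element of b; positions where the
    lists do not overlap behave as if they were padded with zeros.
    """
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# Assumed inputs, reconstructed from the arithmetic in the transcript.
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

result = wide_convolution(a, b)
print(result)       # [5, 16, 34, 60, 61, 52, 32]
print(len(result))  # 7, i.e. len(a) + len(b) - 1
```

The first, second, third, and last outputs (5, 16, 34, 32) match the values read out in the video; the middle three follow from the same rule.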
And so the output here is a list of seven elements. In the neural network and deep learning literature, this kind of traditional convolution is called a wide convolution, because we start by matching the first element of the first function with the first element of the second function to form the convolution.

This is a little different from the convolutions that we typically see in image processing. For example, we have an input image, which is this green matrix, and on top of it we slide a kernel, or filter, which is this yellow matrix. We take an element-wise product and then sum the results to get the individual elements of this pink matrix, which is the convolved feature. One thing to note is that this is not the same as the wide convolution we saw before: the filter is always entirely within the image, whereas for a wide convolution we would have padded the image so that the filter can slide across every single element equally. The idea of convolution here, and in a lot of deep learning applications, is that by changing the filter values and applying the filter to an image, we can extract different features of that image.

While convolutional neural networks are very useful for processing image data, they are also useful for processing temporal data, that is, data that has some sequence or ordering to it, such as language or speech. One of the earliest uses of convolution in a neural network was in 1989, for phoneme detection: for a given input sound wave, we determine which phoneme, or sound, was produced. It's a classification problem. We take a raw speech wave and chunk it up into 10-millisecond slices, and each 10-millisecond slice is encoded as a 16-dimensional vector, so it's represented by 16 numbers
and those 16 numbers are mel-scale filter bank coefficients.

Let's talk about how convolution is applied in this architecture to do phoneme detection. With convolution, we know we have to perform an element-wise product followed by a summation. Here we take a sliding window of size 3, so we're dealing with three elements at a time. Each element is one sliver of the sound wave: this sliver here is a 10-millisecond piece of the sound wave, which is a 1×16 vector, then we have another 1×16 vector, and a third 1×16 vector. We need to convolve these with three other elements, and you can think of those as matrices of size 16×8. So we have three vectors of 1×16, and we convolve them with three matrices of 16×8. When you take the product, a 1×16 vector multiplied by a 16×8 matrix gives a 1×8 vector in each case, so we end up with three 1×8 vectors. Then we take the sum, and summing three 1×8 vectors gives another 1×8 vector. So if we perform the convolution on just this box over here, this region right here, we get this 1×8 vector over here. That's an element-wise product followed by a summation. One thing to note is that the number of parameters in the sliding window is 3 times 16 times 8, which is 384.

Next we slide the window temporally by one unit. Let's change the color a little so we can see it. We now come to this region over here and apply the same convolution operation: a product to get three 1×8 vectors, which we sum to get one 1×8 vector, and that is the second element, right over here. We slide by another time step, and we do
this and so on until we get this final vector over here. Now we have another layer of vectors, where time runs in this direction and each vector is represented by eight elements in this direction. We perform another convolution operation, but this time with a window size of five instead of the three we had in the first layer, and we convolve with five matrices of size 8×3. When we do the convolution, we get five 1×3 vectors, and summing them gives just one 1×3 vector. In the same way, we slide the window to create every single 1×3 vector over here. At this stage we ignore the time dimension and take the maximum value across the entire time dimension, and we do that three times, once per output dimension. Once we have these max values, we can determine which class the input belongs to: whether we said a "b", a "d", or a "g".

For a more detailed explanation of time delay neural networks, I've written an accompanying blog post that describes exactly the operation I just discussed, with more written detail and more mathematics. If you're curious, I highly suggest you check it out; the link is in the description below.

These time delay neural networks are fantastic, as they give us a way to process temporal data, something neural networks weren't able to do in the past. However, at that time, in 1989, these networks could really only solve very simple language tasks such as phoneme classification, and nothing more complex. A lot of this was down to limited software and hardware, but much of that changed over the coming decades. For example, on the software front, in 2001 we saw the introduction of the paper "A Neural Probabilistic Language Model". Before this, word vectors used to be represented by very sparse
matrices, which made them very difficult for computers to process. With this paper, individual words could be represented by dense, continuous vectors. This means that words like "mouse", "king", and "queen" would be represented by vectors that are much more tractable, say 64 or 128 dimensions, something a computer can process much more easily. On top of that, these word vectors, if learned properly, can encode meaning, which means that words with similar meanings end up closer to each other. So "king" and "queen" would be represented by vectors that are closer to each other than either is to, say, "mouse", which is very different in meaning.

Apart from the software advances, there was a really instrumental hardware change that affected how NLP evolved: the introduction of CUDA by NVIDIA in 2007. CUDA acts as an interface between the developer and the GPU, and so it lets us make use of the advantage that GPUs provide, which is parallel computation. With the rise of neural networks, which are designed to process inputs and data in parallel, this became a huge game changer that revolutionized not just natural language processing but deep learning itself.

And so, with the accompanying software changes, the hardware changes, and the availability of more and more data, between 2008 and 2011 we saw a renaissance of time delay neural networks, this time applied to much more complicated NLP tasks. Let's talk a little about exactly how these time delay neural networks were used to solve those tasks. The cool thing about time delay neural networks is that, unlike traditional neural networks, they can process sequential and temporal data. Say we have the input sentence "the cat sat on the mat". Each of these words, as we mentioned, can be represented by
continuous dense vectors, and that is what this LT_W1 is here. This is the continuous dense vector representation of the padding, this is the representation of the word "the", this is the word "cat", and these are "sat", "on", and "mat", and so on. Aside from this core vector that represents the meaning of the word, we can encode other features of the word. For example, we can encode the part of speech of "the", "cat", or "sat" as a second feature, with its own vector representation. A third feature could be the stemmed version of each word, which can also be represented as a vector. We would then have, say, K such features, which we concatenate into a really tall, long vector of size d that represents each and every word here. Note that each of these embeddings and parameters needs to be learned, and they are learned during the training process.

From this lookup table of values, we perform a convolution operation. The convolution consists of this kernel M1, which we apply, in this case, to three elements at a time: an element-wise product followed by a summation. This is essentially just three matrices, like we talked about for the original time delay neural network from 1989. When we apply the convolution with a sliding window, we get a new set of vectors. After the convolution, we ignore the time dimension and perform max pooling; that is, for every one of the dimensions we take the highest activation value. With the max pooling operation, we end up with a fixed-size vector of size n1_hu. This is really interesting, because no matter what the sequence length of the input is, however many words there are in the
sentence, we always end up with the same fixed-length vector. And once we have a fixed-length vector, we can layer on top of it any of the traditional feed-forward, fully connected neural network layers. So in order to learn more complex relationships, we can add a linear layer followed by an activation and another linear layer, for example, and even an additional convolution. After this, we could feed the result into a softmax layer to perform, say, part-of-speech tagging, or we could tune it to become a language model, or use it for any other natural language task.

The cool things about this time delay neural network approach to solving more complex natural language tasks are that the ordering of words is considered; that parameters are shared, which is just the nature of the convolution operation, where the learned parameters are the context window and that same window is slid across the input; and that the sentences can be of varying length.

The cons, however, are that it can be a pretty complex operation, especially if all you're trying to do is learn the word embeddings, and that max pooling may be an oversimplifying operation: we're only grabbing one activation from a large sequence, but within that sequence there could have been multiple very strong activations, so we might be missing out on some signal. The first con, the complexity of learning word embeddings, is largely solved by the word2vec architecture, which was introduced later, in 2013.
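The sliding-window convolution followed by max pooling over time, as described above, can be sketched in plain Python. This is an illustration under stated assumptions rather than either paper's implementation: the sizes (window 3, 16 inputs, 8 outputs) are borrowed from the 1989 phoneme example in the transcript, the weights are random placeholders instead of learned values, and biases and activation functions are left out. The function names are hypothetical.

```python
import random


def temporal_convolution(frames, kernel):
    """Slide a window over time and apply the shared kernel at each step.

    frames: list of T input vectors, each of length d_in
    kernel: list of `window` matrices, each d_in x d_out
    Returns T - window + 1 output vectors, each of length d_out.
    """
    window = len(kernel)
    d_out = len(kernel[0][0])
    outputs = []
    for t in range(len(frames) - window + 1):
        out = [0.0] * d_out
        for w in range(window):
            vec, mat = frames[t + w], kernel[w]
            # (1 x d_in) times (d_in x d_out), accumulated over the window
            for i, x in enumerate(vec):
                for j in range(d_out):
                    out[j] += x * mat[i][j]
        outputs.append(out)
    return outputs


def max_over_time(vectors):
    """Keep the largest activation of each dimension, ignoring time."""
    return [max(column) for column in zip(*vectors)]


random.seed(0)
window, d_in, d_out = 3, 16, 8  # sizes from the 1989 example
# Placeholder weights (assumption): a trained network would learn these.
kernel = [[[random.uniform(-1, 1) for _ in range(d_out)]
           for _ in range(d_in)] for _ in range(window)]
n_params = window * d_in * d_out  # 3 * 16 * 8 = 384 shared parameters


def encode(num_frames):
    """Encode a random sequence of the given length into a fixed-size vector."""
    frames = [[random.uniform(-1, 1) for _ in range(d_in)]
              for _ in range(num_frames)]
    return max_over_time(temporal_convolution(frames, kernel))


short_vec = encode(6)    # 6 time steps in
long_vec = encode(15)    # 15 time steps in
print(n_params, len(short_vec), len(long_vec))  # 384 8 8
```

Both sequences come out as vectors of length 8 even though their lengths differ, which is what allows ordinary fully connected layers to be stacked on top.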
I have an entire video on how word2vec and architectures like it work, but essentially we can learn word embeddings with very simple architectures. The other con, that max pooling may be an oversimplifying operation, can be addressed with something called dynamic convolutional neural networks, which we'll take a look at now.

A dynamic convolutional neural network is a type of convolutional neural network whose architecture depends on the length of the input sentence, and so it is dynamic. Let's say we have a similar input sentence, "the cat sat on the red mat". This is a sentence with seven words, and let's say each word is represented by a four-dimensional vector, so we see four dimensions and seven words over here. Now we perform a wide convolution with, say, a kernel of size three. When you perform a wide convolution, as we discussed before, the result has the input length, which is 7, plus the kernel length, which is 3, minus 1. That's 7 plus 3 minus 1, which is 9, and that's why you see nine elements over here. We have two of these because we can apply one kernel to extract one set of features from the input and another kernel, or filter, to extract other kinds of features from the same input. This image shows just two kernels, but we can have many.

From this, we perform a dynamic pooling operation. Before, we would just take the maximum activation across the entire input, regardless of how many elements there were. With dynamic max pooling, we take not only the top one but the top K activations, where K is a function of the input sentence length, which here is seven. You can compute it with this formulation over here, and if you do all of that math you
will see that it turns out to be five. So we take the five largest activations across the time dimension, and that is how we get the next pooling layer. Then we perform a similar convolution and pooling operation, and eventually flatten the result to get a fully connected layer. Once we have a fully connected layer, we can put any traditional neural network layers on top of it.

A cool thing about this dynamic CNN is that the architecture itself is dynamic, so less information is lost, especially for much longer sentences. The con, however, is the core essence of the convolution operation itself: we cannot model long-term dependencies explicitly very well. For example, when we perform a sliding window convolution with a window of three, each of these words really only interacts with the words in its immediate vicinity, because it's limited by the kernel width. Since the width is only three, it only really looks at its neighbors. To look much further you would need to increase the kernel size, but the problem is that the number of parameters to be learned grows with the kernel size (window size times input dimension times output dimension for the temporal convolutions we saw earlier). So it's not a very effective way to learn very long-term dependencies as sentences get longer and longer, or when we're processing paragraphs or essays.

This is largely addressed by more recent technologies: long short-term memory (LSTM) RNNs as well as Transformer neural networks. I've covered both in great detail in their own videos, so I highly recommend checking those out for more information. Essentially, LSTM RNNs preserve some form of memory, so even for later input words in a sentence they still have context from words that occurred much earlier. And more recently we have Transformer neural networks
which use the concept of attention to understand dependencies between words even if those words are far apart. This attention-based architecture is how a lot of today's large language models work, including ChatGPT.

So while the convolution operation is useful in natural language processing, there have definitely been advances that have pushed it to the sidelines in recent times, at least in NLP. That said, I do feel that understanding the convolution operation and its essence in NLP is super important for understanding where we are today and how we got to the large language model world we're in right now.

With that, thank you all so much for watching. I'm going to link the article I've written, which has a lot more detail, in the description below, so please do check it out, and do follow me on Medium if you can. I'll see you in another one. Bye-bye.
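The dynamic k-max pooling step from the dynamic CNN part of the transcript can be sketched as follows. The transcript points to a formula on screen without reading it out, so the rule in `dynamic_k` is an assumption: it is the form I recall from the DCNN paper linked in the description, k = max(k_top, ceil((L - l) / L * s)). The values `total_layers = 3` and `k_top = 3` are hypothetical, chosen only because they reproduce the k = 5 quoted for a seven-word sentence, and the activation values are invented.

```python
def dynamic_k(sentence_len, layer, total_layers, k_top):
    """Assumed pooling rule: k = max(k_top, ceil((L - l) / L * s))."""
    numerator = (total_layers - layer) * sentence_len
    scaled = (numerator + total_layers - 1) // total_layers  # integer ceiling
    return max(k_top, scaled)


def k_max_pooling(activations, k):
    """Keep the k largest activations, preserving their original order."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    keep = sorted(order[:k])
    return [activations[i] for i in keep]


# Hypothetical settings that give the k = 5 mentioned in the transcript.
k = dynamic_k(sentence_len=7, layer=1, total_layers=3, k_top=3)

# One row of a feature map after the wide convolution: 7 + 3 - 1 = 9 values.
feature_row = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.5, 0.4, 0.6]
pooled = k_max_pooling(feature_row, k)

print(k)       # 5
print(pooled)  # [0.9, 0.7, 0.8, 0.5, 0.6]
```

Unlike plain max pooling, which would keep only 0.9, this keeps the five strongest activations in sentence order, so several strong signals survive into the next layer.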

Original Description

Let's talk about how convolution is used in NLP

Medium Article for this video: https://medium.com/p/573d0329cc37#e45e-15748eb5fa7d

SPONSOR
Get 20% off and be a part of a Premium Software Engineering Community for career advice and guidance: https://www.jointaro.com/r/ajayh486/

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

PAPERS
[1 🔎] Phoneme detection paper with Time Delay Neural Networks: https://www.cs.toronto.edu/~fritz/absps/waibelTDNN.pdf
[2 🔎] TDNN rise again in 2011: https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
[3 🔎] A Neural Probabilistic Language Model (Bengio et al., 2003): https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[4 🔎] Dynamic Convolution Neural Networks: https://arxiv.org/pdf/1404.2188.pdf

USEFUL VIDEOS
[1 🔴] Word2Vec video: https://youtu.be/9S0-OC4LFNo
[2 🔴] LSTM Video: https://www.youtube.com/watch?v=QciIcRxJvsM
[3 🔴] Transformer Video: https://www.youtube.com/watch?v=TQQlZhbC5ps
[4 🔴] BERT video: https://www.youtube.com/watch?v=xI0HHN5XKDo
[5 🔴] ChatGPT video: https://youtu.be/NpmnWgQgcsA

PLAYLISTS FROM MY CHANNEL
⭕ Transformers from scratch playlist: https://www.youtube.com/watch?v=QCJQG4DuHT0&list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4
⭕ ChatGPT Playlist of all other videos: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Transformer Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=

This video teaches the application of convolution in NLP, covering its history, implementation, and relevance to LLMs. It provides a comprehensive overview of the concept and its importance in deep learning.

Key Takeaways

Understand the basics of convolution in NLP
Learn about Time Delay Neural Networks and their application in phoneme detection
Study the implementation of Dynamic Convolution Neural Networks
Explore why attention-based models such as Transformers have largely replaced convolution in LLMs

💡 Convolution can be applied effectively to NLP tasks on sequential data, although attention-based Transformers have largely displaced it in today's LLMs.
