Informer attention code - FROM SCRATCH!

CodeEmporium · Advanced ·🔢 Mathematical Foundations ·2y ago


Key Takeaways

The video demonstrates how to code Informer attention from scratch, specifically implementing prob sparse attention and comparing it to time series attention using Python and referencing the Informer2020 repository.

Full Transcript

Greetings, fellow learners. Now, before we get into this world of coding attention, I have a thought-provoking question for you: how important is it to learn to code the architecture of these complex models, and can you give your reasoning as well? Personally, I think that as machine learning engineers we can extract the core ideas and concepts in the architecture and use them elsewhere in the more tenable models that we build. I think looking at the code really brings a concrete picture into your head about the implementation too, and hence I think videos like this that show the code of an entire architecture can be pretty valuable. But that's just my opinion; please let me know yours down in the comments below. I'd love to hear your thoughts.

This video is going to be divided into three main passes: we're going to talk about full attention code in pass one, then ProbSparse attention in pass two, and then end with a time and space complexity analysis. So let's get to it.

This here is the code for the Informer architecture, specifically for attention. We have a class for full attention and then a class for ProbSparse attention. What I'm going to do in this first pass is talk about this entire architecture of attention at three levels: first a mathematical explanation, followed by an architectural explanation, and then we're going to dive into each line of this code in a Colab notebook. So let's get started.

We have full attention here, and it's given by this mathematical representation, where we multiply a query vector by a key vector in order to get the affinity of every time step with every other time step. We then scale it and perform softmax, and then we apply this attention matrix to the value vector in order to get the full attention. There are a lot of terms in there that are probably a little confusing, but I think they'll become clearer as we go on with our explanation. So this is the mathematical explanation.
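As a rough illustration of the formula just described (scores from query times key, scaling by the square root of the query dimension, a row-wise softmax, then a weighted sum of values), here is a minimal single-head sketch. It is an editorial example written in plain Python so that it runs without PyTorch; it is not the notebook code from the video, and the names `full_attention` and `softmax` and the toy inputs are assumptions made for illustration. No padding mask is applied.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability before exponentiating.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(Q, K, V):
    """Single-head attention on nested lists.

    Q is L_Q x d_q, K is L_K x d_q, V is L_K x d_v (hypothetical toy layout).
    """
    d_q = len(Q[0])
    scale = 1.0 / math.sqrt(d_q)
    # scores[i][j]: scaled affinity of query i with key j, an L_Q x L_K matrix.
    scores = [[scale * sum(q * k for q, k in zip(q_i, k_j)) for k_j in K]
              for q_i in Q]
    # Softmax along the last dimension so every row of weights sums to one.
    attn = [softmax(row) for row in scores]
    # Each output row is the attention-weighted average of the value rows.
    d_v = len(V[0])
    out = [[sum(a * v[c] for a, v in zip(a_row, V)) for c in range(d_v)]
           for a_row in attn]
    return attn, out

# Toy inputs: 2 queries, 3 keys/values, query dimension 4, value dimension 2.
Q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
K = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attn, out = full_attention(Q, K, V)
```

The nested loop over every query and every key is what makes both the time and the memory of this step grow with L_Q times L_K.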
Now let's actually walk through the architecture for full self-attention. To start things off, let's say that we have an input with a batch size of 30, with 50 time steps that we're passing in all at the same time, and each of those time steps has a 512-dimensional embedding. We pass this through a feed-forward network in order to create a query vector, a key vector and a value vector for every single time step, and because we have three vectors per time step, the last dimension is 1,536: it's the 512 that we saw over here times three. Now we break these query, key and value tensors into eight attention heads so that we can perform attention eight times in parallel. This 1,536-dimensional tensor is broken up into eight parts, which is why we have a 192-dimensional tensor over here.

Next is the crux of full self-attention, where we take the entire query matrix and apply it to the entire key matrix in order to get a matrix of affinity values for every time step with respect to every other time step. Next we add a padding mask, which is used to make sure that no time step pays attention to the padding tokens, and you add these up in order to get a tensor over here of 30 × 50 × 50.

Next we perform scaling and softmax to get the attention matrix. Scaling is required because, in their current form, the affinities may have very large values, in which case performing a softmax directly can lead to values in the attention matrix that are very close to one or very close to zero. During the backpropagation phase this can lead to very small gradients, which can vanish over time, and when gradients vanish it means the network is not learning. So in order to stabilize training and make sure that the network continues to learn, we perform a scaling operation, and in the math equation we saw that it was simply dividing by the square root of d_q, which
is the dimension of each query vector, not the number of time steps. Next, after performing the softmax operation, we have this attention matrix, and we apply it to the value tensors over here in order to get a new set of value tensors with the attention values embedded in them. Then we concatenate the tensors across all attention heads in order to get our final concatenated tensor, and then we have the rest of the attention block, which we don't really care about at this point.

So now that we have a mathematical understanding of full attention and an architectural understanding of it, let's go through some code to see how we actually code this out. We start by importing some libraries up here, and for simplicity's sake we'll say that the number of attention heads is one; the batch size, instead of being 30, is just one; the sequence length, instead of being 50, is now 10, which is the same as L_Q, L_K and L_V, that is, the sequence lengths of the query, key and value tensors; and d_model is 4, which indicates the number of features per time step. This is the same number of features for the query, key and value, as we only have one attention head. So we initialize our query, key and value tensors to be 1 × 10 × 1 × 4 tensors.

What we want to do now, mathematically and architecturally, as we've seen, is perform a matrix multiplication between the query and the key tensors. Typically you can just do this with torch.matmul; however, I'm going to use a special function called torch.
einsum. It's a special function that can perform addition and multiplication and also a rearrangement of tensors, and I'm using it here because you will see it in the original Informer code. As an overview of what this function does: this corresponds to the shape of the query, this over here corresponds to the shape of the key, and this here corresponds to the shape that we want for our output tensor. So "blhe" for the query indicates B is 1, L is 10, H is 1 and E is 4. Next, for K it's "bshe": B is 1, S is 10, H is 1 and E is 4. If we perform this multiplication we get an output of "bhls", which is going to be 1 × 1 × 10 × 10. This effectively means that for every item in the sequence of size 10, we get affinity values also of size 10. This matrix now looks kind of like this: each element corresponds to the affinity of query i with key j. The matrix multiplication here requires four multiplications along with three additions for every single one of these spots, so this operation is of the order of L_Q times L_K in terms of space complexity as well as time complexity, and so it is quadratic in terms of the input.

Now, this line of code over here is the equivalent of what we could have done instead of using torch.einsum: we could have just used torch.
matmul, where we take the query tensor as well as the key tensor and perform a squeeze operation. The squeeze operation removes all dimensions that have a size of just one, so in this case the 1 × 10 × 1 × 4 tensor becomes 10 × 4 after the squeeze, and we get scores equivalent to what we saw just up here.

Next we perform a scaling of these scores, where d_q is now 4, so the square root of 4 is 2 and hence the scaling factor is 1/2, which is 0.5. Once we've performed the scaling, we apply the softmax operation over here. So we have our scaled scores, we apply the softmax, and dim equals -1 indicates that we want to perform the softmax along the final dimension. In this case we know that this is a tensor of 1 × 1 × 10 × 10, and we want to perform the softmax along that last dimension of size 10, so the sum of all of these values over here is going to be one, the sum of these values is going to be one, and so on, and that's exactly what we see over here.

Next we compute a new value tensor by applying the attention matrix to the original value matrix, and we're using torch.
einsum, so we have a 1 × 1 × 10 × 10 tensor as A, and we apply it to the 1 × 10 × 1 × 4 value tensor in order to get this final tensor. The equivalent of this operation is just a simple matrix multiplication. This operation too is quadratic in terms of the input, because here it takes 10 multiplications and nine additions for every spot in the 10 × 4 matrix, so it's also quadratic in terms of the input sequence length. So overall, the two main matrix operations, multiplying the query with the key and, in this case, multiplying the attention and value tensors, are both quadratic in terms of the input, and this can be problematic for much longer sequences.

Quiz time! Have you been paying attention? Let's quiz you to find out. What is the main issue with using full attention for long sequences? (A) the quadratic space and time complexity of the attention operation with respect to input length; (B) limited flexibility in model architecture for long sequences; (C) a lack of scalability for parallel processing; or (D) insufficient accuracy for long sequences. Comment your answer down below and let's have a discussion. And if at this time you think I deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time and pass one of this explanation, but keep paying attention, because I will be back to quiz you. [Music]

Now, because this entire attention operation is quadratic in terms of the input sequence length, what we do is try to make it more efficient using ProbSparse attention. Like we did in the previous pass, we're going to walk through the mathematical formulation, walk through the entire architecture, and then walk through individual lines of the actual Informer code. So let's start with the mathematical intuition over here. You can see
that this formulation is exactly the same as the formulation that we saw for full attention, except the query is now a Q bar, and this Q bar is a subset of the query vectors. But in order to get this Q bar we actually have to perform some additional operations, which will become more apparent in the architecture and code.

Now in the architecture, let's assume that we're only dealing with a batch size of one, the number of attention heads is one, and we're dealing with a sequence of 10 inputs at a time. So now we have a query tensor of 1 × 10 × 4, and a key tensor and a value tensor which are both of the same shape. We can use this to compute some scalar values called L_Q bar and L_K bar. L_Q bar is given by this formulation, where the logarithm is a natural log; L_Q is 10, so we take the natural log of 10 and perform a ceiling operation on it, which gives three, and we multiply by some constant factor F, which we'll take as two, and you'll see this in the code too. So 3 × 2 is six, and L_Q bar is six. L_K bar over here has a very similar operation, which is 2 times the ceiling of the natural log of L_K; L_K is also 10, so we get the same value of six.

Now why do we compute L_Q bar and L_K bar? L_Q bar is the number of query vectors that will eventually be selected, and L_K bar is the number of key vectors that are sampled right now in order to compute the subset of query vectors; hence we're computing both of them over here. Next we use these values, like I just mentioned, to create this key sample tensor, which is going to be 1 × 10 × 6 × 4; that is, for every single query vector, as you'll see here, we apply a different subset of key vectors. So for every Q we apply the K sample transpose in order to get
this Q K-sample-transpose tensor of 1 × 10 × 1 × 6: every query has been applied to a sampled set of keys. Next we compute a value M. While this text is small over here, it basically says M is equal to the max affinity of Q for any K minus the mean affinity of Q for any K. The idea here is to find the most active queries, and this is done by simply taking this tensor that we had over here: for every single query we determine the maximum affinity that it has for any key and then subtract the mean of all of those affinities. We end up with a 1 × 1 × 10 tensor in which each value says how active query i is. But we only select the top six in this case, because L_Q bar is six, so we end up with a 1 × 1 × 6 tensor in M top. We then select those actual queries, and that's how we get Q bar.

Now we can perform the normal self-attention operation, where we apply both of these in order to get Q bar K. We perform scaling and softmax: as we mentioned before, scaling is required to prevent vanishing gradients and ensure that the network will learn, and softmax is required to get the attention matrix, which we then apply to the value matrix to complete the attention operation itself. Then we update a set of context vectors, which start out as the mean of the value vectors repeated for every time step, and we only update some of these: in this case, out of 10 of them, we update six, which are the active vectors, and the other set keeps the default context vectors, which are the passive vectors. After this phase we perform some distillation, which we took a look at at a very high level in a previous video, but we will continue to dive into that in a
future video.

So now that we've looked at this mathematical operation and at the architecture design for ProbSparse attention, let's seal the deal by looking at individual lines of code. We start off with the same setup: the number of attention heads is one, batch size is one, sequence length is 10, d_model is four, and we generate the query, key and value tensors in much the same way, so we have these 1 × 10 × 1 × 4 tensors. We then compute L_Q bar and L_K bar, which we talked about before: L_Q bar is eventually going to be used to select the subset of query vectors, and L_K bar is more immediately going to be used to select the subset of key vectors. We're using NumPy to do the math over here, and you can see there's a factor of two: it's the factor times the ceiling of the natural log of L_K, which is 10, and this is overall going to be six; similarly, L_Q bar will also be six. Now, for very short sequences we can actually just perform the full self-attention, as it's not going to be too costly; where this logarithmic value and all of these operations come in very useful is for very long sequences. So we use either L_Q bar or L_Q accordingly.

From this point we perform operations that help us select the appropriate query vectors for Q bar, and this involves first computing this K expand value. K expand holds, for every single query, the exact same key matrix, so you can see this entire matrix of 10 × 4 values is repeated 10 times, once per query; each of these rows is a single key, and all of these keys are available to every query. From a code perspective, you'll see unsqueeze being used: unsqueeze is used to add a dimension, whereas expand is used to kind of reshape
these values into batch size × number of heads × L_Q, which is 10, × L_K, which is 10, × E, which is the embedding size of four, and hence we have this tensor. Now, for every single query vector we select L_K bar random key vectors, and we store their indices in the variable index sample. Let's take this first line to illustrate: you can see the first item over here is four. This means that for the first query we select six keys, and the indices of the keys that we select are the fourth key, the seventh key, the second key, the zeroth key, the eighth key and the sixth key. Note that we are allowed to sample these keys with repetition, which is why in the second row you can see that we sampled the ninth key vector twice for the second query.

Once we have these index samples, we can select those key vectors from K expand, and that's what we do over here using index sample. We end up with a torch tensor of 1 × 1 × 10 (for every query) × 6 (for every sampled key) × 4, because each key has four embedding dimensions, and you can see that these values correspond to exactly those keys. For example, in the second row, let's see what it said here: 6, 9, 9, 4, 6, 2. So you can see this was the 6, the 9 and the 9, and it's true that they are repeated; then four, and six again, which you see repeated here and here, and then two. You can parse from this how K sample is going to look.

Now that we have K sample, we perform the matrix multiplication of K sample with Q. We're again using unsqueeze to add dimensions and transpose to flip the last two dimensions, and all of this is just to make sure that the dimensions align so that we can perform the matrix multiplication. We end up with 1 × 1 ×
10 × 6, since for every query we attend to six keys. What's important here is that this operation is not very expensive, in the sense that it's not quadratic in the input: it is of the order of n, which is 10, times log n, which is six. So it is not quadratic and hence can save time for longer sequences; it's more efficient.

Next we compute the queries with the highest affinity, the most active queries, and this is quantified in M. What we do here is take the max affinity of Q for any K minus the mean affinity of Q for any K. The max affinity is computed by this: for each query we take the maximum value across its sampled keys, and then we subtract the mean of those values, or rather the sum divided by L_K, so it's not truly the mean, but it is the mean in a sense. What this looks like is this value: the first value is 1.4107, and you can see how we computed that. Let's look at the max value up here: for this specific query the maximum is 1.7698, and if you compute the sum of all these and divide it by 10, which is L_K, you get 0.3591. If you subtract 1.7698 minus 0.3591 you get 1.
4107, which is exactly the value that you see here, so you can see that it mathematically checks out. What we want to do next is extract the indices of the queries with the highest values. In this case we extract the top L_Q bar queries, and these are the indices of the queries with the largest values, so we end up with M top as 1 × 1 × 6. But we want the actual query vectors, so we extract those most active queries and call this Q bar; these are the six most active queries.

Next we take these active queries in Q bar and multiply them with the entire key matrix, as you see in the mathematical formulation above, in order to get a matrix of affinity values. This operation is also interesting, because now we are applying a matrix multiplication which is not quadratic: it's of the order of n, which is the number of keys, times log n, which is the number of queries in Q bar. So it's not order of n squared, it's order of n log n, and this is a more efficient matrix multiplication even for longer sequential inputs. We then apply a scaling operation to make sure that we don't see vanishing gradients and that training is stable, and then we compute the attention matrix such that the sum of all the values in a row is one.

Now what we want to do is compute the context vectors. First we take our value tensor; this is the original value tensor from when we computed the query, key and value way back in the beginning. We then compute the mean of all these values across the last-but-one dimension, which is this dimension of size 10. So we compute the mean of all the values in this column, the mean of all the values in this column, the mean of all the values in this column and the mean of all the values in this column, and you get this tensor over here of just four values. We simply repeat this 10 times in order to get what is
going to be our context value tensor. Only a subset of these are actually going to be updated, though, because those are all the active query vectors that we have. Which ones are updated? The ones defined in M top, the top six active queries. You can see that the ones that are updated are index 1, index 3, index 4, index 6, index 7 and index 9, which matches exactly what M top says: you see that 1, 3, 4, 6, 7 and 9 are here, and we updated them accordingly with the attention-times-value results. So overall this has six active vectors and four passive vectors, and this is taken as the output, which we then pass into the distillation operation that happens in the next phase; we're going to discuss that in a future video.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. How does ProbSparse attention help over full self-attention? (A) it ensures the context vectors are overridden; (B) it never multiplies the full query and full key matrices together; (C) it uses a distillation strategy to reduce queries and keys; or (D) it uses batch normalization and pooling to select active queries. Comment your answer down below and let's have a discussion. That's going to do it for quiz time and pass two of the explanation, but do keep paying attention, because I will be back to quiz you.

In the last two passes we took a look at the mathematical representation, the architecture and the code for both full self-attention and ProbSparse attention. During those passes we illustrated some complexity analysis along the way, but let's formalize it here in this final pass to truly show how and why ProbSparse attention can be more efficient than full self-attention for longer sequences. So let's get to it. Coming into this full self-attention piece, we have the inputs passed in here,
we have query, key and value tensors generated, and for every attention head we compute the query times the key tensor. This operation is the full input crossed with the full input, so it is of the order of L_Q, the length of the query tensor, which is the sequence length, times L_K, the length of the key tensor, which is also the sequence length. Next we add the padding mask, which, because there are L_Q × L_K values, is of order L_Q × L_K. The softmax operation, for the same reason, because we have L_Q × L_K values, is of order L_Q · L_K. Then we have the attention matrix too, where the operation involves multiplying an L_Q × L_K matrix by a matrix of L_K × some other dimension, which in this case is d_model or d_V, and if we just consider the input lengths it's an operation of order O(L_Q × L_K). So if you add all of these up for the entire full attention operation, for space complexity and time complexity you get order of 4 L_Q L_K, and we already know that L_Q and L_K are both the sequence length L, so the time complexity and space complexity are O(L²). It's quadratic in terms of the input, which can make it problematic for much longer sequences.

Now, in order to deal with this, in comes ProbSparse attention, so let's see how this math works out. We have the query, key and value tensors here; we sample the key tensors and then multiply them with the query tensors. This entire operation is more efficient, as we have L_Q for the query tensor but only L_K bar for the key tensors, and hence the time complexity and space complexity are O(L_Q L_K bar). Next we compute which queries are going to be the most
active queries. This operation requires us to compute the maximum affinity of Q for any K minus the mean affinity of Q for any K. If you look at how that's done, the maximum affinity involves taking the max of all the values in a specific row, but there are only L_K bar values, and the mean affinity also involves taking the mean across L_K bar values, and hence we get O(L_Q × L_K bar) as the total time and space complexity of the operation. Next, having selected the most active queries, we perform the query selection to get Q bar. There are L_Q bar queries that we're applying to L_K keys, and hence O(L_Q bar · L_K) is the space and time complexity. It's the same for scaling and softmax, because we have the same number of values, and we also need to apply the attention to the value vectors, so for the selected L_Q bar queries we have L_K values, or L_V values in this case, but L_V equals L_K. Next we use these active vectors to update the context vectors, and that's a simple operation of the order of the number of keys.

Now if we take all of these terms and add them up for the complexity analysis, you get 2 L_Q L_K bar + 4 L_Q bar L_K + L_K. L is the sequence length, and L_Q bar and L_K bar are of the order of log L, so if you plug those values in you get the total time and space complexity to be O(L log L), which is more efficient than the quadratic complexity that we saw for full self-attention. Hence ProbSparse attention can be a more efficient way to compute attention for much longer sequences, and I hope that's pretty clear here.

Quiz time! This is going to be a fun one. How do time and space complexity change with respect to input sequence length from full self-attention to prob
sparse attention? (A) order of n² to order of log n; (B) order of n² to order of n log n; (C) order of n log n to order of n²; or (D) order of log n to order of n². Comment your answer down below and let's have a discussion. And if at this time you think I deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time and pass three of this explanation, but before we go, let's generate a summary. [Music]

The traditional Transformer uses self-attention, but this can be inefficient in time and space complexity for long sequences. This is because the core of the attention operation requires applying every time step to every other, making the operation quadratic in space and time complexity with respect to the input sequence length. To deal with this, the Informer uses ProbSparse attention, and as we have seen in code, ProbSparse attention can efficiently perform the attention operation in O(n log n) time and space. And that's all that we have for today. To understand more about the Informer architecture itself, I have a video right over here that you can click on to check out. Thank you all so much for watching, and if you liked what you saw today and you think I deserve it, please do consider giving this video a like, subscribe for more videos, and I will see you in the next one. Bye-bye!
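The ProbSparse steps walked through in the transcript (sample a few keys per query, score each query by max minus sum divided by L_K, keep the top queries, run ordinary attention only for those, and leave every other position at the mean of the values) can be sketched roughly as follows. This is an editorial illustration in plain Python for a single batch and a single head, not the Informer2020 code or the notebook from the video. The function name `prob_sparse_attention`, the `seed` argument and the random toy data are assumptions made for the example; `factor=2` follows the value quoted in the video.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def prob_sparse_attention(Q, K, V, factor=2, seed=0):
    """Hypothetical single-head sketch. Q, K are L x d lists, V is L x d_v."""
    rng = random.Random(seed)
    L_Q, L_K, d_q, d_v = len(Q), len(K), len(Q[0]), len(V[0])
    # factor * ceil(ln L), capped at the sequence length for short inputs.
    n_sampled_keys = min(factor * math.ceil(math.log(L_K)), L_K)
    n_active = min(factor * math.ceil(math.log(L_Q)), L_Q)

    # Step 1: score every query against randomly sampled keys (with repetition).
    M = []
    for q in Q:
        sampled = [dot(q, K[rng.randrange(L_K)]) for _ in range(n_sampled_keys)]
        # Max minus (sum / L_K), the "mean in a sense" from the walkthrough.
        M.append(max(sampled) - sum(sampled) / L_K)

    # Step 2: indices of the most active queries.
    active = sorted(range(L_Q), key=lambda i: M[i], reverse=True)[:n_active]

    # Step 3: default (passive) context is the mean value vector, repeated.
    mean_v = [sum(v[c] for v in V) / L_K for c in range(d_v)]
    context = [list(mean_v) for _ in range(L_Q)]

    # Step 4: ordinary scaled softmax attention, but only for active queries.
    scale = 1.0 / math.sqrt(d_q)
    for i in active:
        weights = softmax([scale * dot(Q[i], k) for k in K])
        context[i] = [sum(w * v[c] for w, v in zip(weights, V))
                      for c in range(d_v)]
    return context, sorted(active)

# Toy data matching the video's sizes: sequence length 10, d_model 4.
data_rng = random.Random(42)
L, d = 10, 4
Q, K, V = ([[data_rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
           for _ in range(3))
context, active = prob_sparse_attention(Q, K, V)
print(len(active), "of", L, "queries were treated as active:", active)
```

With L = 10 and a factor of 2, both counts come out as 2 × ceil(ln 10) = 6, so six rows of `context` hold attention outputs and the remaining four keep the mean value vector. The two costly loops run over L × 6 and 6 × L pairs rather than L × L, which is where the O(L log L) behaviour discussed in pass three comes from.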

Original Description

In this video, we code the prob sparse attention and compare it to time series attention

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

RESOURCES
[1] Main repo: https://github.com/zhouhaoyi/Informer2020/blob/main/models/attn.py
[2] Code for the colab notebook: https://github.com/ajhalthor/Informer/tree/main

PLAYLISTS FROM MY CHANNEL
⭕ Deep Learning 101: https://www.youtube.com/playlist?list=PLTl9hO2Oobd_NwyY_PeSYrYfsvHZnHGPU
⭕ Natural Language Processing 101: https://www.youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ Reinforcement Learning 101: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
⭕ Transformers from Scratch: https://www.youtube.com/playlist?list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability

OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i384100.net/python
📕 MLOps Course: https://imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): https://imp.i384100.net/NLP
📕 Machine Learning in Production: https://imp.i384100.net/MLProduction
📕 Data Science Specialization: https://imp.i384100.net/DataScience
📕 Tensorflow: https://imp.i384100.net/Tensorflow

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

This video teaches how to implement Informer attention from scratch: it walks through the full attention code, then the prob sparse attention code, and closes with a time and space complexity comparison of the two. It requires a basic understanding of deep learning and natural language processing concepts.

Key Takeaways

Clone the Informer2020 repository
Implement full attention
Implement prob sparse attention
Compare the time and space complexity of both attention mechanisms
Test and refine the implementation
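
As a starting point for the steps above, here is a minimal sketch of full attention as the transcript describes it: query-key affinities, scaling, softmax, then applying the attention matrix to the values. It uses NumPy rather than the repository's PyTorch code, works on a single head with no batch dimension, and the names `softmax` and `full_attention` are illustrative choices, not taken from the video or the repo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Scaled dot-product attention for one head (assumed shapes: (L, d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # affinity of every time step with every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy input: 6 time steps, 4 features per step (arbitrary sizes).
rng = np.random.default_rng(0)
L, d = 6, 4
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out, weights = full_attention(Q, K, V)
print(out.shape, weights.shape)
```

The score matrix has one entry per pair of time steps, which is where the quadratic time and memory cost of full attention comes from.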

💡 Informer attention can be implemented from scratch as either full attention or prob sparse attention, and the choice between them affects the time and memory cost of the model.
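
The prob sparse idea can be sketched in a few lines. This is a simplified, single-head NumPy version written for illustration, not the repository's code: it scores each query against a random sample of keys, keeps the top-u queries by a max-minus-mean sparsity measure, runs ordinary attention only for those, and gives every other query the mean of the values. The function name, the `factor=5` default, the use of `ceil(ln L)`, and the omission of masking are all assumptions of this sketch, and it expects sequence lengths greater than 1.

```python
import math
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prob_sparse_attention(Q, K, V, factor=5, rng=None):
    """Simplified prob sparse self-attention for one head (shapes: (L, d))."""
    rng = np.random.default_rng() if rng is None else rng
    L_q, d = Q.shape
    L_k = K.shape[0]
    n_sample = min(L_k, factor * math.ceil(math.log(L_k)))  # keys sampled per query
    u = min(L_q, factor * math.ceil(math.log(L_q)))         # queries kept "active"

    # Score every query against its own random subset of keys.
    idx = rng.integers(0, L_k, size=(L_q, n_sample))
    sampled = np.einsum("qd,qsd->qs", Q, K[idx])             # (L_q, n_sample)

    # Sparsity measure: a peaked score row (max far above the mean) marks
    # a query whose attention is far from uniform.
    M = sampled.max(axis=-1) - sampled.sum(axis=-1) / L_k
    top = np.argsort(-M)[:u]

    # Non-selected queries fall back to the mean of V; selected ones get
    # ordinary scaled dot-product attention over all keys.
    out = np.tile(V.mean(axis=0), (L_q, 1))
    out[top] = softmax(Q[top] @ K.T / np.sqrt(d)) @ V
    return out, top

rng = np.random.default_rng(0)
L, d = 32, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out, top = prob_sparse_attention(Q, K, V, rng=rng)
print(out.shape, len(top))
```

Because only about `factor * ln(L)` queries attend over all keys, the score computation grows roughly as L log L instead of L squared, which is the trade-off the complexity comparison is about.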
