Dive into Deep Learning (Study Group): Linear Neural Networks | Session 3
Key Takeaways
This video covers linear neural networks, including linear regression and softmax regression, using tools like PyTorch, TensorFlow, and MXNet, and discusses concepts such as computation graphs, loss functions, and optimization procedures.
Full Transcript
The idea for today is to go through a couple of slides discussing chapter three, okay, and just give an idea of what linear neural networks are and some of the most important components. Oh wow, so loud outside. Just give me a second, folks, let me try to see what's going on. Sorry, that's kind of a little accident outside, but I think everything is okay. All right, so let's continue. This is where we are: we are on chapter three, linear neural networks. We'll start by talking about some basic components (let me just make the presentation full screen), and then later on we're going to move into multilayer perceptrons, where it starts to get really interesting and you start to see things that are more common these days in deep learning. So, the agenda for today. I'm going to do a brief summary of the chapter. Chapter 3 discusses linear regression and softmax regression; these are the main methods for implementing very basic neural networks, linear neural networks in particular. I also have a code session. Most of it is going to be about code, and for those of you who are starting with PyTorch, I'll try to give an idea of how PyTorch works and how to actually build a model. Of course everything is adapted from the book, but I'm going to explain how that code works. I'm going to talk about the implementation from scratch, which is great because this is how you begin to understand how to build neural network models, and you'll start to understand this idea of the computation graph and how it makes sense when you're training things. And then the concise implementation is what you see today when you look at an implementation, say a paper implementation: you see concise implementations, which is
leveraging the high-level functionality of these different toolkits like PyTorch, TensorFlow, or MXNet. So you don't actually see all the details of what's going on; you only see the high-level use of, say, linear layers and nonlinear transformations and these sorts of things. And softmax regression, of course, is more for classification problems. We're going to talk a little bit about softmax, what it is and what it achieves, and in the coding session I'll introduce the image classification dataset. This is where we usually start. I actually wanted to do a different thing, more like an NLP use case, but I didn't get enough time to do that. But I think the image classification dataset is quite good. It's using Fashion-MNIST, the fashion dataset, which is really great to start with and to begin to understand some of the functionality in PyTorch and how to build models. Then there's an implementation from scratch as well, and then there's a concise implementation. I'm not sure if there's actually a concise implementation, but I think there is an implementation from scratch; I don't remember quite well if I put that code in there, but I think it'll be enough for today. So, linear regression. Regression is what's used for modeling the relationship between one or more independent variables (that should read "one or more" on the slide) and a dependent variable. What does that mean? As an example, imagine there are housing prices that you want to predict. What would you use as your independent variables? You would use, say, the size of a property or something like that. The location could also be important for predicting those prices, and there are a lot of other things, like how many rooms, and so forth. So those are the independent variables. The dependent variable is the price itself, and you can see some examples here. But the idea of this
linear regression is to train a model that's able to predict that numerical value. This is a continuous value, of course, and there's no limit to what this value can be, depending on the use case. You have many examples here: predicting prices, predicting length of stay, say, in a hospital, and demand forecasting. These are the examples given in the book, so those are the ones I use here, but there are so many other examples of linear regression. One common confusion is that people who are getting started with machine learning sometimes confuse it with classification. These are two different things. With linear regression we're interested in predicting a numerical value. With classification we want to make a hard assignment, for instance (sometimes there are cases of soft assignment as well). Hard assignment means maybe you want to label a picture as a cat or a dog, or you want to identify a tweet as positive or negative, or something like that. That's what we refer to as hard assignment. Later I'm going to discuss what the idea of classification is; we're going to start with linear regression. So classification aims to predict one of a set of categories. Okay, so these are very rough notes I put together here. I just wanted to highlight some of the most important points in the chapter. There's a lot of content there. In my opinion it's very beginner friendly. There are a lot of details, unnecessary details in some sense, but I think that's great, because if you really want to dive deeper it gives you some ideas on where to search. Take, for instance, the maximum likelihood estimate: what is this about, how do we use those principles when we're designing neural networks, what's
the role of this principle? There's some guidance there; there are even some links where you can go and read and try to understand where this concept comes from. And if you go into books, there are all types of derivations, all types of equations that are adopted in machine learning, and this idea of probabilities and why that makes sense when you're training neural networks. I think the book does a really wonderful job there. I think the purpose was to give you an idea of what's going on and where to look for more content. They try to keep it very minimal in that sense, but they still give you that little hint on where to look for things. So what's the idea of linear regression, and what are the basic elements? There's always an assumption when you're working in the context of machine learning. You're building this model, and you want to establish some linear relationship between the independent variables x and the dependent variable y. You start to introduce some notation; in the next slide I'll show you a bit of the notation that's used in the book. I like that it's very consistent, and typically this is the notation that you would see in papers and so forth. And y can be expressed as a weighted sum of the elements. Remember the weighted sum: I don't know if you got a chance to look at it more, but we discussed this operation when we covered the preliminaries, the dot product. That's the weighted sum; that's how you get the weighted sum of the elements, and that's how this y will be expressed, given some noise on the observations. What is this noise about? The assumption we're making is that it follows some Gaussian distribution.
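As a rough illustration of the weighted-sum idea just described, here is a minimal sketch in plain Python (rather than PyTorch, so it runs without extra dependencies). The `predict` helper, the parameter values, the feature values, and the noise scale are all made up for illustration and are not taken from the session.

```python
import random

def predict(x, w, b):
    """Weighted sum of the features plus a bias, i.e. the dot product w . x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

# Hypothetical parameters: one weight per feature and a bias.
w = [2.0, -3.4]
b = 4.2

# One made-up example with two features.
x = [1.5, 0.5]

random.seed(0)
clean = predict(x, w, b)                  # target without noise
noisy = clean + random.gauss(0.0, 0.01)   # observed target, assuming Gaussian noise

print(clean, noisy)
```

The same `predict` function works for any number of features, as long as `w` and `x` have the same length.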
That's the idea, and that's something that we use and leverage on the observed data. Based on this assumption we can come up with all the equations we're going to use, say, to come up with our loss function, to decide how we output probabilities, and all these different things. So we always start with this fundamental thing, this assumption. You can read more in the chapter about the Gaussian distribution and how it comes into play. I like that some equations are given, but I think you can still go further in trying to understand it. I think the book doesn't really provide a link for that, but maybe I'll try to come up with some links to give an idea of what we actually use from this in machine learning, and deep learning specifically. I'm going to try to work on that, because I think it's really important if you want to gain more intuition. But if you just want to apply it, I think up to this point you don't really need to go further. So the example here is that we would like to estimate the price... oh sorry, I'm just checking that everything is still working; my computer has been giving me a lot of trouble these days. But yeah, everything is okay. So the example is that we'd like to estimate the price of houses based on area and age. You can see the table there: we have area and age, which we sometimes refer to as features (the same thing as independent variables), and then your dependent variable is the price. That's a typical linear regression problem, a very classical example. The examples I provide here are just to give an idea of how this data would look, but these are actual real examples, and we want to develop a predictive model for predicting those house prices. So what we're interested in is developing these
predictive models. What concepts do we need, and what are the components and pieces that we need to develop this predictive model? That's what we're going to try to answer here with the slides and also the coding session. So what do we need? We need, of course, the training dataset, the training set. It could be any dataset you're interested in, for whatever problem you're working on. Typically you have rows, and rows are referred to as examples, data points, instances, and samples. You see all these different terms; we just don't have a standard way to refer to these things. People who have a background in statistics may refer to them as something else, so it depends on where your background comes from. You always have people referring to rows by different names, but they're referring to the same thing: data points, instances. And you see that books aren't really consistent either; they have different ways of referring to rows. The dependent variable is called the label or target. That's what you're targeting; this is why you want to build a predictive model, to do some inference and predict the label or target. And the independent variables, as we said, are called features, or covariates in some cases. If you come from statistics you may know what covariates are, but if you're just beginning to learn machine learning, most of the time you will hear people refer to them as features. All right, some notation here, very simple. n denotes the number of examples; I think that's quite obvious. Then, how do you index into a particular data example? Maybe this is not so important, but if you want to understand most of the equations in the book, you will probably need to understand this at the beginning. So the superscript is
just indexing into a data example. What you see here is that you have the different features. This is actually a vector, written in bold: x on the left, and here you have the different features, like in an array. Then you have this transpose, and that's just formatting this particular feature vector in a columnar sense, putting it in a column. That's kind of the ideal way to feed this data into a model, so that's why you see this notation here. The corresponding labels are just the y. Notice how the y also has the superscript, and that corresponds to an example as well, or a sample or data point, whatever you want to call it. And what's a linear model? I'm just picking the bits that were important from the book, and I think this is where they start to discuss the linear model. Again, remember the linearity assumption: we have this target, and the target is obtained from a weighted sum of the features. The features, you notice here, are area and age. You can see that we have these w's here, and these are just the weights, and we also have a b, which is the bias. If you look into some other books, maybe the deep learning book (I don't remember all the authors' names, but I know in Goodfellow's one), this stuff is discussed more in depth. This is a great reference, and I provided that link in our repo; you can read more about what this actually means. For the weight, the basic idea is: what's the influence of this particular feature when making that prediction? How important is this feature for getting a proper prediction? And the same thing with the bias. The bias is a little bit different. Imagine that in some cases, which might be kind of rare, as the book explains, there will be an output of zero, and you don't want that. Maybe you
could have some kind of random value here, or something like that, that you can feed to this b, and this becomes your bias term as opposed to just leaving it as zero. That's the idea of the bias. And all of what I'm talking about here, this operation that's going on right here, is also referred to as an affine transformation of the input features. Okay, so what's the goal now? You want to choose the weights and bias; these are your parameters. You want to estimate these parameters such that, on average, the predictions of the model fit the true prices observed in the data. So we're trying to create an estimate: we have our observed data and we want to fit it. We're going to talk more about this in the coding exercise today, and about what this actually means in code, because that's the important thing, I think. And again, this depends on the affine transformation that we discussed in the previous slide, which is specified by the chosen weights and bias. Weights and bias are very important. So here we have the weights, the features, and the bias as well, which are used to make some prediction. The question is, how do we determine good weights here? That's the whole idea of neural networks in this book: it's going to give you a method to come up with these parameters and learn them. Then the compact form; the book introduces this one here. You go from the previous form to this one because you vectorize things, so it will look a little bit different and the notation changes a bit. Notice the x here is a vector, and then we have w represented as a vector as well, and we have a transpose too. So there's a little bit of detail here which is important when implementing, but these are all there to take
advantage of vectorization. From a mathematical standpoint you don't really need to worry too much about the notation here, but when you're implementing this you have to make sure that you put the right matrix in the right place, otherwise PyTorch is going to complain. That's why it's important to understand this. Now, this one in particular is just talking about a single example, but you want to work on an entire dataset, so it's going to look something like this. For the features of the dataset, X represents the matrix that contains all the features of all the examples in the dataset. Notice that, well, not the notation but the formulation changes a little bit here. We're doing the same thing, but it changes a bit, and we take advantage of vectorization to do that. So in the end the goal is to find the w and b, those parameters, such that for a new example and its label the model makes the prediction that produces the lowest error. Here I start to introduce this idea of lowest error. What does this mean? We're going to talk about that shortly. So we have the formulation over an entire dataset and we know how it looks. We briefly mentioned the linearity assumption as well; that's what we're using to create this predictive model, and we're depending on it. But we need a way to measure the quality of this model, to measure the goodness or badness of the model, and we also need a procedure to update and improve the model quality, because this is going to be a learning procedure. So how do we achieve these two things? We need a loss function and we need an optimizer. You may have heard about these two, and that's the role they play: one for measuring the quality, and one to have a procedure, a really reliable
procedure, to update those parameters and try to improve the outputs of those models: loss function and optimizer. So let's talk a little bit about the loss function. We're going to get into more detail as we run through the code, but this slide is just giving you a basic idea of what it is. The idea is that this objective is a function that quantifies the difference between the real and predicted values. You have your observed data, and then you have whatever the model predicts, and you want to quantify that difference. You can see here in the figure I provide that you have this straight line going across, and there are some observations and also some estimates, the estimate being whatever the model learned. How far that is from the observation is going to be the difference. How do you calculate this and how do you quantify it? That's what you're interested in. And the idea of the loss function is that the smaller the value of the loss, the better. One of the popular loss functions, in particular for linear regression, is the squared error. I will talk about how you implement the squared error. It's quite simple, actually: it's just taking the estimate and the observation, calculating the difference, and squaring that. You can see the equation here, and you even have this extra term here. We haven't spoken about derivatives yet; that's a topic that maybe we're going to cover later on. But it will be really important to understand derivatives, because that's going to be the assignment after this particular session. You have this squared term here, and this one-half here in the
component. This term actually cancels out when you take the derivative. Once we've discussed derivatives a little bit you'll start to see what role this plays and why it's important to understand it, because if you don't understand derivatives, at least the basic rules, you won't know what the hell this is or why we even have this term here. But this term is just for convenience. What you see in the brackets, we know what that is, that's the difference, but that term cancels out when you take the derivative. That's why it's just for convenience. You see this happen a lot when you look at these loss functions: there's often some term that is there to be canceled. Okay, so here we're also showing the average loss on the entire dataset. You can see everything put together. It looks like a really long thing that is confusing, but actually it's not. You can see here how we start from the predicted value, which is y-hat here, and we're doing it for each sample. Here we have i equals 1; that's for each example, and n is the number of examples. You have the summation of that, and so you have the average loss on the entire dataset. The average we get from the one over n, because n is the number of examples. So you can see y-hat refers to this, with the linearity assumption that we discussed, and then we have this y here as well, which is just the actual data, the real value. So you see the terms here again. And when we take the derivative of this, remember, last week we discussed a little bit why we take the derivative and why that's important. When you learn, you're learning the parameters, and you need to take a derivative of this loss with respect to the parameters, and that's where you
start to pay attention to those little terms over there. Okay, let's see what else I wanted to say here. Yes, I was just discussing the notation a little bit, trying to explain that you should pay attention to it: when you see this, this is the way to understand it, what the superscript means, and so forth. And this is very consistent across the book, as I've noticed. All right, so we have our loss function, and what we're doing is training this model: we're looking for those weights and bias. We're going to talk shortly about how you actually look for those weights and bias and what kind of principles you use. But the idea is that when you look for those optimal parameters, you try to minimize the total loss on all the training examples, and that's just what you see here in the equation. Don't worry too much if you don't understand the notation here, it's quite okay. This is just representing what we had here in a different, more compact way. Let's see, optimization. What is optimization about? We need an algorithm, we need a procedure. We have our loss function, we have defined something reliable for this particular model, but we need some optimization procedure. This is when we start to talk about optimizers, the gradient descent algorithm, and variations of the gradient descent algorithm as well. There are many books out there that discuss the benefits and advantages and where we needed to introduce better algorithms for this. The good thing is that most libraries today, like TensorFlow and PyTorch, have implementations of the latest optimizers. This is very important: you always want optimizers that are fast, that improve the speed at which the model is trained, right.
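To make the loss and optimization pieces discussed so far more concrete, here is a minimal sketch of full-batch gradient descent on the average squared loss, for a model with a single feature. It is written in plain Python rather than PyTorch, and the function names, the learning rate, the number of steps, and the toy data are all assumptions chosen for illustration; it is not the code from the book or the session.

```python
def squared_loss(y_hat, y):
    """Squared error with the conventional 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2

def average_loss(data, w, b):
    """Average squared loss over all (x, y) pairs in the dataset."""
    return sum(squared_loss(w * x + b, y) for x, y in data) / len(data)

def gradient_descent(data, lr=0.1, steps=1000):
    """Fit y ~ w * x + b by repeatedly stepping against the gradient."""
    w, b = 0.0, 0.0  # zero initialization keeps this sketch deterministic
    n = len(data)
    for _ in range(steps):
        # Derivative of 0.5 * (w*x + b - y)**2: the 1/2 cancels the 2
        # that comes down from the square.
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        # Move the parameters in the direction of the negative gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up, noiseless data generated from y = 2x + 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
w, b = gradient_descent(data)
print(w, b, average_loss(data, w, b))
```

Because every step uses all the examples, each update here costs one full pass over the data.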
Efficiency, in other words. So this is a very important part, a very important component, of training neural networks. I'm not going to go through all the details on the slide; I'm going to share this slide later on and you can take a look at it. This is my summary of what the optimization procedure is all about and what it does. But one important thing that I wanted to emphasize is that with the gradient descent algorithm, to actually use it, we have to apply this operation on the entire dataset, which could be really expensive. That's why we introduced this idea of minibatch stochastic gradient descent. When you use the entire dataset, you could get the results that you want in terms of the parameters that you're learning, which is good, but in terms of how fast it is, it's not the most convenient. So we needed something better than that, and that's why we introduced the minibatch version, which looks at minibatches and takes faster steps. But obviously this one has some drawbacks as well; you can read books that explain what the drawbacks are. There are so many other variations as well, and we won't be able to cover all of them, but they achieve basically the same thing. It's just that one of them is more efficient than the other, and depending on the setup that you have, you may want to consider one over the other. I shared a link last time on optimizers; someone wrote a blog post about it. Take a look at that reference; it discusses all of these things, with the advantages and disadvantages. All right, so we have the procedure, we have our loss function, and we discussed the linearity assumption. This slide discusses a little bit more about the minibatch and how it works. So in the end our
objective is to have this predictive model that does well on the predictions. In this process we're looking for those optimal parameters, and in order to get them we have to do what's called backpropagation. With backpropagation we start to learn a little bit about partial derivatives and what role they play in this learning procedure. I just pulled this equation directly from the book, and this is how the parameter update is done using minibatch stochastic gradient descent. I put a few notes here on what the terms are. You have to go into the book to get the full explanation, but I just wanted to highlight what this equation in particular is doing. We have the learning rate; the learning rate determines how fast this algorithm should learn, it determines the rate. And that's the partial derivative over there. We explained the loss function previously as well, and you can see it there, and we're going to calculate the partial derivatives of that loss function with respect to those parameters. Then we also have the minibatch size, which becomes kind of a normalization factor here. Once you have that, this is what you use to update those parameters: you take your old parameters, you subtract that term, and you get your new parameters, and you keep improving them through this procedure. This is what the algorithm does, briefly: initialize the model parameters, which is typically done randomly; then sample random minibatches; and then update the parameters in the direction of the negative gradient. And if you really don't understand this statement
here, "update parameters in the direction of the negative gradient", that's actually going to be the assignment. One thing I would suggest: I could talk about this and give you examples of how it works, but I think what could be really useful, as a research exercise for you, is to go online and research it. There are many people who have implemented that backward function in PyTorch manually. I've done that as well; I've provided some notebooks for that in the past. It's quite important to understand how to implement it manually, rather than just using the backward function that's provided by PyTorch, to understand what it is, because it's going to show you how to come up with the gradients and how to do that update on the parameters, and so forth. So use those equations to guide you. But I think you need to do extra research, so I really want to encourage you to do additional reading and try to implement that backward function in PyTorch. I'll show you later what I mean by that, but don't worry if you didn't get it now. There's a lot of text here; we'll not go through this one. I just wanted to spend some time on the slides, but all this stuff is written in the book; I just want to highlight some important points. Of course, the learning rate is a hyperparameter, and there are more hyperparameters, which we'll discuss. But the most important part is that if you look into the book, it explains a little bit about the maximum likelihood estimation principle and how we use it to get to something that's called the negative log-likelihood, which is our loss function. That's what we're going to
use, and it's a very important part of training neural networks and coming up with good predictive models. You really need a good loss function that measures the quality of that model. Maybe the book doesn't explain too much about the principles here, but if you open a statistics or probability book, this topic is discussed, and those books go into really great detail on what this is. This is a very important part of building predictive models. They explain the principles and how we get the negative log-likelihood, and that's what we want to minimize at the end of the day. Take a look into that; it's a very important topic. And if you have any questions, if you're confused about anything, just let us know. Okay, this slide is showing you a comparison that a lot of people make. This one is just a linear neural network: you can see one neuron there as the output, so that's a one-neuron layer. And then on the right you have the biological neuron and what it looks like. Personally, I don't really like this analogy. I don't think this analogy has anything to do with neural networks; I think it distracts from what neural networks actually do and what they're good at, so that's why I personally don't like it. But the book offers a little bit of context on why some people, at the beginning at least, decided to make that comparison and try to hype things up. It's your choice if you really want to believe it, but at the end of the day I think it doesn't really help the discussion or help people understand these things. It actually confuses people, from what I've seen. Okay, and classification. So we did
linear regression, and now classification. I just have a couple more slides, so bear with me here. In classification we want to predict one of a set of categories, and it's either a hard assignment or a soft assignment. A hard assignment is just the class. With a soft assignment, in some cases you're really more interested in the probabilities, and we'll talk about how you come up with those output probabilities using something like a softmax function. Those are the two variations that you have. The goal is a model that estimates the conditional probabilities for all the possible classes, so the model should be able to produce multiple outputs, at least one per class. And the ultimate goal, here at the bottom, is to get the optimal parameters: the ones that produce the probabilities that maximize the likelihood of the observed data. That's the idea. The rest of this is repetition, more like a compact summary of everything I've discussed in these slides. One approach, of course, is softmax regression, and we'll touch on that in the code. So, one or two slides here on softmax regression. It supports classification, binary and multi-class as well. The main thing is the softmax function, which has some really nice properties and meets some desired criteria. When you get the outputs directly from the linearity assumption, the wx plus b that you saw earlier, those give you the logits. But those values don't come with the qualities that we want, so we need something else: we need to transform them into output probabilities. So we have this idea that we want to meet
these desired criteria. We want the outputs to be non-negative, because we like non-negative values when we're working with predictive models. The way we get this non-negativity, if you look at the softmax function, is by exponentiating the values. Once you exponentiate, you get that non-negativity, and that's why you see the exponential in this function. It also has to be differentiable, obviously, because we want to learn parameters: we need a model that is able to propagate the result that we get and learn new parameters using the procedure we discussed earlier. Some functions are not differentiable, but this one has that desired property. And if you're building a predictive model, you're really interested in those probabilities, and you want them to follow a proper distribution. In this case you want them to sum to one, so that they at least follow the axioms of probability, which we discussed in the previous lesson. That's a very important thing as well. So we have these nice criteria that we're meeting with the softmax function. Originally the softmax function wasn't used in this context. It just happens to have these really nice qualities, so it was adopted for neural networks. Okay. Another very important thing is the cross-entropy loss. We had the loss for linear regression. For classification problems in general, it's very common to use a cross-entropy loss function. It's the same idea: we want
to measure the difference between two probability distributions, and that gives you a measure of how well the model is doing on the particular task. That's what this slide is explaining. If you really want to understand more, I would say at least take a look at that segment on information theory. Don't skip it, even though it's not that closely connected with deep learning. Just understanding that bit of information theory, how the cross-entropy loss is related to it and what it actually measures, I think is quite important. So the assignment is going to be this one, for those of you who are working on the certification. You will have a backward function, and I'll show you which one it is, that you have to work on and implement, at least for the linear regression part. You don't need to do it for the softmax one. For linear regression I think it should be easy to implement, but the idea is to encourage you to actually do research. When I was learning about these things, I remember that was the first assignment: go and try to implement backward. I was so upset. I didn't even understand the equations at the beginning or what they were doing. But by going through each step, trying to understand it and trying to code it myself, I got a better intuition of what it was, and I think that was such a wonderful experience, because it pushed me to go into all these details and try to understand them myself. If I were to just give you a solution, which I will do in the next session, where I'll give you my own solution and talk about it in terms of code, I don't think you would get much from it. It's such an important part of training neural networks that if you want to dive deeper, you really need to understand this concept. All right, so that's going to be the
assignment. It's an individual assignment. I think maybe one week won't be enough, so I'll try to give two weeks for this one. I'll set the deadline, and I think two weeks should be enough. I'll also provide a solution for you, though probably not next time: if I make it two weeks, then I probably won't give it to you next week. It depends on when the deadline is. But try to give me some feedback on whether one week will be enough or you need two weeks. It depends on you as well, but I would like to give you two weeks, and I think that makes sense. So let me know in the chat or in the Slack group if that makes sense. If you think you can do it in one week, then we can do one week, but for now it stays at two weeks: you have two weeks to complete it. All right, so we have a demo. Okay, how do I escape this? Okay. All right, let's jump into this one. This is a code walkthrough. Let's see how much time we have: we've already done 45 minutes. I don't think it's going to take more than an hour, to be honest. I think it will take less time, but let's see. Oh, there are so many questions and comments here. Let's see. Okay, so yes, taking the average over n points, thank you for that. That's a great answer right there: someone was asking what the one over n stands for. Just checking to see if there's anything else. Okay, most of them were comments. All right, let's move forward. Let me just try to expand this a bit. I think the most interesting part is obviously the code in the book. I really love the code segments. I don't know what your experience is, but I really love them, and I wish I had had resources like this to learn these concepts. Not everything will be explained, which is expected, but I think you have to do your part as well. You have to go through certain things yourself and try to get a better intuition, at
least. But the book really does provide a good foundation for training neural networks, and I think that's what most of you want to learn. All right, I will share this notebook later, because it will become the basis for the assignment. There's a segment that I've marked as the assignment, and for those of you who are interested in completing it, I really challenge you to try it and see if you can get it done. These are just the libraries. We have d2l, the library that accompanies the book, which is great and has some good functions that we can reuse. Obviously this is MXNet, but yes, we need to install MXNet to make use of those functions. Let me go through this really quickly. These are just the libraries we'll use throughout. There's NumPy, nothing unfamiliar here. math is just for our math functions, then we have time and random, and NumPy is used here as well. So we discussed a little bit about vectorization and how important it is. In most of the popular machine learning and deep learning courses that I've seen, there's always a chapter where vectorization is explained. It's quite an important concept to understand: when you see an equation, how do you actually implement it? Most equations are probably already in their most optimal form, so you can definitely leverage the vectorization of whatever toolkit you're using to improve speed. Whenever you have transformations on the data or something like that, you always want to use the vectorization capability of the toolkit you're using. So here it just shows you some examples of the efficiency of vectorization. I had this notebook prepared from last week, I think.
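A minimal sketch of the vectorization point, assuming NumPy arrays in place of the framework tensors used in the notebook. The names `add_with_loop` and `add_vectorized` and the size `n` are illustrative assumptions, not the book's code, and the measured times will vary from machine to machine.

```python
import time

import numpy as np


def add_with_loop(a, b):
    """Add two 1-D arrays one element at a time (the slow way)."""
    c = np.zeros_like(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def add_vectorized(a, b):
    """Add two arrays with a single element-wise operation."""
    return a + b


n = 10_000  # illustrative vector length
a = np.ones(n)
b = np.ones(n)

# Time the explicit Python loop.
start = time.perf_counter()
c_loop = add_with_loop(a, b)
loop_seconds = time.perf_counter() - start

# Time the vectorized version of the same computation.
start = time.perf_counter()
c_vec = add_vectorized(a, b)
vec_seconds = time.perf_counter() - start

print(f"for loop:   {loop_seconds:.5f} s")
print(f"vectorized: {vec_seconds:.5f} s")
```

Both functions return the same values. Only the running time differs, and the gap typically widens as `n` grows.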
Hopefully I remember every section. Just to say this again, because I always forget: all the credit for this material goes to the authors of the Dive into Deep Learning book, and to the contributors, because people are actually contributing fresh code and so forth. Nothing here is originally done by me. I just adapted the code, and we have permission to do that. I always want to say that up front, because it's really important to give people credit where it's due. All right, so some examples here. We discussed what torch.ones does. Then we have a and b, and how many examples we want, which is ten thousand, so we have these two vectors here, a and b. And this is a really neat class. Usually you won't see this, and you don't really need to implement it, but I think it's quite useful and I like that they implemented it. It gives you some nice functions to set up a timer. If you want to know how long code takes to run, you can just leverage it: you create a timer object with the Timer class, and you can easily check how long a piece of code takes to run. And here we see that for loops are not our best friends when you're building neural networks and models. We have a for loop here that adds the components of one vector to another and assigns the result to a third one, which is c. Here we're doing it individually, iterating over each one of the items. Ideally, what you want to do is something like this, where you just write a plus b. We talked about how efficient this is, and how those toolkits, say PyTorch or TensorFlow, implement these operations and give you the capability to just go ahead and do operations like this, rather than iterating over each item, which is not the most efficient way to do it.
So take advantage of those overloaded operators. This one just does an element-wise sum, which we discussed in the previous session. Let me run this one first, so you see how long it takes. By the way, this is not a really huge vector, but you can see this one took 0.12 seconds and this one took about 0.00037 seconds, so you see how efficient it is. And as you add more examples to those data structures, the for loop takes longer and longer. So pay attention to how efficient that is, and try many different examples to get an idea. This is becoming easier to check, which is the reason I discussed the profiler last time. I think the profiler in PyTorch in particular makes a lot of sense, but other tools have profilers too, where you can diagnose this and get an idea of which parts need to be improved and optimized. Maybe you have an unnecessary for loop somewhere, and those tools can help you figure that out as you practice building neural network models. Okay, so an implementation of linear regression from scratch. Let me just take some water. Okay. So this is an implementation from scratch. We have the data pipeline, and you'll see the different parts: the model, then the loss function, then the optimizer. The optimizer is the procedure that we need to train this model. Now, let me see if I missed anything here. Okay, so we have a data set here: an artificial data set whose values are generated from a Gaussian, that is, a normal distribution, and yeah, this is what it's doing:
just generating some synthetic data. You can see your X here and your y here. X is, I believe, a matrix, yes. Then there's the transformation that we discussed. We already saw that equation, and that's what you see here. matmul takes care of it: it's multiplying matrices, and you can see how you get your y. And right here you add the additive noise, which also comes from a Gaussian distribution. You can see the book for more on what this means, but this is how the contributors implemented it. Okay, so it returns that information, and let me just run this. We have our features and then our labels. If we run this, we see the shape of the features: it's 1000 by 2, so 1000 rows and two features. Then we have the labels: one label for each of the 1000 examples. This is what this very basic function actually returns, and you can see the values there. Okay, so this one is just printing out the features for the first example and then the label for the first example, so I got that right. All right, let's keep moving. This is just giving you some ideas about the things they're going to use later on. If you're starting off, this is why I like the examples: the book really starts you with the basics and then moves to the actual model and how you would implement it. I think it's very approachable in that sense. Okay, here we have the figure. Now we actually have a plot, and that's our synthetic data.
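The data generation described above can be sketched roughly as follows. This is a minimal sketch assuming NumPy rather than the notebook's framework code. The function name `synthetic_data`, the values of `true_w` and `true_b`, the noise scale, and the `seed` parameter are illustrative assumptions.

```python
import numpy as np


def synthetic_data(w, b, num_examples, noise_std=0.01, seed=0):
    """Generate features X and labels y = Xw + b + Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Features drawn from a standard normal distribution.
    X = rng.normal(0.0, 1.0, size=(num_examples, len(w)))
    # Linear transformation: matrix-vector product plus bias.
    y = X @ w + b
    # Additive Gaussian noise on the labels.
    y = y + rng.normal(0.0, noise_std, size=y.shape)
    return X, y.reshape(-1, 1)


true_w = np.array([2.0, -3.4])  # assumed ground-truth weights
true_b = 4.2                    # assumed ground-truth bias
features, labels = synthetic_data(true_w, true_b, 1000)

print("features shape:", features.shape)
print("labels shape:", labels.shape)
print("first example:", features[0], "label:", labels[0])
```

With 1000 examples and two weights, `features` has shape 1000 by 2 and `labels` has one value per example, which matches the shapes read out in the walkthrough.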
Here it's already plotted, and you can easily see that a linear model would probably be sufficient to deal with this particular data set. We run this one, and that's how you get the plot. Okay, that was it. These are more like examples and things that you will use later on. There's actually a lot of code that I skip. I didn't really want to cover it, and you can go through it on your own. So, how to read the data set. I think that's the most important thing: how to load the data set and load it efficiently. How you feed this data when you're training a model makes a lot of difference, because if you feed it wrong, if you don't feed it efficiently, you will have models training forever. I think a lot of you have experienced this before: sometimes we run out of memory really quickly, maybe because we're not loading the data properly. That's quite important to understand as well. So this one is creating a data iterator, just some way to iterate over the data, and it shows you how to do that with a manual approach. Let me just run this one. Okay, so here you get a batch of X and y using that function. You pass in the batch size, the features, and the labels, and it prints one example batch, and then of course there's a break. It's very simple code, but you can see how it's obtaining some samples here, and of course the samples come in batches of size 10.
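The manual iterator described above can be sketched roughly as follows. This is a minimal sketch assuming NumPy arrays in place of the notebook's tensors. The function name `data_iter`, its `seed` parameter, and the random stand-in data are illustrative assumptions, not the book's exact code.

```python
import numpy as np


def data_iter(batch_size, features, labels, seed=None):
    """Yield shuffled mini-batches of (features, labels)."""
    num_examples = len(features)
    indices = np.arange(num_examples)
    rng = np.random.default_rng(seed)
    rng.shuffle(indices)  # visit the examples in random order
    for start in range(0, num_examples, batch_size):
        batch_indices = indices[start:start + batch_size]
        # Index both arrays with the same shuffled indices so that
        # each feature row stays paired with its label.
        yield features[batch_indices], labels[batch_indices]


# Stand-in data set: 1000 examples with 2 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 2))
labels = rng.normal(size=(1000, 1))

batch_size = 10
for X, y in data_iter(batch_size, features, labels, seed=0):
    print(X.shape, y.shape)
    break  # look at the first batch only
```

If the number of examples is not a multiple of `batch_size`, the last batch in this sketch is simply smaller than the others.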
That's what you see in the output here. The note here is that this is an inefficient implementation, because we're doing a lot of random memory access. That's the most important part, because as you see, we're iterating over each one of those items and trying to figure out our batches and so forth, and this is how it
Original Description
Dive into Deep Learning (Study Group): Linear Neural Networks | Session 3
In this third session of our "Dive into Deep Learning" study group, we will begin to look at linear neural networks which provide a great introduction to some of the most fundamental concepts in deep learning.
Entire playlist: https://www.youtube.com/playlist?list=PLGSHbNsNO4ViFXawDmx-kEz7zGziOpNSb
You can find more information about the deep learning study program and upcoming sessions here: https://github.com/dair-ai/d2l-study-group