Utku Evci - Sparsity and Beyond Static Network Architectures

Cohere · Advanced · 📄 Research Papers Explained · 3y ago

Skills: Research Methods 90% · Reading ML Papers 80% · Paper Reproduction 70%

Key Takeaways

Utku Evci discusses sparsity and dynamic training in neural networks, focusing on dynamic sparsity as a popular research topic, and explores the benefits and techniques of sparse neural networks. He also delves into the concepts of retrieval augmented generation, fine-tuning, and sparse architecture.

Full Transcript

Thank you so much for being here today. My name is Madeline, I use she/her pronouns, and I'm with Cohere For AI as Community and Outreach Specialist. It's my absolute delight to be supporting our research lab, particularly the community side of things, and today I'm supporting two fantastic community organizers, Nahid and Harsha, who have volunteered to organize and present this presentation. Thank you for being here. If you have any questions as we go, please feel free to put them in the Q&A at the bottom of your screen, and we'll get to them as makes sense in the flow of the presentation. Without further ado, it's my pleasure to hand things over to Nahid.

Hi everyone, thanks for making time today. I'm Nahid; you might have seen me lurking around the Cohere For AI community for some time. I'm a computer vision engineer working at a camera company, so I'm very interested in building sparse models and pruned models. We recently noticed in the community that there are other folks, like Harsha and a lot of others, who are interested in this same space. Sarah actually mentioned Utku and his work on sparsity, especially dynamic sparsity, so we decided to invite him. Harsha, maybe you want to introduce yourself, and then we can go from there.

[inaudible] community as well. I mainly work at the intersection of theory, efficiency, and deep learning, and I'm really excited for this talk. This is definitely something that has been a really popular research topic, and more and more research is going into this area. So thank you to Nahid and everyone at Cohere For AI for helping set this up. Looking forward to it; you can go ahead.

Thank you, thanks for inviting me. I've been following Cohere For AI through Twitter, and it looks like there is a lot of impact and different people working on a lot of different things. It's good
to see that in the community there are alternative paths. So I would like to start with thanks for doing this and thanks for inviting me. While I share my screen... do you see my screen? Great.

I'm a researcher at Google Brain in Montreal. I started with the residency after my master's at NYU, and I've been working on sparsity, sparse training in particular, and efficiency for the last three or four years. During this time I've had the privilege to work with a lot of amazing collaborators and publish some research, and I put their names here to acknowledge them. I will be highlighting the individual collaborators in each project, but I wanted to highlight them here too.

So let's start. If you have any questions on any of these topics, feel free to interrupt me; I will try to follow the chat. Also, my voice is a little bit risky, in the sense that I might lose it at some point, so if that happens I might become a little quieter towards the end.

The title of the talk is sparsity and dynamic training, so I would like to start with the motivation and the usual question: why do we want sparsity in our neural networks? This is, I think, ill-posed, in the sense that all of our neural networks are already sparse. You can take any neural network and convert it to one big MLP that is sparse, and that may or may not have parameter sharing, but it's highly sparse. The question of "why sparse" is therefore a little bit off, because we don't connect all of our neurons to all other neurons; they are only connected layer-wise, and therefore sparsity is already there. So I guess the question should be: why do we want networks that are more sparse than
the existing neural networks? The answer to that is the scaling curves. We observe that larger neural networks get better results, and in return that means that if you can make a neural network more efficient, you can increase your initial capacity to get to a better performance point. By increasing the efficiency of your model, you are pushing your scaling curve towards higher performance. So, in a way, efficiency is equal to better accuracy. Thus, if you can make the existing network sparse, you will get better scaling curves, which means that for the same FLOPs or the same parameter count you will get better performance.

This has been shown in the past, and I just wanted to highlight a few papers here. One of them is Kalchbrenner et al. (one second, I'll try to get the pointer working here). They show that sparse networks, on the right side of the plot, get a better, that is lower, loss, and all of these points are networks with the same parameter count. So if you make your network sparse, you actually get better performance at the same parameter count. This is for audio synthesis, and here is a similar result for the RoBERTa architecture: the curve that gets the best performance is a large model that is pruned. So this better scaling with sparsity is observed in different domains. I didn't put computer vision here; that's the most obvious one. This is the reason why we should think about how we can make our neural networks more sparse.

The main question I think we should be asking, therefore, is the how: how do we want to introduce sparsity in our neural networks? There are different dimensions. One is what you sparsify. You can have activation sparsity in your neural network; activations are data-dependent, the data that is created as inputs pass through the neural network. And you have parameters, which are constant and data
independent. You can introduce sparsity in both of these. You can have static or dynamic sparsity. Activation sparsity is more dynamic because it depends on the data; in a similar way, you can think of dynamism through time, so you can change your sparsity pattern or amount throughout training. That's what I mean by dynamic versus static. You can think about a constant parameter budget for the architecture, or other kinds of budgets, and still think of pruning and growing operations as a part of the training recipe. With a fixed budget, this could be something like reviving existing neurons; if you don't have a fixed parameter budget, you can think of increasing or decreasing your total network size through pruning and growing.

You can have different granularities of sparsity. One coarse way of introducing sparsity is removing entire layers. You can remove entire neurons, you can think about blocks in your weight matrix and sparsify them, and recently there is also N:M sparsity, supported by NVIDIA GPUs. So there are different granularities, and the rule of thumb is: the less structure you put on the sparsity itself, the better performance you get; but structure allows you to accelerate, and structured patterns are more hardware-friendly.

You can think of single versus parallel paths. There has been some research on sparse-plus-dense architectures, where you replace existing layers with a family of sparse layers, or sparse plus small dense layers, and this has been shown to be quite promising. There are the butterfly matrices, from Beidi and Tri at Stanford; this family of papers is quite promising, they use this idea of parallel paths, and that is quite an interesting direction.

And finally, how do we distribute our parameters and computation across our neural network, and across time? This is also quite important: understanding that
and understanding the principles there is quite important. I've worked on different parts of these questions. I mainly focused on parameters and dynamic sparsity; I worked on pruning and growing, on unstructured sparsity, on the single-path setting, that is, pruning or sparsifying an existing layer, and also on non-uniform sparsity. But today I'd like to focus on this dynamism part and talk a little bit more about sparse training and then growing.

On the agenda for today: given that we started a little late, and my voice, and that it's already 15 minutes in, I might not be able to talk a lot about the last part, which I will probably mention quickly. The rest is focused on the training part of dynamism, and I'll explain what I mean by that in a few slides. I would like to motivate why dynamism is needed or beneficial, and then I will dive into some of our work around sparse training, growing, and particularly applying sparsity in different [inaudible].

So let's start with dynamism. I pasted this famous quote: "It is not the strongest of the species that survives, nor the most intelligent; it is the one that is most adaptable to change." Apparently it's not a quote from Charles Darwin, and maybe it's from Leon Megginson; we don't know, and it doesn't matter. Charles Darwin has a lot of other great research. I've been listening to this book on where innovations come from, and there are a lot of ideas in it, and a lot of those ideas include some kind of dynamism: mistakes and variety often bring improvements, or innovations. In a similar way, we can think that a neural network that is fixed and constrained to the existing architecture is less likely to discover novel ways of learning. We can also look at our brain: our brains grow synapses, and then there's some pruning throughout our life, but
there's quite a bit of dynamism in the way our neurons are connected. So there are all kinds of motivations for thinking about neural network training in a more dynamic way.

We can have different ways of introducing dynamism in neural networks. One, maybe straightforward, direction is the training itself. We often have these fixed structures, ResNet-50s, ViTs, Transformers, and we often just train the parameters; we don't really change how the neurons are connected, and to me that seems like a big limitation. In the past, and I will talk about this briefly, we applied this dynamism idea in a constrained way: inspired by this direction of changing the architecture, we applied it to sparse training and improved the training performance. So this kind of dynamism would be one way.

Another one is the execution. There has been quite a bit of work on sparse mixture of experts, that is, conditional computing: you can look at your data and pick the parameters or modules that are most relevant to the data you've seen, and that can help you reduce your computation.

Another way we can think about dynamism is deployment, or recycling of previous models or checkpoints. We can also view transfer learning or iterative learning ideas through this lens of dynamism, in which we look at our neural networks as having a life where they may be pre-trained and then used and used again, and they constantly change. I put some examples here; I'm going to skip this because I already mentioned them. So there's already some evidence in the literature that shows benefits of adopting such dynamic approaches to neural network training, deployment, or execution.

To summarize, why should we care about dynamic neural networks? One is that we're probably not
going to be able to find an architecture that will be perfect for all the tasks that we might want to use machine learning for. Obviously, one easy answer is: it doesn't need to be perfect, let's just use Transformers, they seem to be good enough. But if you can figure out how to do this discovery in an efficient way, then we could actually think about finding the optimal architecture for each task itself. Therefore there is, I think, quite a bit of room for improvement in that direction, to learn the architecture itself, but it needs to be efficient.

The other one is more for less: you can use these ideas to come up with more efficient optimization algorithms. This matters a lot now; there are startups that focus on how we can reduce the cost of training, and dynamism could be one part of it. One easy example: you can start with a small architecture and grow it into a bigger one as a part of the training. So that's another way that we can use dynamism.

Then there is incremental learning. If you're thinking about lifelong learning agents that learn from users or adapt to them, then we are talking about some incremental learning setting, and there you have to be, by definition, more adaptive. Only changing the weights is probably not going to be enough, so there we will probably think about new output heads, new input data, and some kind of adaptation that goes beyond parameters.

And finally, at least in my world, an ideal neural network algorithm should reach AGI, and it should always improve given more data and compute. However, in our current paradigm, where we have a fixed neural network, that is not possible, in the sense that there will always be a limit: when time goes to infinity, our performance will be capped because of this limitation of a fixed
architecture. The hope with this dynamic training direction is that we can utilize the space of all possible architectures, so that given more training or more data we can get better results. At least guaranteeing such an improvement would be, I think, quite interesting.

There is one question in the chat. Yes, I'm biased; my Twitter activity is probably a bit biased, but I've been following MosaicML, and they've been reporting quite a bit of improvement on different training recipes, for language models or vision. There is, I think, more and more focus on improving that part of machine learning. Even for research, if you run a bunch of experiments and you can run them more efficiently, or run two experiments instead of one, those are great benefits, and that is what I was pointing out. I hope that answers it.

One question I have from my side: when we say dynamic sparsity, it's really on the training side, right? Not after the fact?

Exactly, and I will talk a bit more about that. The dynamic sparsity that I will be talking about is exactly starting with a sparse architecture and training that efficiently.

Cool. I will close the chat, and now I would like to dive a little bit into the papers that I've worked on in the recent past. The first part is two papers. In one of them we focused on, as I was referring to, the sparse training problem. The second one is a follow-up on better understanding the algorithm that we proposed, and also on understanding why lottery tickets get good performance. I will mainly focus on the first paper here, but quickly mention a few key results from the second work too.

So how do you find these sparse neural networks? I removed the slides on what I mean by sparse networks; I hope that is clear to everyone. We are talking about, let's say, ResNet
50, where in each layer every neuron often connects to every neuron in the previous layer. When we say sparse, a neuron is going to connect to only a subset of the previous neurons, so there will be some connections between neurons that normally exist but that we wouldn't put in the sparse network.

So how do we get them? As I just said, you can start with a dense network and prune it to get this sparse network. This is the brown curve here; this is ResNet-50 trained on ImageNet. You get pretty much the same performance at 80 percent sparsity, so you remove 80 percent of your parameters. The performance decreases with higher sparsity, but overall it gets decent performance. One observation made by various people, but maybe made more popular by the lottery ticket paper, was that if you train the exact same sparse network that is found by pruning from scratch, you don't get good performance. There's a significant performance gap, as you see here, between the pruned solutions and the sparse solutions that are trained from scratch. The lottery ticket initialization, which uses the original dense initialization, helped close this gap for smaller settings and settings that use lower learning rates. However, if you look at this ResNet-50 scale, the initialization itself actually doesn't matter, and this gap still persists.

So in this work, we asked this question: there's this huge gap, and it would be nice to bridge it. Can we train these sparse networks from scratch, end to end, without ever needing the dense parameterization and without sacrificing performance? So we want to be able to match the dense-to-sparse pruning performance. The answer was yes, and we called this method Rigging the Lottery, because we could train any random sparse initialization to a good solution, and
sometimes we even surpass the pruning performance. The idea is the following. You start with a randomly initialized sparse network and you train it for a while. After a while, let's say every N steps, you reconsider the sparse connectivity in your architecture. Specifically, for each layer you look at the weights that exist in that layer and prune a small fraction of them by looking at their magnitude. So if, let's say, we have 20 percent of the weights in that layer, we will remove 30 percent of that 20 percent, and then we will grow new connections. In a way, we rewire some of our neurons, and the total parameter count in the layer doesn't change during this: we remove, let's say, 10 connections and add 10 new connections. For these new connections that are added, we found that if you look at the gradients, you get significantly better performance than activating new connections randomly. This gradient is calculated every N steps, and you don't necessarily need to store it; you only need the top-k, so you can implement this efficiently without requiring the full memory of the dense network. Since we train sparsely, and the full dense gradient that we need is only consumed by a top-k, which means that you don't need to materialize it, this algorithm is fully sparse. If you had an architecture that supports sparse training, this would utilize it, in the sense that it will never need the full dense parameter budget or memory footprint.

So this is the summary. I'm going to skip through the rest, because I also want to talk about the remaining slides. We found that if you train your sparse network with RigL, you actually learn to discover the right features. This is the first layer on flattened MNIST, and we visualize the number of connections from each pixel. RigL basically learns to allocate most of its connections to
the center of the image, and if you know MNIST, this is where the information is, which is a cool feature. If you look at these training FLOPs versus accuracy curves, RigL does best compared to the other baselines and matches the pruning performance using a similar amount of FLOPs.

This is for regular uniform sparsity. Another main finding in this paper was that you can do better if you smartly distribute your parameters across different layers. Let's say 80 percent sparsity is my target and I allocate 80 percent sparsity to every layer; that strategy is called uniform. However, you can actually do better by allocating more parameters to smaller layers and fewer parameters to the bigger layers, in terms of percentage. If you do that, which is what ERK does, you get significantly better performance: the purple curve at the top is better than the blue one. And if you look at the high-sparsity regime, you get even better performance than pruning. So this highlights the utility of end-to-end sparse training and, more importantly, the utility of dynamic training. Here, if you don't change the architecture itself, so you stick with the initial random sparse topology that you started your training with, then you get this suboptimal static line, which is significantly worse than the dynamic sparse training alternatives.

One last thing I want to highlight here: we always talk about pruning and efficiency as, we have this dense network and we make it smaller without sacrificing performance. The other side of that coin is keeping the total number of parameters the same and increasing the sparsity and the width of your neural network simultaneously. By doing that, you get a wider sparse network that has the same number of FLOPs or parameter count as
your original dense network. However, this network, due to being wider, will perform significantly better. This is shown by the last two rows here: we have Big-Sparse models, which have the same FLOPs as the original dense one, and they get, for example, around four percent better performance on ImageNet. Again, whenever we see efficiency, we should also immediately infer that this means better accuracy, because with the compute and memory budget that you save you can scale your neural network, and we know that larger networks do better. Thus efficiency improvements are as important as accuracy improvements. So I repeated that again.

There is one more question from Diganta; maybe I can pause for that.

Yeah, I can read it out for the others as well. "Sparsity" as a keyword can be very ambiguous, is his comment; sometimes it refers to weight sparsity, or activation sparsity, or that of routing-based mixture of experts. So Diganta asks if there's an established definition of what underlies sparsity. It's a pretty good question.

Yeah. Thank you for the question, firstly. Second, if you look at the vocabulary, I guess we will get a definition, and that is probably a good one: given a set of elements, only a subset is active. You can apply that to weights, and you can apply it to activations. I talked about this at the beginning of my talk, when I was talking about the "how" of sparsity: activations versus weights. I guess we should make it clear, when we say sparsity, which one we refer to. Weight sparsity was very popular before, so I guess the word was mainly used for that, but now we also think about activation sparsity. I think we can just prepend it, saying weight sparsity or activation sparsity at the beginning, and hopefully, if we continue speaking about sparsity, people can infer from that which one we refer to. But I'll
try to be more specific going forward when I say sparsity. Here, so far, I have talked about weight sparsity.

...you can train more with your compute budget; is it thus that sparse representations represent data better within the network, gaining better performance?

One reason why sparse networks... so, thanks Max for the question. I read the question, but maybe that wasn't good enough; I don't know whether everyone sees the chat. I think I kind of read it for myself.

No, no, it was good, I think.

Okay. So Max was asking why sparse networks do better. One obvious reason is the width: you increase your width and increase the sparsity so that you have the same parameter budget, and we know, these are the scaling curves that we all know, that wider networks bring better performance. A wider network, with more parameters, given enough data, will often give you better performance, and that's the main reason why sparse networks do better than dense ones. Apart from that, there is also some evidence when you compare two networks, one dense and one sparse, at the same width, so the sparse network will be more efficient. There is some evidence, especially for smaller datasets, that sparse networks sometimes generalize better, and they even sometimes have better adversarial robustness. I can't remember the names of the papers off the top of my head, but there's a pruning survey, and maybe those are cited there; there's a line of work that looks into these kinds of characteristics at the same width. If you compare two networks and one has higher width, then obviously higher width is different, it is more expressive, and you can come up with ways to show this even using math. But even at the same width there seem to be some benefits to using sparsity; in a way, it's a better regularizer, a way to regularize your network. So that was a long
answer; I hope that addresses it. Cool, did I miss any other questions?

I think there was a follow-up. Jason also had a similar question to Max's; I hope he got his answer. And Diganta is asking: what is the intuition on why lottery ticket initialization doesn't work well when training from scratch?

Yeah, I will answer that. I also see Jason has a question; maybe I can quickly answer that one first and then go to Diganta's. In this plot, Big-Sparse has the same number of parameters as the dense network. With uniform sparsity, this is the 76.4, it also has the same amount of FLOPs. If you use ERK, you actually double your FLOPs at the same parameter count. So all three of these models have the same parameter count, but the ERK one has more FLOPs because of the sparsity distribution, and it gets better performance, but it increases the FLOPs. So, to answer your question, all these models have the same parameter count as the dense one. And now Diganta's question on why lottery tickets don't work on ResNet-50.
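The point about uniform versus ERK allocation (same parameter count, different FLOPs) can be illustrated with a small sketch. It assumes the ERK rule as stated in the RigL paper, where a convolutional layer's density is proportional to (kernel height + kernel width + input channels + output channels) divided by its parameter count, with layers that would exceed density 1 kept dense. The `ConvLayer` class, the four-layer toy network, and its sizes are hypothetical and are not the ResNet-50 from the talk, so the exact FLOPs ratio will differ from the one quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical conv layer description (all sizes are illustrative)."""
    name: str
    kh: int      # kernel height
    kw: int      # kernel width
    c_in: int    # input channels
    c_out: int   # output channels
    out_hw: int  # output feature map is out_hw x out_hw

    @property
    def params(self) -> int:
        return self.kh * self.kw * self.c_in * self.c_out

    @property
    def dense_macs(self) -> int:
        # Multiply-accumulates for one image when the layer is dense.
        return self.params * self.out_hw * self.out_hw


def erk_densities(layers, target_density):
    """Per-layer densities proportional to (kh + kw + c_in + c_out) / params.

    Densities are rescaled so the total number of kept weights matches
    target_density; layers whose density would exceed 1 are kept dense.
    Assumes 0 < target_density < 1.
    """
    total_kept = target_density * sum(layer.params for layer in layers)
    raw = {layer.name: (layer.kh + layer.kw + layer.c_in + layer.c_out) / layer.params
           for layer in layers}
    dense = set()
    while True:
        # Budget left for the layers that are still sparse.
        budget = total_kept - sum(layer.params for layer in layers
                                  if layer.name in dense)
        norm = sum(raw[layer.name] * layer.params for layer in layers
                   if layer.name not in dense)
        eps = budget / norm
        over = {layer.name for layer in layers
                if layer.name not in dense and eps * raw[layer.name] > 1.0}
        if not over:
            break
        dense |= over
    return {layer.name: 1.0 if layer.name in dense else eps * raw[layer.name]
            for layer in layers}


def kept_params(layers, densities):
    return sum(densities[layer.name] * layer.params for layer in layers)


def sparse_macs(layers, densities):
    return sum(densities[layer.name] * layer.dense_macs for layer in layers)


# Toy network: channels double while the spatial size halves.
LAYERS = [
    ConvLayer("conv1", 3, 3, 64, 64, 56),
    ConvLayer("conv2", 3, 3, 128, 128, 28),
    ConvLayer("conv3", 3, 3, 256, 256, 14),
    ConvLayer("conv4", 3, 3, 512, 512, 7),
]
TARGET_DENSITY = 0.2  # 80% overall sparsity

uniform = {layer.name: TARGET_DENSITY for layer in LAYERS}
erk = erk_densities(LAYERS, TARGET_DENSITY)

if __name__ == "__main__":
    for layer in LAYERS:
        print(f"{layer.name}: uniform={uniform[layer.name]:.3f} erk={erk[layer.name]:.3f}")
    ratio = sparse_macs(LAYERS, erk) / sparse_macs(LAYERS, uniform)
    print(f"ERK / uniform multiply-accumulates: {ratio:.2f}")
```

In this toy setting the small early layers, which run on large feature maps, receive a higher density under ERK, which is why the compute goes up even though the number of kept weights is unchanged.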
In our follow-up work, Gradient Flow... sorry, the chat box covers the slides, so I was struggling a bit, sorry about that. In this paper, Gradient Flow in Sparse Neural Networks, we investigated the initialization for sparse networks. One thing we realized is that if you have heterogeneous sparsity, which means different neurons have different numbers of incoming connections due to sparsity, then you have to adjust your weight initialization so that you preserve unit variance at every neuron. This seems to impact the results a lot for earlier architectures like VGG, or smaller networks without batch norm or skip-connection tricks. However, if you go to more recent families of architectures, you observe that the importance of initialization is less. All these tricks like batch norm and skip connections help a lot with training, even though your initialization is often, you know, not perfect, and that's completely fine. That's what we also observed: changing the sparse initialization didn't affect the results much at all in ResNet-50, for example.

In those architectures, we found that the gradient norm during early training impacts the training results a lot. Here you see a plot for RigL, for example. We look at the gradient norm: we take the gradient of the weights and calculate the norm of that flattened vector. If we plot that, we observe that RigL updates, where you change the connectivity, increase this metric early in training, and that correlates with the better performance that we obtain. However, as I said, changing the initialization doesn't help.

To answer Diganta's question: lottery tickets don't work because these ResNet-50 architectures use such a high learning rate, and they are optimized so that those big steps don't
affect your overall optimization. Due to those large steps, and the tricks that manage to keep the training in place, initialization is kind of irrelevant: you have these catapult steps at the beginning that make you jump a lot, and the individual initialization doesn't matter at all.

Maybe another thing I can say to answer that question is the final key result from this work: we looked into the similarities between lottery ticket solutions and pruning solutions. If you are familiar with the lottery ticket initialization, what it does is this. It makes a pruning run: you start with a dense network and prune that dense network during training, and you get a pruning solution that often performs really well. If you remember the curve with the brown result, that's pruning, and it does really well. What the lottery ticket does is take the mask from that solution, so the same connectivity pattern as in that network, and use the initialization from the original pruning experiment, so the exact initialization used for that dense network. You apply the mask that you found through pruning, and you get a sparse initialization. If you train that, you sometimes get good results.

What we did is look at how similar these lottery ticket solutions and the pruning solutions are. What we found is that, distance-wise and function-similarity-wise, these two solutions are next to each other. In a way, when you apply the mask found by the pruning solution at initialization, you nudge your initialization towards the pruning solution that you found. If your learning is such that you don't jump a lot, or your initial learning rate is not too high, you are able to re-locate the basin that includes the good solution. So the lottery ticket works if it can re-find the pruning solution, and it doesn't if, due to large learning rates, you
jump. For example, in ResNet-50 maybe you start like this, but due to the large learning rate you jump to other parts of your energy landscape, and you are not able to find your pruning solution again. So that was a long answer; hopefully it answers the question.

Okay, so we have 18 minutes; I will continue with the second work. So far I talked about random sparse networks trained dynamically using gradient information, and all of this work was about the connections, the edges between neurons. In the next work we looked into whether we can apply this idea of maximizing gradients, or improving the gradient flow during training, to dense neural networks: can we grow neurons so that our overall gradient norm is increased? That led to this work that we call GradMax, gradient-maximizing neural network growth, which we presented at ICLR last year.

The motivation for growing is the following; I already mentioned part of it in the motivation for dynamism. First, in continual learning, where you learn new tasks, ideally you also want to grow new capacity; progressive nets are one example, where you increase your capacity throughout your learning process. Secondly, you can use growing in an architecture search setting: you can think of starting with a seed architecture and finding the best places to add new capacity, and by doing that you can hope to find better architectures. This is, I think, an exciting direction that could enable the ideal setting I was talking about, where you optimize the architecture and the weights together, so that when you train longer you always get better results. And finally, you can start with a small architecture and grow it. This has actually been used widely, I think more and more, especially with very large models: people often don't train them from scratch, and if something changes, people often warm-start from previous checkpoints and results and often grow new
capacity whenever needed. This helped, for example, OpenAI in the Dota 2 project to experiment and reduce the overall project time.

The existing work in growing can be grouped under two categories. One of them is splitting. If we have a toy neural network with two neurons in this layer, we can think of splitting as copying one of the existing neurons, or a group of neurons, and then adjusting the outgoing weights so that the sum of the two outgoing weights equals the original outgoing weight. This is the same neuron duplicated, so the network produces the same output, but now you have more capacity. Often you add a small noise to these identical neurons to break the symmetry. However, as you can see, there's a lot of repetition: you create new neurons that do pretty much the same thing as existing neurons, which is often not ideal. The other family adds new neurons that are independent and ideally orthogonal, or somehow orthogonal, to the existing neurons, and there is also some previous work on this. In this work we focus on adding, because of the limitations of splitting I mentioned, and also because in our experiments we see that adding is a better strategy than splitting.

So here is the motivation: we add a neuron and try to maximize the gradients through this newly added capacity. One thing we want is to not change the output of the network, which means we make sure either the incoming or the outgoing weights are zero. That in turn means we cannot change anything about the existing capacity, including its weights and gradients, so the only thing we can maximize is the gradient of the new capacity that we are adding. Here I'd like to quickly motivate why maximizing the gradient is a good idea. It comes from the Taylor approximation. If you change your weights
slowly, which in our deep learning setting means a small learning rate, you can fit a quadratic model and think of its minimum as the combination of your existing loss plus a term that depends on the gradient itself. When you maximize the gradient you increase this term, which means you decrease the loss value at that minimum. In other words, under this approximation you get a smaller training loss in your next iteration: everything else being equal, if your gradient is larger you decrease your loss faster. Obviously this is for one step, and we often take many steps in training, so the key is improving gradients throughout training. Our hope is that if you maximize the gradient during these growing steps, that will be sustained for a while and overall we will get faster optimization.

I mentioned this already: specifically in GradMax we set the incoming weights to zero, because this has the nice property that your activation is zero, and if you apply the nonlinearity you already know what it is going to be (zero, depending on the activation), which is a nice feature to have. We then optimize the outgoing weights. This is the optimization we are doing: we maximize this value by choosing the outgoing weights, and the value we are optimizing is the gradient in the incoming weights. Why that is: the gradients at the outgoing weights are zero due to the zero activation, so the only nonzero gradients are on the incoming connection side. We also want a fixed norm for the outgoing weights, because by changing this norm you can make this quantity arbitrarily large. So we fix the norm of the outgoing weights, and in the paper we show that this problem reduces to a spectral problem, where there's a matrix that we
can efficiently calculate. If you calculate the top-k eigenvectors of that matrix (there's a typo here, it should be eigenvectors), they give the optimal initialization for the outgoing weights, the one that maximizes this gradient for a fixed norm. And since we set the incoming weights to zero, and the rest of the network doesn't change due to this zero initialization, we can solve this spectral problem independently for each layer, which is parallelizable, a very good property.

In certain settings this closed-form solution doesn't exist. One example is if you were to set the outgoing weights to zero and initialize the incoming weights to a nonzero value; then the closed-form solution doesn't work. In such settings we can still use the same idea, the same optimization problem, to come up with an initialization, and we do that with regular gradient-based optimization: you can optimize this value using automatic differentiation in your favorite library.

Our experiments start with a toy setting where we have student-teacher networks. You have one network, a two-layer MLP initialized randomly, that generates the data, and there's another randomly initialized student network with the same architecture that tries to learn from the data the teacher created. The nice thing about this setting is that if the student can learn the weights of the teacher, it can actually get zero training loss, so there is a global minimum at zero in terms of training loss, which is a good thing. We have different experiments in this setting. First we look at the gradient norm after growth: we measure the gradient norm after growing new capacity and observe that GradMax indeed gets the best result. We compare it with random, and the optimized version improves
over random; however, it is not as good as the closed-form solution. If you look at the gradient norm over the course of training, it goes up at the time of growth. Here we grow five times, at 200, 400, and so on, and the gradient norm decreases afterwards; however, it is sustained at a higher level compared to random, which is a good thing and is what you need to really get a better training loss. We can also see this in how much the loss improves compared to the baseline of growing randomly. We have these full training curves, where gradient-based growth does better than random.

One thing I want to highlight here is that you start with a smaller architecture, which is this green curve; I think in this case it has five neurons at the beginning. And there's the teacher architecture: in a way you start from the green and grow it into the red, the larger architecture, which has 10 neurons in the hidden layer. Ideally the grown architectures should match this red curve. However, what we observe is that in smaller architectures this is not the case. There is this gap that we should be able to bridge, but even the gradient-maximizing initialization doesn't entirely solve it. It improves over random, but there is still a gap. This gap doesn't exist for this larger setting.

We also have ResNets, VGGs, and MobileNets on CIFAR and ImageNet datasets. The improvement over random is marginal: there is a half percent or one percent improvement, which is not great. But more importantly, there still seems to be a big gap between the grown networks, which are these three lines here, and the big baseline, which is training the large network from scratch. Ideally these curves should match that one, as I said. So there seems to be some exciting and important research to be done to understand why this happens and then bridge the gap itself.

So this was GradMax and its initialization. Given that I have six minutes, I
will just say one thing about this work. Here it says "soon at ICML"; I guess that has passed, so we presented this work at ICML this year. The motivation was that all the work I talked about so far, and most of the research in sparsity, is done in computer vision. This is a little troubling, because computer vision is something we are already very good at, and it is one piece of a very big puzzle: of all the things we can use machine learning for, computer vision is, I think, a small piece. If we do all of our baselines and benchmarks in this domain, we are in danger of overfitting our algorithms and recipes to it. So one thing we wanted to do is apply and compare these sparse training and pruning methods in reinforcement learning, and that's what this work is about. We apply sparse training algorithms, both dynamic sparse training and static sparse training, in different RL environments. I think RL is quite interesting because of its challenging training dynamics: the data distribution changes over the course of training, and I think it is more realistic than the computer vision classification tasks that we all love and use. So there is a need, I think, for applying these algorithms to more realistic settings, and hopefully this work is one step towards that.

We use these methods, and maybe I'll just pause on the slide for a second. What we observe is that using non-uniform sparsity was the key thing that changed performance significantly. If you have played with RL and looked at the architectures used in RL, there is a lot of imbalance between how large the layers in the middle are compared to the first or last layer, and in the past, when people pruned these networks, they often pruned all layers uniformly. When we apply ERK, the distribution I mentioned that allocates the weights more wisely, we observe
significant improvements. With SAC in MuJoCo and DQN in Atari we were able to match the original dense performance with 90% sparsity on average, where before it was around 50%. So this is good news, I guess, for sparsity. Secondly, we had this surprising result with a ResNet backbone and the DQN algorithm trained on Atari: there, pruning actually improved results significantly. This is the original dense baseline; we pruned 90% and were able to get almost double, or about 50% better, results compared to the dense network. So this is smaller, but somehow it also got better results. This is something we are also looking into now. Hopefully this highlights the importance of thinking about and using different domains in research. You might be surprised by the results, in a good or a bad way, and both are important for improving our methods. So I will pause here; there are two minutes left for questions, and maybe I'll go to my last slide.

Today I didn't talk about Head2Toe, which is more about structured sparsity applied in transfer learning, so feel free to check that out. I talked about RigL, which is a dynamic sparse training algorithm, and GradMax, growing neurons. Hopefully these, and a bunch of other work too, motivate the promise and importance of dynamism in our research and in the training of neural networks, and similarly the importance of sparsity. Feel free to follow me on Twitter. I would like to conclude with this: it's not a single ResNet, Transformer, or Mixer that survives [inaudible]; it will be a dynamic neural network that is most adaptable to change. I guess that is my prediction. With that I would like to conclude, and then maybe I can take the questions. Thank you.

Yeah, I think there was one question a little while back by David, which we could start with really quick. The
question is: is there any overlap, correlation, etc. of parameter signs in sparse architectures and the initial dense architecture, and is there any work generally on that?

I think it is called supermasks. There's this work from Hattie and others from Uber AI, which I think has ceased to exist. You can look into that work; there is some correlation indeed. Let me find the name... yeah, it's called Deconstructing Lottery Tickets. So there seems to be some correlation in the lottery ticket setting. I don't exactly remember their conclusion, so it's probably best to check the paper. I don't know how to send a message; it looks like I can send it to backstage.

Yeah, I can re-share it in chat really quick. I think the next question, by Jason, is: do grown networks here end up with the same architecture as that of the baseline?

Yeah, in that paper, yes. We wanted to really focus on the effect of growing, so we had a fixed schedule for where the neurons are grown. In that plot you start with the small architecture and grow it into the same big architecture, so at the end all of the grown architectures and the big baseline have the same architecture.

And I believe the last question, if we have time: Diganta asks if you could highlight how GradMax doesn't violate normalization (the unit variance).

So, for batch norm, the tricky thing is that if you have a zero activation, batch norm will go nuts. If it's always zero, you now have zero variance and it will try to scale it, and that won't work. And the epsilon
used in batch norm will kick in, and that will also mess things up for the gradients, because you divide things by that epsilon, which means you also divide your gradient by epsilon. There are some fixes you can do for that, but for the batch norm experiments, if I remember correctly, we did it the other way around: we set the incoming weights so that the gradients in the outgoing weights are maximized. So the outgoing weights are set to zero and we maximize their gradients. In that setting your neuron has some activation and it gets normalized by batch norm. That doesn't work with layer norm, as you may imagine, because layer norm normalizes using all of the neurons, and a nonzero neuron will affect and change your function. We discuss all of these in the paper, if I remember correctly in the appendix and in the main text. So there are ways of keeping the function itself. Some of them work with the closed-form solution of GradMax, and where it doesn't work we use the optimized version of GradMax.

Thanks for the questions again, everyone, and thanks for joining.

Thank you so much for joining in while being under the weather. It was an excellent presentation; as someone who also works in this area, I got a lot of inspiration, and I'm sure others interested in the general topic did as well. Thank you so much for joining once again.

Thank you for this thorough presentation while under the weather, and for such good engagement from our community. I think everyone is walking away with a little bit more to think about. Have a great rest of your Friday, everyone.

Thank you, everyone. See you, bye for now.
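The GradMax step described in the talk (zero incoming weights, outgoing weights chosen from the top-k eigenvectors of a matrix, fixed norm) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation. It assumes the matrix is the batch average of the outer product between the loss gradient at the next layer's pre-activations and the previous layer's activations, and that the activation's derivative at zero is 1. The function names, argument shapes, and the equal per-column scaling are choices made for this sketch. The left singular vectors of that matrix are the eigenvectors of the matrix times its transpose, which is how the sketch connects to the "top-k eigenvectors" mentioned in the talk.

```python
import numpy as np

def gradmax_outgoing_init(grad_next, act_prev, k, norm=1.0):
    """Sketch of a GradMax-style init for k new neurons (names are hypothetical).

    grad_next: (batch, m) loss gradient w.r.t. next-layer pre-activations.
    act_prev:  (batch, n) activations feeding the layer being grown.
    Returns outgoing weights of shape (m, k) and the averaged matrix (m, n).
    Requires k <= min(m, n).
    """
    batch = grad_next.shape[0]
    # Batch-averaged outer product; its top singular directions carry the
    # largest gradient signal for the (zero-initialized) incoming weights.
    m_mat = grad_next.T @ act_prev / batch
    u, _, _ = np.linalg.svd(m_mat, full_matrices=False)
    # Assumption of this sketch: equal scaling per column so that the
    # Frobenius norm of the outgoing weights equals `norm`.
    w_out = u[:, :k] * (norm / np.sqrt(k))
    return w_out, m_mat

def incoming_grad_sq_norm(w_out, m_mat):
    """Squared Frobenius norm of the gradient at the new incoming weights."""
    return float(np.sum((w_out.T @ m_mat) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=(64, 16))
    h = rng.normal(size=(64, 12))
    w, m = gradmax_outgoing_init(g, h, k=3, norm=0.5)
    # Compare against random orthonormal columns with the same norm.
    q, _ = np.linalg.qr(rng.normal(size=(16, 3)))
    print(incoming_grad_sq_norm(w, m), incoming_grad_sq_norm(q * 0.5 / np.sqrt(3), m))
```

Because the incoming weights of the new neurons are zero, adding them leaves the network's output unchanged, and the computation only needs per-layer quantities, which is why it can be run independently for each layer.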

Original Description

Hosted by Cohere For AI Community members Nahid Alam and Sree Harsha Nelaturu. Utku Evci on Sparsity and Beyond Static Network Architectures: Going beyond static architectures and using dynamically (1) trained, (2) executed or (3) adapted architectures has been shown to provide faster optimization, better scaling and more effective generalization. In this talk I will give a short overview of these results and share some of our recent work on dynamic training and adaptation of neural networks. On the dynamic training front, I plan to discuss our work on (a) training sparse neural networks and (b) growing neural networks, both of which use gradients as the guiding signal to update architectures during training. I will conclude with our recent work on (c) reinforcement learning. Paper links: (a) https://arxiv.org/abs/2010.03533 (b) https://arxiv.org/abs/2201.05125 (c) https://arxiv.org/abs/2206.10369 Utku Evci is a researcher in the Google Brain team in Montreal and studies efficient training and adaptation of neural networks. He participated in the Google AI Residency Program during 2018-2020 after completing his M.Sc. degree in Computer Science at NYU Courant. Scholar: https://scholar.google.com/citations?user=8yGMMwcAAAAJ&hl=en Twitter: https://twitter.com/utkuevci This session is brought to you by the Cohere For AI Open Science Community - a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. Thank you to our Community Leads for organizing and hosting this event. If you’re interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker. Join the Cohere For AI Open Science Community to see a full list of upcoming events: https://tinyurl.com/C4AICommunityApp.


This video discusses the concept of sparsity in neural networks and its benefits, including improved performance and hardware friendliness. Utku Evci explores various techniques for introducing sparsity in neural networks, including dynamic training and sparse mixture of experts. The video also covers the challenges and limitations of sparse neural networks and provides insights into the current state of research in this area.

Key Takeaways

Train a sparse neural network from scratch, without first training a dense network
Prune a small fraction of connections in each layer after training for a while
Add new connections to the layer to maintain the same total parameter count
Calculate dense gradients every N steps to choose which connections to grow
Use top-k to keep only the largest gradients, so the full dense gradient does not have to be stored
Maximize the gradient in the incoming weights by optimizing the outgoing weights with a fixed norm
Calculate the top-k eigenvectors of a matrix to find the optimal initialization for outgoing weights
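
The prune-and-grow steps in the list above can be sketched for a single layer as follows. This is a minimal NumPy illustration under stated assumptions, not the RigL reference implementation: the function name and arguments are hypothetical, connections are dropped by smallest weight magnitude, new connections are chosen by largest dense-gradient magnitude among positions that were inactive before the drop, and newly grown weights start at zero.

```python
import numpy as np

def prune_and_grow(weights, mask, dense_grad, fraction):
    """One connectivity update for a single layer (hypothetical helper).

    weights, dense_grad: arrays of the same shape.
    mask: boolean array, True where a connection is active.
    fraction: share of active connections to replace (between 0 and 1).
    Returns (new_weights, new_mask) with the same number of active connections.
    """
    mask = mask.astype(bool)
    n_active = int(mask.sum())
    n_inactive = mask.size - n_active
    n_update = min(int(fraction * n_active), n_inactive)

    new_weights = np.where(mask, weights, 0.0).ravel()
    new_mask = mask.ravel().copy()
    if n_update == 0:
        return new_weights.reshape(weights.shape), new_mask.reshape(mask.shape)

    # Drop: the active connections with the smallest magnitude.
    magnitude = np.where(mask, np.abs(weights), np.inf).ravel()
    drop_idx = np.argpartition(magnitude, n_update - 1)[:n_update]

    # Grow: the inactive connections with the largest gradient magnitude.
    grad_score = np.where(mask, -np.inf, np.abs(dense_grad)).ravel()
    grow_idx = np.argpartition(-grad_score, n_update - 1)[:n_update]

    new_mask[drop_idx] = False
    new_mask[grow_idx] = True
    new_weights[drop_idx] = 0.0
    new_weights[grow_idx] = 0.0  # new connections start at zero
    return new_weights.reshape(weights.shape), new_mask.reshape(mask.shape)
```

In a training loop this would be called only every N steps, with `dense_grad` computed on the current batch, so the dense gradient is needed only occasionally rather than at every step.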

💡 Dynamic sparsity can be introduced through training, allowing for changing sparsity patterns, and can lead to improved performance and hardware friendliness.
