From SqueezeNet to SqueezeBERT: Developing Efficient Deep Neural Networks

Microsoft Research

Key Takeaways

Develops efficient deep neural networks using SqueezeNet and SqueezeBERT models

Full Transcript

[Music]

Good afternoon, everyone. First of all, thank you for attending this talk. My name is Sujeeth. I'm a machine learning researcher within Microsoft, and I'm mostly focused on resource-optimized machine learning, both computer vision and natural language processing. It turns out Forrest, who has been a friend of mine for over a decade, is an expert on both topics. Natural language processing is a very interesting topic from a product perspective, from Bing to Cortana to running things on device, and there's industry-wide interest in this topic. Forrest is here to talk about his journey from SqueezeNet to SqueezeBERT, and what we can learn specifically from utilizing aspects of convolutional models, such as grouped convolutions, to really make language processing much more efficient.

Here's a brief bio. Forrest is kind of part of the extended Microsoft family as well: he has been an intern at Microsoft before, and he has given a few talks within our own group as well as the broader Microsoft community. So let me quickly read his bio. Forrest Iandola completed his PhD in EECS at UC Berkeley, where his research focused on deep neural networks. He's best known for deep learning infrastructure such as FireCaffe and deep models such as SqueezeNet and SqueezeDet. His advances in deep learning led to the founding of DeepScale, which was acquired by Tesla in 2019. He's now an independent researcher focused on squeezing deep neural networks. He's well known for the Squeeze family of models: SqueezeNet, and now SqueezeBERT. Hopefully you've all read the paper; if not, let's take a quick look, and let's hope to learn a lot more about squeezing deep neural networks, both vision and natural language processing models, from Forrest. So, Forrest, it's all yours.

Thanks for the very thoughtful introduction, Sujeeth. In addition to thanking Sujeeth for bringing me in to present this to you, I'd like to thank Albert Shaw, Ravi Krishna, and Kurt Keutzer, who are great friends of mine, some of whom I've been working with for many years, for their contributions to this work.

What I'd like to do in this talk is begin by reviewing, in the last five years, what have been the key discoveries and breakthroughs that have enabled computer vision to become much more efficient. Then, given that natural language processing right now is in a renaissance era where things are improving very rapidly, at the cost of computation, however, I'd like to look at this: if we take the findings from computer vision research and how we make computer vision efficient, how might we bring that to bear on making natural language processing also more efficient? As a first step in that direction, I'll present some of our recent work, SqueezeBERT, and I'll also propose some future directions for this work.

So, starting with computer vision. As many of you will probably point out, there are probably hundreds of different computer vision tasks, many of which are quite useful and important. But if you look at the very core of the computer vision research community, three of the tasks that people spend an inordinate amount of time on are these. First, image classification: you look at an image like this on the left and you want to say something about the entire image, so you might say "sedan," or another reasonable answer would be "roadway." In the middle we have object detection, where we want to label the individual objects, decide what kind of objects they are based on categories, and also draw a rectangle or cuboid around them. Then on the right we have semantic segmentation, which is a very detail-oriented task where we want to classify every pixel based on what kind of object it's a member of. Here we just have a basic example of sedan and road, but we can get very advanced with this as well. In the next few slides I'm going to talk about some of the advances in image classification
and semantic segmentation in the last half decade or so, and look at the improvements both to accuracy and to efficiency.

So, image classification. If we look at where we were in 2015 or 2016 on the very widely studied ImageNet benchmark, ResNet (here I've shown ResNet-50 and ResNet-101) was in the 75 to 76 percent top-1 accuracy range. All the numbers I'll be talking about here today are single model, single evaluation, so no funny business with big ensembles or running the image through the network many times or anything like that, just basic networks. SqueezeNet was one of the first networks in the image classification arena where we really pushed on efficiency, and the FLOP reduction versus ResNet is large, although the accuracy is lower. We also improved FLOPs over AlexNet, which is at kind of a similar accuracy to SqueezeNet. But the other big win, which I'm not showing in this slide, was that SqueezeNet had far fewer parameters than most other networks.

So that was kind of the lay of the land five years ago, and lots has happened since then, leading up to today: this recent family of models called EfficientNet, and even more recently this paper from Facebook called FixEfficientNet, which further improves the quality of results from EfficientNet just a little bit. The way we're showing this is that FixEfficientNet is a family of many different models with a common architectural design, but that can be scaled larger or smaller. One interesting thing is that the accuracy with FixEfficientNet, especially out on the right here, has improved over ResNet in the last five years. The improvements, I think, are slowing down a bit year over year versus what they were a few years ago, but the improvements are still happening. But the other thing is, as someone interested in computational efficiency and making things run
fast, the other thing that I find exciting about this is that if you look at ResNet-101, this grey dot off to the right here, ResNet-101 and even ResNet-152 have been surpassed in accuracy by the very lowest-accuracy version of FixEfficientNet. And the jump from ResNet-101 to the lowest-accuracy FixEfficientNet is actually a 40-times savings in the number of computations, the number of FLOPs. I'll talk later about whether FLOPs is an ideal metric, but it's certainly a metric that people use, and so that's been a big improvement.

We've also seen similar kinds of improvements on semantic segmentation. Here we're looking at the dataset called Cityscapes, which is quite popular in the semantic segmentation community. Going through time here, on the right we have fully convolutional networks, FCN, which came out in 2015; actually, some of my friends at UC Berkeley did that work when I was a graduate student. DeepLabv3+ is another, I think, kind of watershed moment for advancing semantic segmentation accuracy. That was a big jump. Interestingly, the computation, on the horizontal axis, was improved with v3+ and the accuracy also went up. Then finally, in terms of very efficient low-cost models, all the way on the left here we have a family of networks called SqueezeNAS that I helped to develop with some of my colleagues. The worst-accuracy version of SqueezeNAS is still more accurate than the state of the art in 2015, and it has 160 times (not 160 percent, 160 times) fewer computations than the 2015 state-of-the-art network.

So we've seen basically both. If you freeze the accuracy that you're aiming for, if 2015 or 2016 accuracy was sufficient for your application, we can now do that in orders of magnitude less compute. And also, if you're willing to use an amount of compute that would have been reasonable
in 2015 and do that again today, you can get far more accuracy than you could a few years ago. So there's been lots of progress on this. For the ImageNet stuff I just talked about, that's all on a fixed dataset, no dataset changes at all. For this progress in semantic segmentation, the only thing that really changed in terms of datasets from 2015 to present is that people now often pre-train on the COCO dataset in addition to ImageNet, and that often buys you one to two points on this class IoU metric. But by and large, these improvements are not from varying the dataset so much as they're from developing superior neural networks and training them really well.

If we look, particularly from a neural architecture perspective, at what has enabled these improvements in recent years to both efficiency and accuracy, there are many things, but I've narrowed it down to three extremely important ingredients. The first one is what are called grouped convolutions. For those who haven't really thought about this before, let me give you kind of an intuitive explanation of what this is. What I'm showing here on the left is a weight matrix. I'm just doing a very simple neural network layer where the filter size is one, or one by one, and there are eight input channels and eight output channels. Normally this matrix would be dense, so for a traditional convolution, or with groups equals one, this would be a dense matrix. Let me just color some of these columns so that we can watch the progression as we adjust the approach here. So that's groups equals one. If we set groups equal to four, so grouped convolutions with groups equals four, this is the new pattern. All these empty cells are now just empty, and we have this nice horizontally block-sparse banded matrix. Given the neat sparsity pattern, which is not random sparsity but a pattern you set before you even start training
your neural network for the first time, typically you can also take advantage of this when you store your weights. You can take B here and organize it in the way we have in C and not waste any space. So this is the basic intuition behind grouped convolutions. At least to my knowledge, the first time I saw this idea was in the original AlexNet paper. People forgot about it for a few years for the most part, but it came back in papers like MobileNet, and since then this has been a common fixture in the efficient neural net community for computer vision. So that's grouped convolutions.

The second innovation I want to talk about, one that's gone from being a rarity to commonplace in the design of neural networks with efficiency in mind, is dilated convolutions. This animation, I think, actually explains it better than I could explain it verbally. On the left we have a normal three-by-three convolution. This is just showing one channel; in reality this would be this pattern but over many channels. In the normal three-by-three convolution, you see the filter is dense. On the right, the dilated three-by-three is basically a five-by-five convolution but with this pattern of weights removed, so it has the number of computations and number of weights that you'd see in a three-by-three, but it has the receptive field of a five-by-five. In many areas of computer vision, maybe most principally in semantic segmentation, dilated convolutions have helped significantly to improve both the speed and accuracy of these networks.

Then the third innovation I want to talk about is neural architecture search. This slide, I admit, is a bit busy, but let me step through it little by little. If we think about neural architecture search, one of the fundamental things we want to do is design a network where we have several ideas for what each layer of the network could be, but we want some sort
of automated way of choosing what those layers will be. So we first define a search space. In the top wide box on this slide, we have these network layers that I'm just calling module 1, 2, and 3. You might think that module 1 is a particular type of convolution, module 2 is another one, and module 3 is a third one. I think from about 2015 or 2016 through 2019, the way that a lot of people addressed this problem, of how we get algorithms to help us design networks, where we create a search space and the algorithm tells us what's a really good network in that search space, was reinforcement-learning-based architecture search. Quoc Le and others have been really big on this in the past. The basic approach is that you have a reinforcement learning controller that also has weights. You have the controller propose a family of neural networks, those get trained, the controller looks at the results on some sort of validation set, and depending on the results, the controller updates its own weights and then uses that to propose a next family of networks, and it iterates several times. This can be effective; I believe MobileNetV3 was actually developed using this approach. But it's also very costly. If you're starting with some sort of system where training a good network takes one GPU day, you can easily train a thousand networks using RL to find one really good one, and that can add up to thousands of GPU days, because you're doing all these independent network trainings. At Google that may be feasible, but for a lot of companies who have limited budgets, this is very expensive.

On the right we have supernetwork-based NAS. Recent work on this includes FBNet, DARTS, and our own work, SqueezeNAS, which I'll talk more about later. Some people call this differential or differentiable architecture search; I like to call
it supernetwork-based NAS. Instead of having lots and lots of independent networks, we create one supernetwork. At each stage of the network, we insert all the different modules that we'd like to be potentially used in the architecture search. Then there are different methods of selecting which module to use, but I would say a popular one is called the Gumbel-Softmax approach. In that, basically, at the beginning of training all the modules are trained, and it's just one big network. But gradually, through the use of parameters that select the best modules based on the backpropagation gradients, by the end of training you end up with one subnetwork in here that's deemed by the architecture parameters to be ideal.

You can do this supernetwork-based NAS if you want to purely optimize to get the best accuracy, or, what more and more people are doing these days, also optimize for efficiency. In addition to just backpropagating the loss based on your classification error, you can also take that loss and, for each module in your network that you're searching over, add the cost of that module to the loss. That cost could be based on the number of floating-point operations, it could be based on latency, it could be based on memory consumption, or whatever you deem to be important. In that manner, supernetwork-based NAS can be used not only to create very accurate networks but to trade off accuracy for efficiency in an automated way. And given that you're basically building one network and training all this together, relative to some baseline single network this often adds up to something on the order of two times to ten times the cost of just training an individual network.

So these are three innovations that I think have been very influential in terms of getting more efficient networks. So far, when we've talked about efficient networks, we've
talked about computations, or FLOPs, to represent how much a network costs. But, oh, I see a question. Chinenkrishna asks: for NAS, the most used metrics are FLOPs and latency; are there any other metrics you've found to be successful? Yeah, I have seen one or two papers recently optimizing for memory consumption. The most interesting one of these, and I forgot the name of the paper, is where you look at an embedded device like a microcontroller, and they have very hard limits on how much working memory you can use. So they were using NAS to basically constrain the maximum working set required in a network, so that you don't go beyond the memory capabilities of the hardware you're running on. But yeah, lots of interesting things with memory, I think, are also possible.

I'm also interested personally in this: given that, as I'll show in a moment, different processors exhibit different latency for a certain network design, you may want to have a network that's portable across processors. You could chain them together. You could say, "Well, I care a lot about this particular GPU running fast, but I also care a lot about this particular cell phone processor, so I'm going to add the GPU latency to the cell phone latency to the original loss and sort of jointly optimize for multiple systems." But there are probably many other things you could do.

That question plays right into what I'm going to talk about next, which is that optimizing for FLOPs is interesting, and on certain platforms, like mobile CPUs such as ARM CPUs, FLOPs do correlate fairly well with latency, but that's less true for some other devices. To give you an example of this, these two diagrams are from a paper that I really like and actually recommend reading, called EmBench, which came out last year
(citation's at the bottom), and these authors set up several published neural networks, ranging from MobileNet to VGG to others, and then they do inference with a batch size of one, which is a good setting for real-time inference, on a variety of different hardware devices. A couple of the ones that they do are, on the left, the Movidius compute stick and, on the right, the NVIDIA 2080 Ti GPU. The most interesting thing about this to me is that MobileNet is a couple of orders of magnitude fewer FLOPs than VGG, and on the Movidius compute stick you do indeed get a pretty good speedup when you go from VGG to MobileNet. It's not as much as the FLOP savings would indicate, but a five-times speedup is pretty good. But interestingly, on this NVIDIA GPU, VGG-16 is actually faster than MobileNetV2, despite the fact that VGG-16 costs a lot more in terms of FLOPs.

This motivates a couple of things, actually. One is that we really need to think carefully about our implementations of neural networks. I think there may be ways to implement MobileNet more efficiently on NVIDIA, but I haven't seen a lot of successful work on that yet. But the other thing is, I don't think we're ever going to be able to get away from the problem that different neural networks run at different speeds on different hardware, and those speeds don't necessarily correlate all that well with FLOPs.

So, sort of putting our money where our mouth is, we've been working on this problem in terms of optimizing latency using architecture search. These are some highlights from a paper called SqueezeNAS that we did last year. In this case we're optimizing for a small mobile GPU, and we're showing three lines on this graph. The green line at the bottom is what happens if we create a search space for a semantic segmentation network on Cityscapes, a search space that has grouped convolutions and
dilated convolutions in it, the two innovations we talked about before, and then we just purely search for the fewest FLOPs that we can get for a given accuracy. As you can see, the curve is okay, but it's actually worse than MobileNetV3 when you do a latency versus accuracy, or in this case IoU, comparison. But when, instead of searching over FLOPs, we search for accuracy with the cost in our supernetwork-based search being latency, on this orange line, we actually get out ahead of MobileNetV3. So optimizing for latency on the target device is really important, and in the natural language processing work I'll be showing in a bit, we're focused much more on latency than on reducing FLOPs.

So, to summarize what we've covered so far. In computer vision in the last five years or so, we've seen major reductions, sometimes even two orders of magnitude, in the compute costs in terms of FLOPs to achieve a result without changing the accuracy. Actual speedups, in my experience, on the right hardware, tend to be maybe more like 10x rather than 100x or 160x, but still, we've seen a lot of improvements there. Also, if you freeze the compute cost but optimize to increase the accuracy, we've seen double-digit improvements in accuracy on ImageNet, double-digit improvements on the intersection-over-union metric on segmentation, and the same pattern holds for many other subfields within computer vision research. And just to review some of the key ingredients (there are certainly more), the ones that I think have made a really big difference in enabling these results are grouped convolutions, dilated convolutions, and the development of, and many improvements to, neural architecture search algorithms.

So now I want to move on to part two, which is efficient neural networks for natural language processing. We'll start off just explaining, at least from my
perspective, why efficiency in deep neural networks for NLP matters. I'll give some background on some of the recent advances in self-attention networks for NLP. Some people call these transformer networks; I prefer the term self-attention networks. Then we'll go into what we did in SqueezeBERT to improve efficiency and compare it to some of the other results.

So first of all, why develop mobile NLP? I guess I'm talking here about mobile NLP, but I think some of the same principles should apply to very efficient cloud NLP. So why develop efficient NLP? Well, humans write hundreds of billions of messages every day just on Gmail and these few social clients, and probably more across other platforms as well. And if you look at where we're actually writing and also reading these messages, so much of it is happening on mobile. More than half of emails are written on a cell phone, or sorry, are read on a cell phone. And on Facebook, which is where a big chunk of these 300 billion messages are moving around through chat and through statuses, close to half of Facebook users only ever use Facebook from their phone. So I think natural language processing, especially with recent improvements to accuracy, has the potential to really help us be more productive in terms of more efficiently reading, understanding, prioritizing, and even writing messages. In a few slides I'll get to some concrete examples of different things you can do with NLP to facilitate this.

If you look very broadly at NLP, from my perspective there are two big areas (there are certainly others) that are just really widely studied right now and that have been disrupted by the invention of self-attention neural networks such as BERT or GPT. The first of the two areas I want to touch on is called natural language generation, or NLG. These are tasks like machine translation, like English to
Chinese; sentence completion, where I start writing something and it gets auto-completed; as well as generative question answering, where I ask a question and, through the same mechanism that the autocomplete works, the model proposes an answer. Some of the recent networks that have become famous for doing this are the original Transformer network from Google, published in a paper called "Attention Is All You Need" around three years ago, and more recent ones such as Transformer-XL, the GPT family, and the Microsoft Turing-NLG network. So that's language generation.

Then you also have language understanding. One language understanding task is something called extractive question answering, which is similar to generative question answering from a user perspective: you ask the computer a question, and it tells you an answer. But while generative question answering is just kind of doing an auto-complete based on whatever it's learned in training, in extractive question answering you give the network a question but you also give it a fact set. You could give it a pile of textbooks, you could give it recent current-events news, or whatever, and when you ask the question, the network will point to the most relevant passages from these source materials. So I think, in general, while extractive question answering is maybe slightly less sexy than something like GPT-2 or GPT-3, which just automatically generates answers from training data, the nice thing about extractive question answering is that you can continue to feed the network current events, such as recent news articles or recent publications on COVID-19 as it advances day by day, and extractive question answering can help you with that without retraining the whole model. So that's pretty cool.

And then the kind of bread-and-butter task in NLU is text classification. So basically,
you give it a sentence, or maybe even a book, and it'll tell you, based on some sort of categories, what it sees, similar to image classification in computer vision. Some of the popular models here are the original GPT, which focused quite a bit on NLU, BERT, ALBERT, and then also there are some things like RoBERTa and ELECTRA. Some people will say RoBERTa or ELECTRA are model architectures, but in reality RoBERTa takes the BERT architecture just off the shelf, and ELECTRA does very minor modifications to BERT, just in the first layer. The real innovations in RoBERTa and ELECTRA are in the training. In RoBERTa they add much larger amounts of training data, they also train longer, and they do a few other things. In ELECTRA they're doing a discriminator-generator approach, somewhat similar to or inspired by a GAN, or generative adversarial network, and that helps accuracy quite a bit. So anyway, for everything on this page, the state of the art on many of the benchmarks, at least, is now based on self-attention networks such as the ones I'm listing here.

One of the canonical self-attention networks is BERT, and there are a few different flavors of it, really large ones and smaller ones. One is called BERT-base. One thing I want to start out by looking at, under the heading of efficiency, is: how fast does BERT-base run on a smartphone? If we want, as we type things and as we receive emails, to get the results of BERT inference immediately at our fingertips, how fast can that really happen? Right now, I took BERT-base in PyTorch, exported it to a self-contained PyTorch representation called TorchScript, and put that on the Google Pixel 3 smartphone, and it took 1.7 seconds to do a length-128 sequence. That is, you take a sentence and you tokenize it, so you convert the words into integer values based on the
vocabulary, and if you have 128 tokens, it takes 1.7 seconds. You might ask, is PyTorch the best, or are others faster? So I looked up some results doing the same experiment on the same phone with TensorFlow, and what I saw was that BERT-base takes about 1.5 seconds there, so very similar. This seems to be the reality of what we're faced with when running neural networks for NLP on a smartphone: it's pretty slow.

So now let me talk about how a BERT module works, then I'll get to where the time is going, and then we'll work on optimizing the module design. There are many diagrams out there showing the BERT module. What I decided to show here is a breakdown of all the layers and how they feed into each other, as well as the sizes of the inputs and outputs. This BERT module has these three fully connected layers called Q, K, and V. In the original "Attention Is All You Need" paper, Q, K, and V stood for query, key, and value, and there are some things with passing results from early layers into later layers as one of the Q, K, or V's. But nowadays I don't know if query, key, and value really is the best analogy anymore. I just look at these as three fully connected layers that all just process the input data.

Where the attention actually happens in this module is in these two matrix multiplications. You take the Q and K layer outputs, you do a bit of transposing and reshaping, and then you multiply them together. The reason this is attention is that, in most neural networks that I've seen, the thing that a fully connected layer or convolution layer does is multiply weights by data, right? Attention multiplies data by data. It takes the output of two layers and actually multiplies those together. Doing that has many interesting properties, such as, based on the input data, you can re-rank which are the most important channels to be looking at
for instance. There are also ways that you can think about attention as connecting different elements of a sequence. So this is the self-attention piece, and it uses this: Q times K-transpose, divided by a constant, softmax that, and multiply it by the V tensor. So that's the attention. The full BERT module also has these three fully connected layers at the end, often called feed-forward network layers, or FFNs. So we have three of these at the end, and that's a full BERT module. A BERT-base network first has an embedding layer, which takes the tokenized sentence and outputs a vector for each potential token, and then the whole rest of the network, until the final classification, basically is just 12 of these modules that you see on the screen here, stacked up one after another.

If we look at where the time is going in this, particularly on the Pixel smartphone that we're using, what I found was that 88 percent of the overall latency of this network is from the fully connected layers. The attention matmuls that we talked about are actually pretty cheap, both in terms of FLOPs and in terms of latency, but these fully connected layers are quite expensive. Given my experience working in computer vision for some years, when I see a really expensive layer, one of the things I think about is, "Well, could I make it into a grouped convolution?" I'll get to that in a minute, but basically the first thing is we have to figure out whether there is a way to rephrase these fully connected layers (sometimes people call them position-wise fully connected layers) as convolutions, so that we can have the full range of convolutional capabilities at our disposal. By the way, I see there's a backlog of questions; I will catch up with those in just a minute.

The two equations on this slide, one for fully connected and one for convolution, are very, very similar, and
in fact the only difference is this k factor here on the bottom right, which is to account for the kernel size and having to reach out from your immediate point on the tensor to encapsulate your whole kernel size. The thing is, if your kernel size is one, then these two equations actually are equivalent, because that k just becomes zero. We go into this more in the SqueezeBERT paper, but this basically is our proof that what we've been doing all this time in BERT actually is a form of convolution. And so the new model that we're proposing, which we call SqueezeBERT, is very, very similar to BERT. We've changed very little about the model relative to BERT. We've got the same embedding layer, we've got the same final classifier, we've got the same training mechanism. What we've changed is, out of the six fully connected layers in the BERT module, for five of them, for Q, K, and V and then for two out of the three FFNs, we have set them to have groups equals four instead of groups equals one. This, as we will see in a moment, can save us quite a bit of computation.

But before I go into the evaluation, let me take a look at the questions and catch up on that. One question a while ago was: should we measure FLOPs or energy? FLOPs seems like an incomplete metric. Yeah, energy is a really good one too. It's actually hard to measure the energy on a smartphone because it has a battery, and you need to kind of take it apart to get to the individual leads that you'd want to measure, but that's something worth doing for sure. And then someone said: I don't see many NAS papers nowadays; has that reached some plateau? I don't think so. I think people are branching into new applications. And there is this paper called FBNetV2 where they go even further than I've seen before in terms of weight sharing and
amortizing computation across different layers in your search. There are more questions; I will address the rest at the end, but I just didn't want to keep you waiting.

Okay, that's the SqueezeBERT module. Now let's talk about how well that really works. We're going to evaluate SqueezeBERT relative to BERT and another network or two on something called GLUE, which stands for General Language Understanding Evaluation. This is a natural language understanding benchmark set. It's primarily focused on text classification tasks. Just to give you a brief summary of the tasks in here (you can certainly check out the GLUE paper and find more): there's one task on sentiment analysis, which I think could be really interesting for flagging unhappy customer complaints or reviews or help desk messages and trying to bring them to someone's attention to address. There are several similar but not quite identical tasks on taking a pair of sentences as input and classifying whether they have similar meaning, and there are six tasks like this in the GLUE evaluation. An NLP expert will tell you that each of these tasks is slightly different. For instance, some of them are just saying, given two sentences, do they mean the same thing; others are saying, given sentence one, does it imply that sentence two is true. So there are subtle differences, which is why we have so many benchmarks on this, but sentence pair matching is the basic idea. Then there's one that checks, given a question-answer pair, does the answer, at least from a grammatical perspective, answer the question. That would be pretty useful, I think, if you're doing something like a Stack Exchange or an issue tracker or a help desk where you've got this backlog of email
threads or forum discussions, and you can ask which ones you don't need to worry about anymore because they seem to have been addressed. And finally there's one task called CoLA, where you give it a sentence or sequence and the model is supposed to tell you whether the sequence is grammatically correct. That could be useful going beyond the current capabilities in grammar check and spell check as you write text. So anyway, that's what's in GLUE, and now let me show some results.

First let me caveat this by saying that in the paper we read every conceivable paper we could find on efficient neural networks for GLUE, cited them, and put a big results table in there with TinyBERT and DistilBERT and all these things. But for this I just wanted to keep it simple and show BERT-base; MobileBERT, which I think is the best of the related work that I'm aware of for this problem around this level of accuracy; and then finally SqueezeBERT. What's interesting here is, and these latencies are all measured by me, but the MobileBERT latency I measured actually is slightly faster than the latency that the authors reported, so I took the faster number. So anyway, what we find is MobileBERT has a bit fewer FLOPs than SqueezeBERT, but SqueezeBERT actually is more efficient in terms of latency, in terms of real speedup. This is interesting: it tells you yet again that not all FLOPs are created equal. Depending on the device and depending on the computation, the cost per FLOP may be higher or lower in terms of latency. But basically these are all around the same GLUE score. We're doing some work to potentially further improve the results of SqueezeBERT, but I would say that GLUE results in the last couple of years, or maybe 18 months since BERT, have continued to improve. They're now in the high 80s, low 90s, so plus or minus a tenth of a percentage point here is not that interesting. What would be much more interesting would be to take not just SqueezeBERT but group convolutions and other bright ideas from computer vision and other fields and begin to apply them more broadly to NLP.

So to summarize all this: I think computer vision research has made a lot of progress recently, with lots of gains in efficiency and accuracy. Self-attention networks, otherwise known as transformer networks, have been a big breakthrough for NLP accuracy, but at the cost of more computation required. SqueezeBERT showed that group convolutions, which are widely used in efficient computer vision, can accelerate the recent network designs for NLP. One thing I'm very interested to do in the future is to unify more of this. I already have an initial implementation of neural architecture search for NLP, and I'm experimenting with that now. I'd like to get to a point where, given almost any platform, which could be different types of instances in the cloud, could be smartphones, could be TinyML devices, or what have you, we can do something that's fairly automated, where given a dataset and given the hardware platform we can develop something that's ideal. I've mostly talked about neural net design, but there are also other complementary things like sparsification and quantization that I'd like to roll into this as well. So okay, that's the end of the talk, and I see I have lots of questions. Do you want to help me pick questions, or should I just go through them?

Yep, absolutely. Thank you, Forrest, for really such an engaging talk and for walking all of us through the journey of adopting vision modules as well, so convolutional models. I think this space is a lot larger than maybe swapping out fully connected layers with convolutions, right? There's a whole domain of computer
vision research that we can quickly adopt and just try out different things, as you've indicated in your future work. So thanks again for the engaging talk. In the Q&A I'm mostly filtering based on the likes we've received, and I think let's start with Stephen Young's question, because that was the last remark you made. What he asks is: what about the other techniques, such as quantized networks (you touched on that), network pruning like the lottery ticket hypothesis paper, distillation (does pruning give worse models?), and network architecture search, and so on? Maybe you can go into a bit more detail on this.

Good, I'm just trying to pull up the question myself so I can make sure I don't forget anything about it. Oh yeah, I see it. Okay, great. There's a lot to unpack here. It sounds like hopefully you're already doing some of this, because it sounds like you've really thought about this, which is great. So let's see where to start. The first thing he brings up is quantization. For training we actually used 16-bit, for pre-training in particular. And on mobile devices, I'm very interested to integrate QNNPACK or something similar, or sorry, XNNPACK or something similar, into the code base I'm working with for mobile. Just in a CVPR paper this year, XNNPACK was shown to provide superior results in sparse networks. Sparsification has been something where it's been difficult to get speedups even on mobile, much less on GPUs, and I think people are getting more creative with that, so that's great. In terms of integer, there's this paper called Q-BERT, which came from some good friends of mine at UC Berkeley. In Q-BERT I think they do a few things right. Q-BERT stands for quantized BERT, and I think so much of the quantization
community has focused on the problem of purely how small can I make the network to store it on disk, which has value. I think if you're transmitting an app update or something, that's useful, and so forth. But Q-BERT, and there's other work called HAWQ from the same people, they are extremely focused on speed, and so they're getting significant speedups in their published work, even better in unpublished work that I believe is coming soon, for vision but also for BERT and transformer networks in NLP. So I'm very interested to integrate that, and they seem to be on the cusp of really demonstrating the full stack all the way down to 8-bit or 4-bit implementations that don't do the questionable thing of accumulating in 32-bit or something, which on many hardware platforms slows things down. So anyway, that's a bit of a ramble, but I think there are lots of opportunities in quantization and pruning or sparsification that I'm very interested to adopt myself and maybe improve myself.

And then you mentioned the lottery ticket hypothesis paper. My interpretation of the lottery ticket hypothesis is that as you train, you try to sparsify, right? And there are sophisticated things with rolling back a few iterations and different things like that, but basically it's a sparsification technique that endeavors to get superior accuracy versus model size trade-offs. But I think to fully exploit the improvements in sparsity, whether or not it's based on the lottery ticket hypothesis or other approaches, we need to continue to improve implementations of sparse neural networks. I'm probably going on too long about this, but I want to make one last comment, which is, Stephen, you also talk about network pruning and neural architecture search and model design. They're all different sides of the same
problem, where you could imagine you start with an infinitely large network and then figure out, well, get rid of most of it, what should be left? I've found that I've never seen better results than when I first started out with a highly efficient network, which may or may not be designed from architecture search, and then tried to prune it and quantize it. If you try to take a really large network and just prune it or quantize it without thinking about better network design, you're usually leaving a lot on the table, in my experience. All right, so do you want to go to the next question?

Yep, awesome, thank you, Forrest. So here's an interesting question, and this is a question I had as well: how do you characterize training efficiency versus inference efficiency? The specific question by guru ea is: does the same speedup number, 4x faster on evaluation, also hold for training? I imagine not, especially if you're training with GPUs using group convolutions, but this is a question that I have as well.

Yeah, I'm so glad you asked. I should have explained this better in the talk. Your intuition is correct: GPUs, at least with the current implementations and the current GPU designs, really struggle with speedups from group convolutions in vision, and they also struggle with group convolution speedups in NLP. I think we might have seen a very small speedup in training going to groups, five or ten percent or something, but not 4x. If you were training on a mobile device, which actually could be something worth doing if you look at the interest in self-supervised learning and privacy-preserving neural networks, I think there you would see this kind of speedup. But for GPUs we may need some different approaches if we want to see a big speedup there. Encouragingly, I've been looking
into Graphcore, a competitor's hardware to NVIDIA, and Graphcore themselves have reported that for a regular ResNet or BERT network they have a similar compute versus energy cost, but for groups, Graphcore is getting big speedups and big energy savings where NVIDIA is not. So I think there definitely are novel things emerging that could make even the approach that we've shown today much more efficient in training.

Awesome, thank you. Related to that, I think, is another interesting topic, and again it's open-ended. It's a question by Casey Tung, which is: do you think non-gradient-descent optimization techniques have some potential?

I don't know. I mean, it depends how far you want to zoom out, right? In terms of the immediate applications, let's try to classify some sentences or classify some pictures or generate some media like that, I haven't yet seen anything that's just totally blown fancy versions of gradient descent out of the water. I think a lot of the work that's described as second-order methods still is heavily influenced by just traditional gradient descent, with some Hessians computed and things. It's good work and that's helped a bit, but nothing that requires you to start over in your optimizer design. So basically, thinking myopically for the next couple of years, we're taking the tools we have and trying to rearrange them. By the way, if you have good ideas for other optimizers, I'm super interested, but I'm not personally aware of interesting directions there. However, on the more grand quest towards how we would create general intelligence, or how we really get some common sense baked into machine learning models, these
kinds of questions, I do think some entirely different thing that may not have training data in the traditional sense or stochastic gradient descent in the traditional sense may be necessary. I just don't know what that thing is going to be.

Yup, awesome, that sounds great. There's an interesting question by Elia Zarkov on efforts replacing soft attention in transformers with hard attention for speedup, and what are your thoughts on that?

So I actually don't know the definition of hard and soft attention. Maybe someone could enlighten me.

I'm not quite familiar either, but my assumption is some notion of binarization of some sort, or a gated attention as opposed to the softmax, or something along those lines.

Oh, I see. I guess I'm not super familiar with this line of inquiry, but my blanket answer would be: I don't think there's any reason to believe the current attention layers people are using are optimal. I think there probably will be big changes in the next few years based on some breakthroughs. To the question that was asked before about whether neural architecture search is hitting a plateau, I think the problem actually is more that computer vision isn't advancing as quickly as it was a few years ago, and so any activity in mainstream computer vision tasks, including doing architecture search, doesn't generate as big of improvements as it used to. But NLP feels so much like computer vision five or even eight or nine years ago, when the approaches were so fresh that you could try anything and there was some decent chance that you'd stumble on some breakthrough or some improvement. NLP feels the same way. It's like, generate a list of 10 ideas of what to try, and try all of them, and I bet two or three of them actually will produce surprising results,
maybe better accuracy or maybe some other dimension of breakthrough. So yeah, definitely play around with attention. There are probably better things that we could do than the current approach.

Cool, thank you for that, Forrest. There is one question specific to the talk, so I did want to address that, which is on SqueezeBERT. I believe the question is: why leave FFN1 with g equals one? If you actually just go back to your slides, what's the reason for the feed-forward at the first layer?

Glad you asked. These feed-forward network layers, if you look very carefully, you'll see that for the first feed-forward network layer the number of input and output channels is C, which in the case of BERT-base or SqueezeBERT is 768. The second feed-forward network layer has 768 input channels but four times the number of output channels, so 3072 output channels, and so FFN2 costs four times as much to compute in terms of FLOPs as FFN1. And then finally FFN3 has to ingest 3072, four times C, input channels, and then outputs just the regular 768.
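The channel counts just quoted can be turned into a rough per-token cost estimate. The sketch below is only an illustration, not code from the talk or the paper: the helper name `macs_per_token` is hypothetical, biases are ignored, and cost is counted as multiply-accumulates for one token through a kernel-size-1 layer. It assumes FFN1 keeps groups equal to one while FFN2 and FFN3 use groups equal to four, as the question implies.

```python
# Rough per-token cost of the three feed-forward layers in one module.
# Assumptions (not from the talk): biases are ignored, and "cost" means
# multiply-accumulates (MACs) for a single token through a kernel-size-1 layer.

def macs_per_token(c_in, c_out, groups=1):
    """MACs for one token: each group maps c_in/groups to c_out/groups channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * (c_out // groups) * groups

C = 768  # channel count quoted in the talk for BERT-base and SqueezeBERT

# (input channels, output channels) for each feed-forward layer
ffn_shapes = {"FFN1": (C, C), "FFN2": (C, 4 * C), "FFN3": (4 * C, C)}

# BERT uses groups=1 everywhere; the grouped variant keeps FFN1 at groups=1
bert_groups = {"FFN1": 1, "FFN2": 1, "FFN3": 1}
squeeze_groups = {"FFN1": 1, "FFN2": 4, "FFN3": 4}

def ffn_costs(groups_by_layer):
    """Return a dict of per-token MACs for each feed-forward layer."""
    return {
        name: macs_per_token(c_in, c_out, groups_by_layer[name])
        for name, (c_in, c_out) in ffn_shapes.items()
    }

bert_costs = ffn_costs(bert_groups)
squeeze_costs = ffn_costs(squeeze_groups)

for name in ffn_shapes:
    print(name, bert_costs[name], squeeze_costs[name])
```

With groups of four on FFN2 and FFN3 only, all three layers come out at the same per-token cost, which is the balancing argument made in the answer.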
So basically FFN2 and FFN3, before SqueezeBERT came along, each cost four times as much as FFN1. So we're balancing it out, where we say, okay, now at least in terms of FLOPs these three FFN layers all cost the same. And then also, why have groups equals one anywhere in the network? We want to ensure that the network has adequate ability to communicate across all the channels, because otherwise you can have something that starts to act more like four independent networks. An alternative to having groups equals one: well, we could stick with groups equals four, but we could do a shuffle operation like was proposed in ShuffleNet to reorder the channels so that the channels mix nicely. So that would be an alternative.

Awesome. So let's end this with another open-ended question, which is basically: how much of the work on efficiency is motivated by climate change and the increasing amount of energy that's necessary to power advanced and complex models? And just another note is that Forrest has agreed to stay a bit longer as well, so we can continue with the questions; keep them coming for another 10 minutes or so. But for those who do have a hard stop at three, I just thought we should end with this question.

Excellent. So how important is climate change in all this? I think very important. The numbers I have seen indicate that between two and three percent of greenhouse gas emissions in the United States already are from data centers, and that's growing. And that's just data centers. Of course there are lots of other costs to putting up lots of hardware, including the energy and the water required to make computer chips, and you can look at very large environmental impacts. So I absolutely think that every aspect of deep learning is something where, especially
when the computational costs start to get out of hand as we've seen (I thought the original BERT was already pretty expensive, but GPT-3 is just really, really expensive), we have to figure it out. If we want these capabilities at our disposal, both from a cost perspective and from an environmental responsibility perspective, we can't continue to do it by boiling the ocean in terms of our compute resources. So that's a big deal to me.

Yeah, that makes a lot of sense to me. Another question that came up, again with respect to more of the nuances, is: are you familiar with Song Han's work on hardware-aware transformers, and if so, how does this compare to SqueezeBERT?

Great question. Song Han is a good friend of mine. He was actually one of the co-authors of the original SqueezeNet paper, so we've known each other for a long time. In the SqueezeBERT paper we do touch on Song Han's work. I believe it's called Lite Transformer, I want to say. If I remember correctly, the differences between that and SqueezeBERT are, first of all, it's doing language generation rather than language understanding, at least the paper I'm thinking of; I think it was published in ICLR last year. So it's more focused, I think, on machine translation or language modeling, which basically means automatically generating the next token or the next sequence or some combination of those. So, different application. The other thing is, if I remember correctly, they used convolutions, but they basically said, we're going to have attention layers and we're going to have convolution layers, and those will just be separate things in different parts of the network. In our case the attention networks now are actually convolutions, so it's a bit different. But I was certainly inspired by that work to
work on this.

Awesome. Along the same lines, here's a question from Renmoy, which is: for memory or computations, do we penalize individual modules separately? If yes, how is that channelized to individual modules using gradient descent?

Sorry, I didn't entirely follow.

I think the question is, my understanding of it is, how do you incorporate the cost? It's basically, what's the cost function, and how do you incorporate certain costs into the loss function based on what the module is? That's my interpretation of this.

Oh, great question. I should do a tutorial on that sometime. I have not seen a good step-by-step "here's how you do supernetwork-based search starting from scratch," so that's a really good question. Hopefully I can at least answer it a bit for this audience, or off the cuff explain a little bit. Basically, you can imagine each layer during backpropagation has a cost, right? In PyTorch you propagate each set of gradients from each layer to the previous one and so forth, but you also sum up the cost across all the layers. The cost, I think, is basically the sum of all the gradients and layers, and that indicates to the auto-differentiation method that these costs all should be taken into account in backprop. So basically you write this sum of the costs, and then that tells the autodiff what you want to do. What we do is, when summing up the costs for each layer where we're doing multiple choices, we also add in the cost of the FLOPs or the memory or, in our case, the latency, and then we also multiply the cost of each layer by a hyperparameter, which you could just call the cost multiplier. What you really want is, if your typical loss is 0.1, you might want that cost to be a couple of orders of magnitude less, like
0.001 or something like that. So you want to choose a cost so that even towards the end of training, when the loss is quite small, the additional cost that you've added is not dominating the gradient, because if it does, then it can drown out the actual training that's required. Does that make sense?

I guess we won't hear back, but it does to me, so thank you for the explanation. Let's wrap up with another open-ended question, and then, Forrest, feel free to look through the questions as well if there's anything that I missed or if there's something that you think requires priority. One of the questions that I saw is: what models are suitable for privacy-preserving applications? It's an open-ended question, and I'm curious to hear your thoughts as well.

Yeah, I don't want to pretend to know more than I do about this, but my intuition is that the overall system architecture of privacy-preserving business applications and consumer applications is still being figured out. In terms of privacy preserving, the most extreme case would be that no form of data whatsoever comes back from the user device to the company that created the application, right? In that case I think your only option really is on-device training. That limits the scope to things where you can either convince the user to label stuff or, based on the user's future actions, you can find out very quickly which of the potential things that the model could have output is actually the correct one. So you look at autocomplete: on-device training for an autocomplete system is actually quite feasible, because the user eventually types what they mean in most cases, and then you can say that was the correct answer. For those sorts of things you basically want all the same things that you would aim for in an efficient
model that can be run efficiently for inference on the device. It's got to be very, very computationally efficient; it's the same deal there. Then there's also things like federated learning, at least one instance of which is where the engineers who develop the application, the company that is making the app, get some data back from the user, but it's abstracted in some way. For instance, you might get gradients back from the user to train on but not the raw data. Again, that has the issue that the user will have had to provide the labels in some fashion, which narrows the applicability a bit. I think there are a lot of other things beyond that that can be done with training in the wild, but hopefully that's a start.

That sounds great. So again, thank you everybody.

I have one more I wanted to answer.

All right, sorry.

Elia Zarkov got back to us about his meaning of replacing soft attention with hard attention. I think, intentionally or unintentionally, it appears to be a private message, so I don't know if you've seen this, but he said what he meant was, what would it be like to replace the softmax in attention with an actual argmax? I think basically argmax can have some difficult properties with regard to it being differentiable, but if you can resolve that, then argmax probably would work pretty well.

Yep, and I guess there are things like smooth max and so on, which are not too different from a softmax, so it's interesting how you can try to have a differentiable component. I also believe he might mean it on the inference side of things. I don't know if you have the time to clarify, but basically, at inference can we just swap it out with just an argmax? That, I believe, might be the question as well. So you can train with the differentiable
equivalent, but then maybe at inference you swap it out with just the argmax.

It could work. If you try it, I'm super interested to hear what the results are.

Yeah, awesome. Well, thank you, Forrest. Thank you for your time, and thank you for such an elaborate and engaging talk, really. The future work especially opens up an entire research domain, in my opinion, where we can borrow techniques not just from computer vision into NLP, right? There are also recent trends where techniques from natural language processing, such as attention mechanisms and the transformer architecture, are being borrowed into computer vision. So hopefully there's some sweet spot where we bring in efficiency and hardware and maybe the economics of it all at the end of the day. Normally we would applaud in a lecture room, but thanks again; this is a virtual applause for such an elegant talk, in my opinion. I'd also like to thank Stephen and Henry, who are production specialists; this Teams live event wouldn't have been possible without their support and commitment. These are interesting times with COVID-19, and Teams Live and Microsoft as a whole have made these kinds of talks happen. I saw that there were hundreds of people who attended this talk, and so in some sense the talks continue, and they're accessible maybe more than ever. So thanks again, Forrest, Stephen, and Henry. Thank you, and thanks for your attention.
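The swap raised in the final question, weighting values with a softmax versus picking a single position with an argmax, can be sketched for one query as follows. This is a toy illustration under stated assumptions: the function names `softmax`, `hard_max`, and `attend` and the example vectors are made up for this sketch, the scaling constant is taken to be the square root of the query length, and nothing here says whether the swap preserves accuracy in a trained model.

```python
import math

# Toy single-query attention in plain Python. All names and numbers here are
# illustrative assumptions, not taken from the talk or any library.

def softmax(scores):
    """Soft attention weights: positive values that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_max(scores):
    """Hard attention weights: one-hot on the highest score (first on ties)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def attend(query, keys, values, weight_fn):
    """Score each key against the query, turn scores into weights, mix values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = weight_fn(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

soft_out = attend(query, keys, values, softmax)   # blend of all three values
hard_out = attend(query, keys, values, hard_max)  # exactly the best-matching value

print("soft:", soft_out)
print("hard:", hard_out)
```

The hard version returns the value of the single best-matching key, while the soft version returns a blend; the one-hot step has no useful gradient, which is the differentiability concern mentioned in the answer.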

Original Description

Deep neural networks have been trained to interpret images and text at increasingly high levels of accuracy. In many cases, these accuracy improvements are the result of developing increasingly large and computationally-intensive neural network models. These models tend to incur high latency during inference, especially when deployed on smartphones and edge-devices. In this talk, we present two lines of work that focus on mitigating the high cost of neural network inference on edge-devices. First, we review the last four years of progress in the computer vision (CV) community towards developing efficient neural networks for edge-devices, ranging from early work such as SqueezeNet, to recent work leveraging neural architecture search. Second, we present SqueezeBERT, a mobile-optimized neural network design for natural language processing (NLP) that draws on ideas from efficient CV network design. SqueezeBERT achieves a 4.3x speedup over BERT-base on a Pixel 3 smartphone. Finally, we believe that SqueezeBERT is just the beginning of several years of fruitful research in the NLP community to develop efficient neural architectures. See more at https://www.microsoft.com/en-us/research/video/from-squeezenet-to-squeezebert-developing-efficient-deep-neural-networks/
