Deep Learning: Modular in Theory, Inflexible in Practice with Diogo Almeida - #8

The TWIML AI Podcast with Sam Charrington

Key Takeaways

Diogo Almeida discusses why deep learning is modular in theory but inflexible in practice, and why the data, software, optimization, and understanding issues around it matter. He also describes how he won the Kaggle Cause-Effect Pairs Challenge, using roughly 50,000 automatically generated features fed into a boosted decision tree, and how last-minute ensembling by other competitors nearly overtook him.
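
The feature idea described in the interview is that if X causes Y, then Y should look like a noisy function of X, so "how well can one variable be fit from the other" becomes a feature in each direction. The following is a minimal sketch of one such feature pair, not Diogo's actual pipeline: the function names are hypothetical, and a simple binned-mean fit stands in for the many machine learning algorithms and fit metrics he says he looped over.

```python
import random
from statistics import mean


def binned_fit_score(inputs, targets, n_bins=10):
    """Share of target variance explained by per-bin means of the input.

    Stand-in for "fit any curve-fitting model and measure how well it fits".
    """
    lo, hi = min(inputs), max(inputs)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant input
    bins = {}
    for x, t in zip(inputs, targets):
        idx = min(int((x - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(t)
    # Squared error left over after predicting each target by its bin mean
    residual = 0.0
    for values in bins.values():
        centre = mean(values)
        residual += sum((v - centre) ** 2 for v in values)
    overall = mean(targets)
    total = sum((t - overall) ** 2 for t in targets)
    return 1.0 - residual / total if total > 0 else 0.0


def asymmetry_features(xs, ys):
    """Fit quality in both directions plus their difference."""
    forward = binned_fit_score(xs, ys)   # y as a function of x
    backward = binned_fit_score(ys, xs)  # x as a function of y
    return {"x_to_y": forward, "y_to_x": backward, "difference": forward - backward}


# Synthetic pair where y really is a noisy function of x (assumed toy data)
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]
features = asymmetry_features(xs, ys)
print(features)
```

On this toy pair the forward fit is much better than the backward one, because y = x² cannot be inverted to recover the sign of x. In the approach described in the transcript, many such scores would be stacked as features and a boosted decision tree would learn how to combine them.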

Full Transcript

[Music] Hello and welcome to another episode of TWiML Talk, the podcast where I interview interesting people doing interesting things in machine learning and artificial intelligence. I'm your host, Sam Charrington. The recording you're about to hear is part of a series of interviews I recorded live from the O'Reilly AI and Strata conferences in New York City last month. I'll be sharing these interviews on the podcast over the next several weeks, and I'm sure you'll enjoy them. This time I interview Diogo Almeida, senior data scientist at healthcare startup Enlitic. Diogo and I met at the AI conference, where he delivered a great presentation on in-the-trenches deep learning titled "Deep Learning: Modular in Theory, Inflexible in Practice." Diogo and I discussed the ideas he presented, which are centered on the data, software, optimization, and understanding issues surrounding deep learning. Diogo is also a past first-place Kaggle competition winner, and we spent some time discussing the competition he competed in and the approach he took to win it. Before we jump in, a bit of a listener warning: our conversation gets pretty technical pretty quickly. I do try to make sure to summarize key points from time to time, and I really think that if you hang in there you'll learn a ton. Of course, let me know how you like this level of detail. I'll be including links to Diogo and a bunch of the data sets and other things that we discuss in the show notes, which you can find at twimlai.com/talk/8. Also, as is the case with my other field recordings, there's unfortunately a bit of unavoidable background noise. Sorry for that. And now, on to the show.

All right, hey everyone. I'm here at the O'Reilly AI conference and I'm sitting with Diogo Almeida, who just did a really interesting talk on deep learning, and he was kind enough to sit down with us and talk a little bit about what he talked about. Diogo, why don't you introduce yourself?

Cool. I'm Diogo Almeida. I work at this super cool medical deep
learning startup where we work on giving really accurate, really fast, really safe medical diagnosis, and this is something we hope will completely change the world. Before that, in a past life, I was a mathlete: I broke a 13-year losing streak for the Philippines in the International Math Olympiad, was on the top team in the world at the interdisciplinary competition in modeling, and there's a website for machine learning competitions called Kaggle that I won first place on in one competition as well.

When was that?

This was in 2013. It was the Cause-Effect Pairs Challenge.

Tell us about that.

Oh, it was a very weird challenge. In most machine learning you have tabular data, right? You have columns of features and rows of observations. In this problem your data was pairs of sequences. So one observation is like altitude and height, and you have a sequence of pairs of which altitudes correspond to which heights, in some unordered manner, and the idea was, given this, you're supposed to predict whether altitude causes height or height... sorry, altitude and height are the same thing, I meant altitude and temperature. So you were supposed to predict if altitude causes temperature or temperature causes altitude. It's obvious to us that altitude causes temperature, right? But there are a lot of very complicated tasks that we don't know the answer to. The basic task is, if you know the saying "correlation doesn't imply causation," it's supposed to do the opposite of that: you're supposed to figure out how the correlation implies causation. That's extremely useful, because you have lots and lots of observational data and it's very hard to have a controlled study. So the more accurate a view of the world we can get from purely observational data, the more we can either have informed priors before running the controlled study or
figure out how to order the controlled study in an appropriate way.

Okay. And is this also the kind of analysis you would use for a root cause analysis or something, like in an IoT use case where you've got all these observations and you're trying to figure out what the underlying condition is?

I'm not as familiar with that. There was traditional statistical work, and there actually was a background for this topic, but I didn't pay much attention to that because I kind of went my own way. It was much more for fun than for winning, and winning was a very nice side effect. I went a much more software-oriented way: just build a really complicated, powerful model and have it solve this. Rather than hand-engineering stuff, why not automatically engineer a lot of informative variables and then solve it with that?

Okay, so can you walk us through the process? How did you formulate a methodology for attacking it? Was this your first Kaggle competition, or had you been doing...

My first serious one. I'd done one or two before that I didn't really spend much time on. You know, you quit after two days because it turns out your teammates weren't useful or something like that. So I had played with it before, but I'd never really gone all out until this one. So, my methodology. Some background is that there are statistical tests that people used to use on this task, and to put it roughly in perspective, these got 0.6-ish AUC. So if you see a paper in Nature or Science about a new test for causality, it probably gets around 0.6-ish AUC.

AUC, for those that don't know, is area under the curve, and that's a performance metric.

Yeah, so we were solving a ranking problem: we were trying to rank the outputs given that we know which ones
caused each other. It's a little bit of a complicated metric because we actually had three output classes, so we did a bidirectional AUC, but that doesn't really matter much. So these tests had about 0.6 AUC. They're roughly a single feature each, because it's just the prediction you extract directly from the data. Most of the other competitors in the top 10 had tens of features or something like that, and the second placer, I think, had a whopping 100-something features. And I had 50,000.

Okay.

So what I did was I found a very simple way of determining causality. The rationale would be: if X causes Y, then Y is a function of X, with noise in there somewhere. So roughly, you can tell how good one is as a function of the other based on how well they can be approximated by functions. This is a very vague recipe for how to create these features, but the idea is, rather than hard-coding statistical tests (add a Gaussian, integrate this thing out, whatever), I figured that we have an entire field of curve fitting, which is called machine learning, and these algorithms are often built on very natural priors. So the idea would be: try a ton of machine learning algorithms, all of the ones that were computationally feasible; try different metrics for what "fit" means, because fit is not a very exact term; say these are all of the features; throw them all into a big boosted decision tree; train this thing for a week on a 50-core machine; and then take a nap the entire time. So that was roughly my solution.

Wow. And so the solution was primarily based around the boosted decision tree, as opposed to some super complex ensemble or something like that?

Yes, actually. It's a weird story: for this competition I was so far ahead for almost all of the competition that I didn't even try.
Okay.

So for basically everything before the last week, maybe a month or a month and a half (I even started the competition late), I was so far ahead that the gap between me and second place was the equivalent of the gap between second and 15th or something like that.

Oh, wow.

So I was feeling really confident, and I actually stopped paying attention to this because I felt that this was going to be easy. But then during the last week, people started sharing their solutions: "I only got 10th or something, here are the features I used." And all of a sudden everyone started rising, basically by creating ensembles of everyone else's solutions. I think it was only one person who shared, but they had a lot of good stuff in there that other people started using, and once people are getting performance they make more of it. So people were starting to rise, and I hadn't even ensembled this far, and I unfortunately had a model that took a week to train, like I said. And I only had one week left in the competition. So I tried a few last-minute attempts at ensembling, but nothing beat my super big one-week-long model, so I just stuck with that thing, and that ended up actually winning. It was very scary, because people ended up passing me on the validation leaderboard, but then the test leaderboard was completely flipped.

Because they overfit?

Yeah. They had hundreds of submissions, while my best submission was among my first ten, because it was a very hands-off competition for me. I cared about it a lot and I wrote lots of software that I thought was nice, but I really, really thought that
would have been an absolute slam dunk.

Okay. So it was exciting, though. So where did the 50,000 features come from?

You can imagine exponential growth when you're just trying every combination of this with every combination of that. There was every metric that I could think of, and every machine learning algorithm that was computationally tractable. There were symmetric features, so you could augment your thing with difference features, because it doesn't matter which one is X or Y. There was a nuanced thing that I don't normally explain when I talk about the competition, which is that not all of the input was numerical; some of it was categorical. And you can't just throw categorical data into a numerical algorithm, so it actually becomes a complicated problem of how you compare numeric...

Different ways of calibrating your bins or something like that?

Well, it's very easy to convert numerical to categorical, but you lose a lot of information from that. So what I did was different ways of converting. Say this is a categorical-numerical pair: you convert numerical to categorical via clustering or binning or something, and when you want to convert categorical to numerical you do something like PCA, you know, get the projection onto the first principal component or something like that. And I was basically just looping through all of these things, so you can imagine a lot of nested for loops. In the end I had a bunch of them, and that ended up with 50,000 features. I also skipped a detail there, which is that I also used a feature selection algorithm in order to make it a little bit smaller, which helped performance a bit, but it ended up not being important, so I usually omit it. But for the sake of clarity, that was also done.
Okay, wow, that sounds pretty cool. Now, that was a little bit of a digression, I guess.

Complete digression, yeah.

Interesting story, though.

Yeah, absolutely. It actually generalized to new problems as well. I believe the competition organizer was applying it to some sort of biology problems, and they were showing that it could actually predict causality on that as well.

Oh, really?

So yeah, hopefully that kind of thing could be really useful.

Oh, nice. But what you were talking about here was deep learning.

Yeah, and that was not deep at all.

I didn't catch all of your talk, I caught the last bit of it, but it seemed like what you were going through was a bunch of war stories and lessons learned. You hear a lot about deep learning, but there are a lot of things that people broadly believe about deep learning that actually are false. Why don't you explain what your intent was for the talk and walk us through an overview of what you presented?

Cool. So the way I see it, there are these two competing extreme views on deep learning: "deep learning will solve all our problems" and "deep learning is complete garbage," or sorry, "it's all hype." That's kind of an exaggeration, but for exaggerated views you can say that. And there's evidence for each of these views: there are some amazing results of deep learning, and there are some extremely poor results of deep learning. The idea is that these are not as informative as the stuff in the middle. If you draw all this evidence on a one-dimensional plane and try to draw a max-margin hyperplane, you might get this interesting decision boundary, and this is where the interesting stuff lies. This is the stuff that's going to be moving slowly over time if deep learning is doing well, or the other way if people are
starting to find all sorts of failure cases. And the idea would be, if we talk about these examples at the edges of our understanding, all the things that are limiting deep learning nowadays and keeping us from solving all of our dreams, that can hopefully give people an impression of what everything else is like, because the discussion is just very extreme at the two ends of the spectrum. I feel like that's just not talked about very much because, like you said, a lot of people are on the deep learning hype train, or are kind of being sad at home and grumpy because now all of the questioners are silenced.

So if we map out what the corner cases are, and the failure modes and things like that, it'll help us push forward our understanding of this thing. Is that the basic premise?

And acknowledging it also helps. I don't think what I did was the greatest acknowledgement of it, but I think it was a more thorough one than I've seen before, and realistic. I think that sometimes just understanding your problem really well really helps you to solve that problem. I do research as well, and this stuff's very important to me, and by looking at it from a higher level I can see better that this seems like something that looks really promising to me, or this doesn't seem promising at all. For example, one of the problems with deep learning nowadays is that everything's very local.

Local in what sense?

You use the gradient, or maybe higher-order derivative things, but for practical purposes you use the gradient, and this can be insufficient for some applications. Maybe I can start with the lower level: SGD doesn't work for my spatial transformer network, this is unfortunate, let me try
Adam, let me try RMSProp. But if you go to a higher level, you realize that the problem is the local learning in the spatial transformer network, not necessarily the gradient descent.

So tell us about spatial transformer networks and what those are.

Yeah, so this is just one example I use of a kind of network where it's very easy to see the issues of local learning. It's very nice because it's a differentiable network. It's very easy to see exploration problems in reinforcement learning domains, but this is one that you have a derivative of, so it should be easier to optimize, and it is, but it sometimes doesn't fulfill its full potential.

So are you seeing that there are a lot of people coming into the space who try to throw deep learning at a given problem, where the common way of solving it is using stochastic gradient descent, and they don't really think about how that's working, that it's finding a local optimum, and that there are some problems for which they get stuck in that local optimum?

That is unfortunately the case. I have seen many people introduced to deep learning who think, "let's stitch together an architecture that's differentiable and then, bingo bango, call it a day, we've solved problem X." They realize the limitations of requiring large data sets, but they think that's what it amounts to, and I think very often it doesn't. So, back to spatial transformer networks. What they are is basically, instead of a single network that learns how to classify an image, you have two networks: one of them learns which part of the image to look at, and the other takes what that network looked at and does the classification on it. This is a huge advantage, because a lot of the time your input image might be really large and you don't want to run the network over all of it.
It might have unnecessary information. It might be really useful to co-localize, so you have the where as well as the what. So there are really good reasons to use this, and in fact, if it worked well, I would use it for everything in medical problems, number one, and number two, I'd use it for every computer vision problem. Because what these spatial transformer networks can do is not only find the region but also transform the region into a canonical location. So rather than having to learn filters of cats at every orientation, you might have to learn filters of cats at only one orientation, which would result in much better data and parameter efficiency. But back to the issue here: you have these two networks that are not competing but working together, and each is only using the current other network as its source of signal. So if your classification network gets really good early on in training, your localization network gets stuck in this optimum, because if it changes anything even a little bit, your classification network will do worse. So the gradient tells it, "hey, just stay where you are, you're pretty good," or "move around a small region," which might be very far from the intended purpose of correctly zooming all the way into the thing you care about and rotating it a lot. On the other hand, suppose the spatial transformer converges early. Imagine the classification network is garbage: the localizer might zoom into regions of the image that are independent of the class but that the classification network tends to perform a little bit better on. For example, if you're trying to classify kinds of dogs, or ImageNet, and it turns out your classifier starts out just being good at telling that grass means dog, the localizer notices and just zooms into the
grass. Zoom, zoom, zoom, grass, and basically you've cut the dog out of the image. The moment you've cut the dog out of the image you get no gradient signal, and when you have no gradient signal you're stuck there forever. This is a problem that people just don't really like to acknowledge in networks. That's actually a very complicated relationship, because now you need to maintain a balance in all of that, and I don't think people even know how to do that. People don't know how to do it with generative adversarial networks either, which is another example I gave of this.

Yeah. So what was the overall structure of your talk?

So the title of the talk was "Deep Learning: Modular in Theory, Inflexible in Practice." I first wanted to talk about the successes of deep learning, or rather to show that deep learning is very modular and can do a lot of things, and get people into the mode of "wow, we can solve everything." I actually had a somewhat bold claim to end that first part, which is that today's deep learning components can solve any problem, any computable problem, if you ignore the practical aspects. I think it's interesting to point that out, because once you isolate that, you know that the practical aspects are the issue. And those practical aspects are data, software, and optimization, probably in order of how difficult they are to understand. In the latter part of the talk I talked about these issues with deep learning, specifically data, software, and optimization, plus a final section on understanding, just because I wanted to point out that while understanding is not necessary for getting things to work, which maybe is what we care about, understanding is very necessary to make progress. And it's amazing how little we understand about anything.

Well, let's come back to
that, and maybe walk through the different sections. So, data. Walk us through the points that you were driving home around that.

Okay. From a super high level, it's that neural networks are extremely data inefficient, and they don't have to be that way. Data inefficiency is the root cause of all problems, because if we were data efficient the size of data sets wouldn't matter. The data sets we use are kind of flawed, in that they have known issues that researchers know about, issues like being noisy, or other kinds of known issues. For example, Penn Treebank is a very small data set, so making bigger networks is not very helpful because they overfit, so you should generally only publish regularization research on it, or something like that.

So you're referring primarily to the known, common data sets?

That's the kind of thing that mainstream deep learning researchers publish on to convince people, "hey, I have something cool, use my thing." And it's important, right? The alternative is publishing on a data set that no one knows about, which is also very hard to get any information from.

It's almost like a reproducibility issue, where there are elements inherent to the data set that drive towards, or require, a certain class of solution.

Yeah, it's a horrible state of affairs. You read a paper, and the paper usually has the high level; it doesn't have all the low-level details, that's what the code's for. You implement the paper exactly as it says, and it doesn't get anywhere near what they had, and you're like, "yo, what the f..." Then maybe you email the authors, maybe they eventually release the source code, and you run the source code because you don't believe them, and you're like, "wow, it just reproduced exactly what the authors said." And it turns out it just has
a bunch of magic hyperparameters: you set L2 regularization to this, you need this learning rate schedule for sure, use this optimizer, and also preprocess your data set in this way and sample it in this way. These are all things that you really want to be robust to, and you just aren't. It's a very unfortunate aspect of the world. You're put into this position where if you don't play the game you never get to the state-of-the-art results and people don't listen to you. If you do play the game, some people listen to you, but some don't, because they know the game. But then it's the only way to get people to see your thing.

And by "the game" you mean, in terms of the researchers, that they're driven to publish, winning the competitions for whichever data set they're looking at?

It's usually not competitions; it's usually that you want to get people interested in your papers. It's very different if you just didn't care and you wanted to publish interesting things. But if you want to get eyeballs, unless you're already a respected person, it's kind of what you have to do. So yeah, sometimes that kind of thing is important. I think getting a feel for a data set is a very qualitative thing, which is unfortunate in the data world: sensing when a data set is starting to get so overfit that perhaps it's not useful anymore. I feel like some researchers qualitatively feel that about CIFAR-10 and CIFAR-100, especially CIFAR-10. I'm not 100% sure about CIFAR-100.

What's that data set?

This is a data set of 32x32 RGB images. It's a popularly used baseline because it's a very small baseline.

And images of anything in particular?

CIFAR-10 has 10 classes, so 10 common classes. It's a popular data set because it's a
really small data set. With 32x32 images you barely see anything. And it's not MNIST, because people have basically decided that MNIST research is not enough, so they just don't listen to MNIST research at all. It's starting to be that way for CIFAR-10, just because we're getting to be so good on it now. And there are just known limitations that make it hard, if you have a genuinely good result, to tell people that you have a genuinely good result, especially because as you scale up it's also very computationally demanding.

And you described the data sets as being overfitted.

Oh, for sure.

Can you elaborate on that? Because I tend to think of data as being inherently...

The community is overfit to the data sets, not even the algorithm itself. There's actually this cool test that someone did, I can't remember who, where they showed four pictures and asked, "Do you know what data set this picture is from? This picture? This picture? This picture?" And many people did. CIFAR is a very canonical data set, there's a Places data set, there's a large-scale scene understanding one, and there's ImageNet, which is more general.

So you're basically saying that we know these data sets so well that we're designing solutions to them that are not generalizable, or not adequately generalizable?

People have actually reported that. Negative results are generally not reported as much, because there are just so many of them. It's a very empirical field, so maybe this is uninteresting, but it just happens so much. People have noted that the Inception architecture seems to work much better on ImageNet than it does on other tasks, and it is a pretty complicated thing, so maybe that makes sense. Or I've had friends that I talk to. I hate that a lot of my
references are friends, but the field moves so fast that sometimes even arXiv can't keep up, which I think is super awesome. Anyway, they chat sometimes about how ResNets oftentimes don't work for their computer vision architecture. Or one of the best practitioners of using convnets, a friend of mine, Sander Dieleman, who works at DeepMind, has not been able to get batchnorm to work for him. I find that really interesting. Is it because all of his other parameters are tuned to batchnorm? Is there something that he solved that batchnorm also solves, so it isn't necessary? Is he just wrong? Honestly, I don't know, but I think that there's a bunch of cool stuff there that maybe we can figure out.

And is this issue inherent to deep learning, or is it just the approach we've taken?

I would argue that it's not even an issue in deep learning. Maybe we can look at the bright side of this: it's a miracle that it even works. Going to the understanding topic, there is, as far as I know, no practical theory in deep learning, nothing that can actually guide us to understanding. There's what I call stories: every paper has a high-level story of "this is why I think it works," and if you really try to vet the story, you can very easily disprove it. I know of no story that's 100% bulletproof, so I'm willing to make that claim. So we have these stories, and they guide people, but unfortunately they rarely work out as useful tools. What we have instead is empirical results. What we want is generalization, and generalization is kind of a lofty concept. In traditional statistics you can kind of reason about that, but in deep learning it's just much harder because you have so many parameters. You can't really
measure... well, you can measure the VC dimension, but it's so big that it doesn't matter.

What's the VC dimension?

I would probably screw this up, but I'll give you my best first approximation of what it is. It's roughly how powerful your model is, and it kind of corresponds to how much data you need in order to get generalization. So very curvy, powerful models have a very high VC dimension, which means that you need a lot of data.

VC doesn't stand for "very curvy," does it?

No. I know the V stands for Vapnik, and the C stands for another person's name. Sorry. So, generalization. In the very old-school machine learning sense, the sense that I personally don't think will come back, you could have bounds on how much data you need in order to get this epsilon difference between train and test, and stuff like that. That's just not something that's going to happen in deep learning, and as long as we keep using deep learning we're probably not going to get that. So what we have is empirical results: we just have a bunch of experiments and a bunch of data sets, and we show that it seems to work on the data sets we've tried, and hopefully it works on everything. This is where you might see it as a con, but I see it as a huge positive of deep learning. It's actually super cool that it generalizes. You can get a new computer vision task (I use computer vision because that's one of the easier domains), and you can have a ton of data and just generalize; you can use ImageNet features to generalize to that. That's just not something that makes sense, if you look at it from a really strict perspective: there's no guarantee that this should work, but it tends to work,
and that's really interesting.

All right. And I think that there's something about deep learning that allows it to generalize so well. You can even generalize to domains that you've not even trained on. I think that there's been some work on generalizing ImageNet models to cartoons, and even cartoon drawings of the things that they were classifying sometimes activate, or there's something related to that.

Yeah, so it's a wonder of deep learning. There are some experimental results that try to explain after the fact why things work, but without being falsifiable, it's questionable how useful it is. So perhaps deep learning is exploiting some of these kinds of explanations. There was a recent one on physics: that the class of things that deep learning is very good at fitting is a very natural class of functions. Deep learning models can only efficiently fit a small subset of the function space, but that subset happens to be very common, based on physics, the kinds of functions that would occur.

Okay. So you started out talking about data and that overfitting problem, and then tools. Was that it, or software?

Software. There's two more things, though. There's the data we have, which is problematic: data we have and we use, like data sets. There's data we have that we don't use, and there's tons and tons of data that we have that we don't use, that I think we just don't know how to use well. Unsupervised learning, multitask learning, transfer learning we kind of use, but we don't do very smart things, I think. And even this implicit stuff, like the trajectories of the networks that you've passed through, maybe there's some interesting information there. And the last kind was the data that we don't have that we need, like, for example, measuring these things that we really care about
that we are just missing right now. We have no way of measuring how well networks capture long-term dependencies. We don't have a general RNN benchmark. We don't have a good benchmark for visual attention. We don't have a good benchmark for hierarchical learning. How do we even know if we're learning hierarchical stuff? Do we want to learn hierarchical stuff? I don't know, but I would think that if we want to learn something, having a benchmark for it would be really good. So that was roughly it for data. From a software perspective, it was more about how the tools we use nowadays really limit what we can do, and every tool is flawed in some ways. This hits home for me personally because I'm a software engineer, and I want to use really good tools.

You mean TensorFlow doesn't solve every problem in the universe?

No, not yet. I think they introduced some really good ideas. They definitely brought something to the table, but it alone isn't enough. I think better things could be built on top of it. I don't think that it's the low-level components that are a problem, and I actually don't think hardware is as big of an issue as people make it out to be, in theory. In practice, if you really want state-of-the-art results, sometimes that's needed, but there's higher-level problems that you can solve without hardware. So the idea behind software is that you can very easily see situations where the software we have actually prevents us from doing what we want to do. I think I have two examples that really resonated with me. An example of bad software is when it's easier to explain the technique in words than it is with code, because ideally you want the flow from ideas to code to be really easy, and the flow from ideas to words is
generally pretty good, and that just means you have a bottleneck in words to code. Maybe it's a reality of life that it'll never be that simple.

Did you provide a specific example?

Yes, I had a list of many examples of different kinds of tricks that are hard to do in various frameworks. So depending on the framework you use, some things can be kind of difficult. So, for...

So when you say tricks and frameworks, the basic idea being, at the research edge (I did see that you were showing a lot of papers, which is great, documenting where the ideas came from), in the research we're introducing all these various tricks to improve solvability of the deep learning networks. And what I'm hearing is the tools are, on the one hand, great: they're raising the level of abstraction and making this stuff more easily adoptable. But that also prevents us from implementing some of these tricks, which have to be plugged in at lower levels.

Yeah, exactly. When I mention "trick," I use that as a general term for one unit of thing that you do to a neural network. You can think of layers as tricks, but tricks are more than just layers. For example, an additional regularizer might be a trick, or they could be pretty complicated: I think doing unsupervised pre-training might be a trick. And the argument that I would have is that no framework makes everything really easy. Easy in this sense is that I would ideally like it such that everything just gets solved for me. This is probably not going to happen, but we can get closer, right? I would like to express very declaratively what I want this neural network to be. Literally: take this neural network in this database, apply this transformation,
run this transformation, train on this training set. I want it to be that simple. I don't think it can be, but striving towards that, I think, is good. And a lot of the frameworks, like TensorFlow, don't support a bunch of the things; it takes a large number of lines of code in order to do something rather than few.

So what would be an example?

Batch normalization is a pretty simple thing, right? Or sorry, it's actually not a very simple thing in terms of implementation, but many frameworks can do batch normalization very, very well. Torch can do batch normalization amazingly, because they just implicitly keep the state, and in Torch each of the nodes applies its updates on its own, when flowing through the gradient and applying the updates. So that's very good. But in TensorFlow, for example, in order to apply batch normalization you have to do quite a few things. If you're doing the rolling mean approximation, you need to create some state for the mean and some state for the variance, you need to make sure to apply the updates to this thing, and you need to apply the updates only at training time. Then it becomes much more complicated than just calling a layer on something, depending on how you wrap it, of course. But this kind of thing is just a layer in Torch. Every framework has its trade-offs, but I just don't think that we are at the efficient frontier yet. I think we can get benefits for free, basically, and I actually have written a few libraries that try to get these benefits for free, and I think they've been pretty successful. I'm still experimenting with them because I think there's just so much to do there, but it's an open problem.

And are these libraries standalone frameworks, or libraries that plug into other existing
frameworks?

Mostly they go on top of Theano or TensorFlow, because I think that they're both very good baselines. I'm a big fan of the computational graph. I think the design of Theano is actually quite excellent; I'm a huge fan of Theano and its developers. It has the downside of distributed computing, but I think that its abstraction level is actually quite good. It can capture that abstraction level very well. Its optimizations are things that I probably wouldn't do by hand anyway, so you get them for free. I'm more focusing on Theano. TensorFlow is similar, but kind of has a mix of abstraction levels, so I'm focusing on the low-level aspect. I think those low-level aspects are actually quite good. They might actually be on an efficient frontier of trade-offs, trading off usability versus flexibility or performance. And that's just one view: have a computational graph, have all of the basic operations in there, and optionally use an optimizer in order to do that. Another view would be the Torch or Caffe view, where you bundle up the pieces of functionality, the highly optimized pieces. That's the view you go for for max performance, which I think is also very different philosophically, but there's nothing wrong with either of these views. So I'm fine building on top of that, just knowing what you're using. It's more the level and how you construct the computational graph, which I think should be independent of Theano or TensorFlow. These are just different levels: you could have a really nice low-level thing but change the high-level thing on top of it, and it should be fine. Which is why I'm not the biggest fan of TensorFlow's many different abstraction levels,
and I think most of, well, all of the best people I've talked to who use TensorFlow kind of only use a little bit of it, and they think that a bunch of it is not the greatest: "but I don't care, I'm not using it." And it's at those high levels that I think it's very interesting, and that's also where the user interacts with it. If you're having code interact with code, it doesn't matter; you can have the ugliest interface in the world. Your compiler can just switch syntax around and all of that stuff.

Okay, so data, software. What was the third piece? Optimization?

Optimization. So I touched a little bit on it with local learning. Andrej Karpathy had a great quote, which I can't remember off the top of my head, but it roughly goes along the lines of: neural networks only do memorization, they don't do thinking. And this is problematic (this is already not his quote) because we'd ideally like them to think. We want them to do cool, complicated things that blow our minds in their coolness. And they do blow our minds already, but perhaps those things were simpler than we thought, and what's going to happen when we want to do something pretty darn complicated? We'll see. There's some tasks that we think would require some pretty complicated levels of thinking. Perhaps in playing StarCraft you need to think many moves ahead and imagine what the opponent's going to do in order to take actions, and neural nets are not very good at imagining what to do yet. Maybe that will change, but we'll see. And Andrew Ng likes to say that, as a heuristic, what neural networks can do is anything a human can do in less than one second. But if that's a hard limitation, then there's a lot of tasks that take more than one second for people to do. Will this solve general AI for us? Not when you phrase it that way,
right? So it should be possible, right? It's modular in theory: you can just have architectures that, given a magic set of parameters, would solve that task. So the question is, how do we do that? There's just many tricks on that, and I talk a little bit about the downsides of local learning: how we don't pay attention to exploration in supervised learning. Mostly it's paid attention to in reinforcement learning, but we treat it as... There is some implicit exploration, because you're using stochastic gradient descent, so your gradient is noisy. But roughly, if it wasn't noisy, you'd be plopped on a point and you'd just hill-climb down some direction and be stuck there, and you don't even know how good of a solution that is. So that can be very unsatisfying. This goes back to what I was talking about in terms of limitations: maybe local learning just can't solve this. And that would be super duper unsatisfying, because local learning is the most scalable learning algorithm we have. Using gradients is really, really good for training lots of parameters. We're going to have to make a lot of different plans if we want general AI without gradients. So we're going to have to figure it out, or we're going to have to figure out tricks on how to do this better, maybe tricks for more principled exploration. And maybe this will make it such that these won't be problems anymore, or at least we'll find much harder problems. There'll hopefully always be problems, and that's what keeps the field going.

Yeah.

But hopefully they're not intrinsic to the way we do optimization. And people are making better optimizers, even though it's quite slow in progress.

Right. So data, software, optimization, and understanding, and we talked a
little bit about that earlier. Are you going to post your slides up somewhere?

Probably. I think the O'Reilly people put the slides up somewhere, but they haven't asked me for the slides yet. I think they're supposed to do that after the presentation, which is probably good, since there was last-minute editing going on. But it'll almost certainly be up somewhere.

Okay. And if folks want to learn more about what you're up to or find you, do you have a GitHub or Twitter?

I do have a GitHub, even though that's probably not a great way to contact me. It's github.com/diogo149. Probably email would be the best way; this is something I love chatting about. It would be diogo at... oh God, my company name's hard to spell. Enlitic, which is E-N-L-I-T-I-C, dot com.

Okay, great. Cool, thank you.

Awesome.

Hey, thanks so much. All right everyone, that's it for today's interview. Please leave a comment on the show notes page at twimlai.com/talk/8, or tweet to me at @samcharrington or @twimlai to discuss this show or let me know how you liked it. Thanks so much for listening, and catch you next time. [Music]
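
Editor's sketch: the batch normalization bookkeeping described in the conversation (state for the rolling mean, state for the variance, updates applied only at training time) can be illustrated in a few lines of plain Python. This is a minimal sketch under stated assumptions, not code from the talk or from any framework: the class name `BatchNorm`, the `momentum` and `eps` defaults, and the list-of-rows input format are all hypothetical, and the learned scale and shift parameters are left out.

```python
class BatchNorm:
    """Per-feature batch normalization with explicit running statistics.

    Hypothetical illustration: the learned scale/shift is omitted for brevity.
    """

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # State that has to be created and carried between calls.
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def _blend(self, running, batch_stats):
        # Exponential moving average (the "rolling mean approximation").
        m = self.momentum
        return [m * r + (1.0 - m) * b for r, b in zip(running, batch_stats)]

    def __call__(self, batch, training):
        if training:
            columns = list(zip(*batch))  # one tuple of values per feature
            means = [sum(col) / len(col) for col in columns]
            variances = [
                sum((x - mu) ** 2 for x in col) / len(col)
                for col, mu in zip(columns, means)
            ]
            # The running statistics are updated only at training time.
            self.running_mean = self._blend(self.running_mean, means)
            self.running_var = self._blend(self.running_var, variances)
        else:
            # At inference time the stored statistics are reused unchanged.
            means, variances = self.running_mean, self.running_var
        return [
            [(x - mu) / (var + self.eps) ** 0.5
             for x, mu, var in zip(row, means, variances)]
            for row in batch
        ]


bn = BatchNorm(num_features=2)
train_out = bn([[1.0, 10.0], [3.0, 30.0]], training=True)   # uses batch stats
eval_out = bn([[0.2, 2.0]], training=False)                 # uses running stats
```

The caller has to pass the `training` flag and keep the object alive between steps, which is the extra burden the guest contrasts with a framework where the layer carries its own state.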

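Editor's note on the VC dimension exchange: VC stands for Vapnik–Chervonenkis. As a rough illustration of the "old-school" kind of bound mentioned there, one commonly quoted form is given below. The symbols are not from the conversation and are assumptions of this note: $n$ is the number of i.i.d. training examples, $d$ the VC dimension of the model class $\mathcal{H}$, $R(h)$ the true error, $\hat{R}_n(h)$ the training error, and $\delta$ the allowed failure probability. Constants differ between textbooks, so treat this as a sketch of the shape of the result.

```latex
% With probability at least 1 - \delta over the training sample,
% simultaneously for every h in \mathcal{H}:
R(h) \;\le\; \hat{R}_n(h)
  + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}
```

The gap term only becomes small once $n$ is much larger than $d$, which is one way to read the remark that for deep networks the VC dimension is so big that it doesn't matter.
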
Original Description

My guest this time is Diogo Almeida, senior data scientist at healthcare startup Enlitic. Diogo and I met at the O'Reilly AI conference, where he delivered a great presentation on in-the-trenches deep learning titled “Deep Learning: Modular in theory, inflexible in practice,” which we discuss in this interview. Diogo is also a past 1st place Kaggle competition winner, and we spend some time discussing the competition he competed in and the approach he took as well. The notes for this show can be found at https://twimlai.com/talk/8. Subscribe! iTunes ➙ https://itunes.apple.com/us/podcast/this-week-in-machine-learning/id1116303051?mt=2 Soundcloud ➙ https://soundcloud.com/twiml Google Play ➙ http://bit.ly/2lrWlJZ Stitcher ➙ http://www.stitcher.com/s?fid=92079&refid=stpr RSS ➙ https://twimlai.com/feed Let's Connect! Twimlai.com ➙ https://twimlai.com/contact Twitter ➙ https://twitter.com/twimlai Facebook ➙ https://Facebook.com/Twimlai Medium ➙ https://medium.com/this-week-in-machine-learning-ai

Diogo Almeida discusses the challenges of deep learning and shares his experiences with machine learning competitions and techniques like boosted decision trees and ensemble methods. He emphasizes the importance of understanding data, software, and optimization issues in deep learning.

Key Takeaways

Build a model with 50,000 features to solve a ranking problem
Train the model using a boosted decision tree
Create ensembles of other people's solutions to improve performance
Use a combination of numerical and categorical features
Convert categorical data to numerical using different methods
Apply Spatial Transformer Networks to image classification tasks
Use stochastic gradient descent to optimize models

💡 Deep learning is modular in theory but inflexible in practice, and understanding data, software, and optimization issues is crucial for success.
