Talks # 4: Sebastien Fischman - Pytorch-TabNet: Beating XGBoost on Tabular Data Using Deep Learning

Abhishek Thakur · Advanced ·🧬 Deep Learning ·6y ago

Key Takeaways

The video discusses PyTorch-TabNet, a deep learning architecture for tabular data that outperforms XGBoost, and demonstrates its implementation and usage for various tasks, including classification, regression, and multitask regression. The speaker, Sebastien Fischman, explains the architecture and its components, including attention transformer blocks, feature transformer blocks, and the use of masks for feature selection and explainability.
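
The mask mechanism mentioned in the takeaways can be illustrated with a small, self-contained sketch. This is not code from pytorch-tabnet: the function names, the toy scores and the choice of gamma are assumptions made for illustration, and the same scores are reused at both steps, whereas in TabNet each step computes its own scores with a fully connected layer and batch norm. It only shows how sparsemax yields a sparse mask that sums to one, and how the prior (a running product of gamma minus the previous masks) discourages reusing a feature.

```python
def sparsemax(z):
    """Sparse alternative to softmax: non-negative outputs summing to 1, many exactly 0."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # Entry k stays in the support while 1 + k * z_(k) > sum of the top-k entries.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(zi - tau, 0.0) for zi in z]


def next_prior(prior, mask, gamma):
    """Prior update described in the talk: multiply by (gamma - previous mask)."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]


scores = [2.0, 1.0, 0.1]   # hypothetical attention scores for 3 features
prior = [1.0, 1.0, 1.0]    # first prior is all ones: nothing is assumed yet

mask1 = sparsemax([p * s for p, s in zip(prior, scores)])
# gamma = 1.0 is the extreme case (assumption for illustration):
# a feature fully used once cannot be selected again.
prior = next_prior(prior, mask1, gamma=1.0)
mask2 = sparsemax([p * s for p, s in zip(prior, scores)])

# Summing the masks over the steps gives a simple per-example importance.
importance = [a + b for a, b in zip(mask1, mask2)]
print(mask1, mask2, importance)
```

With a larger gamma the prior for an already used feature stays well above zero, so later steps remain free to select it again.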

Full Transcript

Okay, so hello everyone, and welcome to episode number four of Talks. Today is very special: we have Sebastien with us. He's a data scientist currently based in France who has worked in France and Australia on topics ranging from user segmentation to real-time bidding, and from predicting stock markets using sentiment analysis to automatic machine learning. He is currently working at Damae Medical on early-stage cancer detection. It's just amazing, he's worked on almost everything. Today he is joining us, and I'm very thankful to him. He's going to talk about TabNet, which is a deep learning architecture for tabular data. Sebastien and his team have created pytorch-tabnet, which enables us to use TabNet with PyTorch, and he's one of the core developers of pytorch-tabnet. Today's episode is a little bit special in the sense that today I announced the pre-order of the Kindle version of my book, and now, in collaboration with Sebastien, I will be giving away one electronic copy of the book. This will be the first ever copy that I am giving away. All you have to do is one simple thing, and Sebastien will tell you what it is at the end of his talk. So listen to the talk, and if you have any questions related to it, ask in the YouTube chat. Now I'm going to hand it over to Sebastien, so it's all yours.

All right, thank you very much, Abhishek, for inviting me. I'm going to share my screen with the presentation, so I hope everyone can see it. Today I'm going to talk about TabNet and pytorch-tabnet, and the main idea is: can you actually beat XGBoost on tabular data with deep learning? We'll find out during the talk. First I'm going to talk about TabNet, which is a paper from 2019 called "Attentive Interpretable Tabular Learning". So what about this paper? It's written by Sercan Arik and Tomas Pfister from Google Cloud AI. It's a very recent paper: the first version is from August 2019 and it's still under development, since you
can see there's a fourth version from February 2020. There's also the official code in TensorFlow available. I'll share the presentation somehow, so you can click on the links and see everything. The main statement of this paper is that neural networks have established state-of-the-art results on complex topics like computer vision and NLP, but not on tabular data. Nowadays XGBoost, LightGBM and CatBoost are still winning most of the Kaggle competitions for tabular data. So the idea is: how can you fill this gap? It's also important to fill the gap without losing explainability. Nowadays you want good performance, but you also want to be able to understand what's going on with your model. This is the main statement of the paper.

I've put here a very short summary of the main ideas of TabNet. The first very important thing is the attentive transformer block. It enables instance-wise feature selection, which is good for performance but also gives you built-in explainability derived from the masks, so it's trying to beat the other algorithms with this. Using attention also gives you a more efficient learning capacity, since you're only focusing on your important features. There's also the feature transformer block, which is quite usual for neural networks, but here you can see that it uses not a ReLU activation but a gated linear unit activation. There's also something quite innovative, which is sharing some layers inside the architecture; we will go deeper into this later. And there are the sequential steps, which mimic ensembling to boost the performance and also allow your model to have a larger capacity.

The results from this architecture and this paper are quite amazing. This is what the authors say in the paper. For the Forest Cover Type dataset, which is a nine class classification
problem, TabNet outperforms XGBoost, CatBoost and LightGBM by more than seven percent in accuracy. On the Poker Hand dataset, where you need to learn what kind of hand you have in poker, you get more than thirty percent higher accuracy with TabNet than with XGBoost, CatBoost and LightGBM, so this is huge. For Rossmann Store Sales, which I think is a Kaggle dataset, TabNet also outperforms them in mean squared error; this one is a regression problem. The authors also say that they can beat XGBoost and CatBoost on four different classification datasets from the KDD datasets. So it's quite impressive, and when I read this I was probably like you right now, thinking: okay, this is just too nice to be true. So I wanted to try it myself. What I can say is that the claimed results for TabNet are actually reproducible, so it's working, but the boosting scores given in the paper are actually underestimated. So it's not going to be a huge improvement, but it's still an interesting paper.

Now let's turn to the architecture. Basically, this is all you need to know about TabNet. The architecture starts with a batch norm layer: you take your raw input features and they go through a batch norm layer. Then you have different steps, and all the steps are the same, so if you understand what one step is doing, you understand the entire architecture. You have feature transformer blocks, and you have attentive transformer blocks that create a mask that will mask your features for the next feature transformer block. Each feature transformer creates both predictions and the input for the next attentive transformer. You can see that the outputs of the steps are summed, so every step adds its predictions, and you have a final layer here that allows you to handle any type of problem. You can do multi-class classification, binary
classification, regression, whatever you want. I'll get into the details later, but you can also use the masks to get the feature attributions, so the masks give you information about what your model is using to make its predictions.

Let's go into each step to understand it better. For the feature transformer block, this is an example of four consecutive GLU blocks. One GLU block is just this part: a fully connected layer followed by a batch norm layer and then a gated linear unit activation. The gated linear unit activation is very simple: it's just a sigmoid times the input features, and this is an element-wise multiplication, so it's a pretty simple activation. You have several gated linear units, four here, and you can see that there are shared blocks and independent blocks. The idea is that the first two GLU blocks are shared across all steps and the others are step-dependent, so the model is sharing some weights across the different steps. You can also see that there is a skip connection here: at every GLU block you just add the initial input to the next step. This allows you to train deeper models and gives you smoother training.

About the input size, I think it's important to understand this. The input you get is your initial features, so you have n features initially, and the output size is set by two parameters that you can choose. One is n_d, the decision dimension, and the other one is n_a, the attention dimension. As you saw before, there's a split here: the decision part goes through the ReLU and the attention part goes to the next step's attentive transformer.

So what is an attentive transformer? It's this, so pretty simple. You have a fully connected layer, then again batch norm, and
then you have a prior scale, which I'll explain in a moment, and then the sparsemax. So what is the prior and what is it doing? At the beginning, your first prior is going to be all ones. The prior tells you what you know about the features and how much you have used them in the previous steps. At the beginning you don't know much and you don't assume anything about the different features, so you set everything to one. For the next steps, your prior is going to be a product over all the previous steps of gamma minus the previous masks. Gamma is another parameter of this architecture, and it is always greater than one. If you set gamma close to one, you'll see that every time your mask is close to one you get something very small here. This stops the model from using the same features at every step. So if you set gamma close to one you get different features at every step, and if you set a large gamma you can reuse the same features at every step.

What is this sparsemax function over here? The sparsemax function is like a softmax function but sparser. We're not going to go into the details of the mathematics, but what we can say is that sparsemax outputs probabilities that sum to one, and a lot of dimensions will be zero. So after this layer you're going to output a lot of values that are zero, and the rest are going to sum to one. That's how you create the mask, and the mask is directly applied to the input features. About the sizes here: you get n_a as the input size, and then you go back to the number of features so that you can apply the mask for the next step.

At the end, if you just want to do predictions, it's quite easy. You can stack as many steps as you want; the more steps, the bigger your model is, obviously, so it's going to train longer. The idea is that
each step has a specific mask, so it uses its own features. When you make a prediction, every step just outputs its values over there, and you can see that every step's output is summed, so it's quite straightforward how it works. When you want to look at the explanation, basically what you do is sum all the different masks from all the different steps, so you can understand what the model was looking at when it made the prediction. Basically it's very easy: if a feature has been masked a lot, it's not really useful, but if it has been used a lot, then it's an important feature. We'll have an example of this later.

This is a way to represent individual explainability. What's interesting with this architecture is that your importances change with every single example you give. This is from Zachary Mueller, who is using TabNet, and you can see that for different samples you have different mask responses. So this is a way to really have individual explainability with the model. This is not really easy to get with other machine learning models, and with this architecture it's way easier.

There's also a part which is not implemented in pytorch-tabnet, but I'm talking about it since it's in the latest version of the paper. The idea is that if you have a very small dataset or not enough labels, so if you are in a semi-supervised problem setting, you can try to train your TabNet model without using the labels. How can you do this? You just randomly mask some inputs and try to predict them back at the output. This is a way of pre-training your weights so that the model performs better. I think this is also an important feature because, even though deep learning usually needs quite a lot of training data, with this kind of technique you don't actually need that many
examples. So that's it for the article. I guess all those new ideas are great and it's really amazing to see this, but can we really trust and reproduce those results? The TensorFlow code was not available the first time I read this article. Now it is, but it's not really reusable and it's hard-coded for some specific datasets. So the question is: how can we simply try this model on new datasets, for Kaggle competitions for instance? Since it's supposed to be such a great architecture, why can't we just use it on new datasets? That's where pytorch-tabnet comes in. This is obviously an open-source implementation of TabNet. First, I wanted to thank Eduardo for helping me and doing about half of the work to create this architecture in PyTorch. I would also like to thank Bezier, who helped us get this packaged nicely, and I wanted to thank DreamQuark, my previous company, as we did this during our research time there, so thank you for that. I would also like to thank all the contributors. It's really an open-source project, so feel free to contribute. The link is here, and it's easy to find.

So what about pytorch-tabnet? It's open source, and again, feel free to contribute, but it's also really easy to use. You can pip install pytorch-tabnet, so it's really easy to install it and try it on another dataset. It's also scikit-learn compatible, so you really don't need to learn a new way of doing machine learning: it's as easy as any scikit-learn model. It's also GPU and CPU compatible, so you don't have to worry about whether you're using your GPU correctly or whether you need to switch to your CPU. Everything is automatic, so you shouldn't have any problem. Another interesting thing is that our implementation allows you to do classification, binary and multi-class, but also regression and multitask regression. So this is
something quite hard to find; I think you can't do this with XGBoost. So if you have a multitask regression problem, give it a try. We also made all the parameters easily accessible, so it should be fairly simple to do hyperparameter tuning. There's also a fastai wrapper: the guys from fastai saw our project and created a wrapper, so if you're using fastai as well you can just try the implementation through their wrapper.

So what can you expect? First, I want to make clear that this is not an official Google implementation. I didn't talk with anyone from Google. I would be very happy to do so, so if there are some Googlers watching this, I'm happy to get their feedback. What I can say is that the results from our implementation are comparable to the results claimed in the paper, so everything that is written there you can do with our implementation. I also need to let you know that training is probably going to be long: it trains longer than an XGBoost model. It's going to be faster if you have a GPU, obviously, and it's also going to be faster if you have problems with a large number of classes, because XGBoost gets really slow then. About competitive results, I can't make any promises here. It has shown some competitive results on other datasets, so what I can tell you is that you can just try and see; we made things very simple to try.

So let's now get to a demo. I've put some links here, and you'll be able to look at them when I share the presentation, but I wanted to show you the main tutorial. This one is the census example, so it's the Census Income dataset from this... yeah, sure. Better? All right. So this dataset is very common in research, and it's one they talk about in the paper. The idea is just downloading your dataset, then you read it. Here we're just going to make a very simple split with eighty
percent of our data in the train set, ten percent in the validation set and the rest in the test set. Here is what it looks like. You can see that there are no headers, so I can't give you the names of all the different features, but you can see that you have features like where you work, where you live, probably your marital status, things like that. What we're trying to predict is whether your income is larger than 50k.

About the pre-processing: you don't need any specific processing for using TabNet. The only thing you need is to get rid of all the strings, so here we're just doing label encoding on basically every feature. You can see that you need to encode the features and, if you want to use embeddings, you need to say which features are categorical and what their dimensions are. This is something you have to do, and it's very simple; it's coded like this. That's it for the processing.

Then it's very simple to call a TabNetClassifier. I didn't show it before, but you can just import the classifier like this and then create it like this: TabNetClassifier, then you specify your cat_idxs, cat_dims and the size of your embeddings. If you say one, it's going to be one for all your categorical features. If you have a look at this, you can see all the available parameters. Some of them you shouldn't really touch, like the random seed here. One important thing is that we've fixed everything, so if you run TabNet twice it's going to give you the same results, which is quite interesting; you can play with the random seed here. But for example, here you can decide how many independent layers you want for every step, how many shared layers, how many steps you want, the momentum. These are parameters that we made available, but you probably shouldn't tweak them too much. Then you
have the size of the attention and the size of the decision. So this is very simple, just changing some parameters. Then you create your training set with your training labels, the validation set, validation labels, et cetera, very easy. And then there is the fit part, which is much like how you would use XGBoost. You have new parameters for fitting; those are deep learning parameters like the batch size and the virtual batch size as well. That one comes from ghost batch norm. I'll talk about it a bit later; it's not very important. Basically, when you run this, what you'll see is this training happening. It's a bit long, but you can see you can train this in two hundred sixty seconds, so it's about five minutes, I guess. Here you can access all the history of your training. Here is the loss, so this is the log loss for binary classification, and when you look at the metric, this is the area under the curve, so the ROC AUC score. If you want to do prediction, it's really easy: just call predict_proba to get the probabilities, like in any scikit-learn algorithm. You can see that here we have a test result that is fairly similar to what we had in validation. You can access the global feature importance by simply calling clf.feature_importances_, so very easy, and you can also get explanations. The idea is that when you want to understand why your model is taking a decision, you can use the prediction but also use the masks to get your explanation. It's that easy: just call this and you'll get all the masks and the explanation matrix. This is an example of what's going on with three different masks and fifty examples. You can see that the first mask is masking different features each time, so it's doing something very different for each input. The next one is probably focusing on these three features, and the last one is mainly focusing on only two
features. So this is good for understanding what's going on. In our examples there's always an XGBoost classifier at the end, just to make things fair and to let you train different models and see what's going on. Here you can see that the test score is 0.9206. I think we had slightly better results with TabNet, but it's really not a huge performance gap. Still, it's competing with XGBoost, so that's cool, and it's very easy to use. If you look at how many parameters you have for TabNet, you have a few, but if you go down and look at all the parameters you get for XGBoost, it's quite a lot as well. Everyone knows about XGBoost, but if everyone knew about TabNet, we would probably be able to use it and fine-tune it very well.

You have other examples, and all of them are in the repo, so if you go to the repo you can find all those examples. This one is Forest Cover Type. It's doing the exact same thing as the paper, so you have here the exact same splits as the ones in the paper. I didn't run it because this one is a big dataset and I'm on a very small laptop, so it's going to take ages, but if you have a GPU and you want to train this, you can see how it's performing and you can try to reproduce the paper's results. There's also a regression example. It's as simple as saying I don't want to use a TabNetClassifier but a TabNetRegressor. So just import TabNetRegressor and then everything is the same as before: here it's called TabNetRegressor, you train your model and then you get the results. For multitask regression you don't have anything to do: if your targets are multi-dimensional and you're asking for a TabNet regression, it's going to make everything work without any problem. So yeah, I think that's it for the tutorials. Also, you can see that
it's very easy to try. Some people have already started using it in Kaggle competitions just to give it a try, so please have a look at the different kernels using TabNet and let me know about the results somehow, by sending a message on the repo.

I can show you a little bit of the code itself. I won't go into the details because it would take too long, but it's very simple. In the repo there are two main files. The first one is tab_network, which defines the architecture itself, so you will find all the different parts I talked about. This is the ghost batch norm, which you can find in the paper as well. There is a TabNetNoEmbeddings class, which is just a TabNet without any embeddings, and basically you'll have all the steps: feature transformers, attentive transformers and everything, and then the forward that makes everything work. If you go into the code, you'll see that implementing it was not that easy. There are lots of different tricks, lots of things that are not clear from reading the paper and that you can only understand when you're trying to implement it, so have a look at the code. Then there is tab_model, which does all the training. So tab_network is for the architecture, and tab_model defines the classes that are easy to use with all parameters accessible. That's where you'll find TabNetClassifier, TabNetRegressor and all the basic blocks of how to train a neural network: you have the fit part, how to train a network, things like that. So this is the main class, and you'll find each specific class later on. When you define TabNetClassifier you'll need to define different functions. I won't go into the details, but the code is available, so go and have a look, no problem. So I
think I'll come back to my slides now. About the tricky implementation, I just wanted to give some advice about what was hard for us to implement. First, when you see the drawing of the architecture, you think: I need to share everything in the shared layers. But if you share your batch norm, nothing is going to train. This is something that is not clear from the paper: you can't share the batch norm, because the inputs are not going to be the same at every step. So if you share the batch norm, it's not going to work. Also, ghost batch norm is quite a mysterious concept. It comes from a research paper available here, and there is no official implementation in PyTorch, so we struggled a bit to get one working. I also didn't know why there is a weird multiplication in this skip connection. We put it in, but I still don't know why it's there and whether it helps. Those are just things you do when you're trying to make it work, and I guess the research paper says to do so, so we're doing so in this implementation.

I also wanted to say that we tried to do something more than the research paper, because this is research work. For example, about the embedding dimensions: in the paper they say it could be better in terms of performance to have larger embedding dimensions, and I guess this is very true, but they don't do it because they don't know how to do the explanation if they change the dimensions. What we did is that we allow users to use any embedding size, and we sum the different dimensions that come from the same embedding to give you an explanation. So if you're using an embedding, you don't see any difference in terms of explanations, but you can use any embedding size you want.

So this is mostly the conclusion. I think the idea is that this is work under construction, so you can all
participate if you want, and I just put some ideas I have here. The first idea is that it would be great if we could have a benchmark framework, so that we can easily try new implementations and new algorithms and say: okay, this was a good idea, this works well. I think this is something that is missing in research on tabular data at the moment, and that's why I had to implement the framework to see whether the results are reproducible. So it would be better if there was a benchmark framework. We're also thinking about adding callbacks to our implementation, so that when using pytorch-tabnet, if you want to use a specific custom loss or change the early stopping metrics or things like that, you can do it easily. We're working on it so that you can use pytorch-tabnet in any setting.

Those are more research topics. I've created issues in the repo explaining each of those research topics a bit more. This one is changing the attentive transformer input. One problem I see with the current implementation, and the current paper, is that the feature transformer takes as input the output of the mask, so every step has masked input; they don't see the unmasked data to make their choice. So it seems like a loop: you're masking something and then you're asking, would the other features be useful to you for this problem? I don't think you can answer correctly if you don't see the other features. One other problem I see is that the masks now sum to one, but if you apply a mask that sums to one, you change the feature values for the feature transformer that comes after. For example, how can a feature transformer tell the difference? If someone is 50 years old and the mask says that the very important feature, age, is only important at 0.5, then the feature transformer will see someone who is 25 years old. So this is a bit weird to me. Maybe it would be better to try to switch from sparsemax masks to
binary masks. I'm not sure how easy it is to do, or if we can just set thresholds. There is an open issue about this, so if you want to have a try, just code it. I also started implementing a fork doing embedding-aware attention. The idea is that right now the masks are independent for all the different features, but if you're using embeddings with large sizes, you don't want to both train your model to create embeddings and then have a mask that masks most of those dimensions. So the idea is to create an embedding-aware attention. This can be done with scatter techniques, and I've created an issue for it as well. Another idea was to create a boosted TabNet. Currently the outputs of the sequential steps are just summed to make a prediction, but what if each step tried to predict something different, doing a residual step? The first step predicts the raw label, the next step tries to predict the difference from the previous steps, et cetera. This would be more like what XGBoost is trying to do with different trees. I think this is an interesting way to try new things with pytorch-tabnet, so if you have time and want to try, please just do it. And that's it, thank you very much.

I think I'm going to disclose the way to win Abhishek's book. Do you want me to say it now, Abhishek? So, I hope you understood what I was saying, but if some things are not clear you can come and ask your questions on the repo. You just go to the pytorch-tabnet repo, where you can create issues. So you go and create an issue if you have any questions. You'll probably go into bug report, but you just write your question here, so "my question", anything, and then you set the label; there is a specific label saying "Abhishek ebook". I will try to answer all the questions, and the best question asked on the repo is going
to win one copy of Abhishek's e-book. So don't hesitate, just come and ask questions and I'll try to answer them, and I hope one of you is going to be the winner. Thank you very much.

And yeah, so now you guys know how to win one copy of the e-book. We've got a lot of questions for you already, so I will start with this one: do you have any explainability output or examples?

If it's going to be quick, well, I guess this is one very concrete example. You have the global explainability, which is what you find in any model, and this one is one example: this is the explainability of one mask, this is the explainability of mask one, mask two, and this is exactly what the model outputs, so you can plot it the way you want. I think some people don't understand this drawing, but it's basically just the feature importances for every example, so it's a local explainability.

Okay. Some other questions were about the slides: are you going to share the slides?

Yeah, I'm going to share the slides. I'll need to see with you what the best way to share them is, but yeah, definitely.

People are still asking about masks, so let's talk about them later. What are the hardware requirements? Can we run it without a monster GPU?

I mean, during the talk I launched this notebook, and you can see that this is a very, very small laptop; I've got something like four cores and 8 gigs of RAM. So yeah, it runs on any computer. You'll just have to be patient if you have a bad computer, and it's going to be way faster if you have a GPU.

Fast TabNet?

Yeah, so fast_tabnet is the fastai wrapper I talked about. It's really our code in the back end, and it's just a wrapper for fastai. They deal with the embeddings just a little bit differently than we do, but it's our code wrapped into fastai.

Okay. Can we
also use TPUs with fast_tabnet?

I don't know.

There were some questions about step size. Is there any rule of thumb for the number of steps, like based on the number of features?

Yeah, so can you still see my screen now? Yeah. So the idea is that in the repo I've tried to explain all the different parameters and also put some typical ranges for the values. Typically for the steps you can go from three to ten; I would say two to five is a good starting point. And if you go to the research paper as well, they have a nice appendix with some tuning tricks, so you can have a look at this.

Okay, yeah, that's great that you have listed out some of the parameters. Can we use it for time series data?

So I guess you can use it for time series data the same way you would use XGBoost for time series data.

Yeah, exactly.

The architecture has nothing specific to time series, but you can still create an application and try it. No, it's not specific to time series, but you can use it as you would XGBoost.

Okay, and what about the virtual batch size? Why do we use it?

We use it because it's written in the paper. One thing I didn't say that is important is that with deep learning methods you usually set very small batch sizes; sometimes if you go above 64 it starts to be a large batch size. But here a thousand is quite small, so you can try up to 16,000 as batch size. The virtual batch size is actually going to make smaller batches in the background, but a large batch size is going to allow your model to train quicker. So it's a way of having good results and training faster. That's how the ghost batch normalization works.

Okay, and can it be used for multi-label classification tasks?

You mean multi-class? Yeah, I guess so. You can do binary classification, multi-class classification, and multitask regression, but we can't do multitask classification for
the moment. But it might be something we can add; it's not very difficult, so again, it can be added.

Now there are a bunch of questions about feature engineering, so I'm just going to ask all of them and then you can probably explain everything in a gist or something. Does this minimize feature engineering? Are the results from the paper on raw features, or was there any kind of data preparation involved? How do we do the feature selection? Is TabNet good at making its own features, its own feature engineering? And then there are questions on the feature importance method that is implemented, and how it performs this selection of features.

Okay, so about creating new features: feature engineering for tabular data, I think, is always going to be important. If you create a new feature that works well for XGBoost, I guess adding it to TabNet will help as well. One thing that TabNet probably does not need is feature selection. If you want to reduce the number of input features, maybe it's not needed with TabNet, because the masks are going to select the best features themselves, so you don't need to reduce your number of features to get the best results. But it's not a rule. And about the importance: the importance is coming directly from the masks, and it's a multiplication. It's a bit complicated, so I won't go into the details, but if you have a look at the paper you'll understand.

Okay, there are many more questions, and I guess if you have time later on you can probably take a look at them; people are already aware of how to ask questions. One question from my side: you worked in the real-time bidding industry. Do you think these kinds of models can be deployed in industries where we really need a fast response time, or are we still stuck with logistic regression?

It's not a fast algorithm, so if you really have time issues I
wouldn't go for TabNet right now, even if the forward pass is quite fast. I think it's important if you want to explain every response.

Okay, so it's more for the finance industry.

True.

And do you have any benchmarks against LightGBM or XGBoost where people can see them, and also against RAPIDS?

As I was saying, there is no official benchmark of our implementation, but all the examples here are in the repo, so you can try them. You can also have a look at what the researchers from Google said: in the paper you have different tables with all the results. They are trying to make the benchmarks to get accepted to interesting conferences.

So many questions coming, but I will ask one final question if you have time for this.

Yeah, sure.

Can you comment a little bit on how to handle data imbalance issues and regularization using TabNet?

Okay, so there is one parameter that will help you with an unbalanced dataset. Let me find it for you; I think it's going to be a fit parameter. When you're doing your fit, you have the weights here. If you set it to 0, nothing will happen, it's just going to train normally. If you set it to 1, it's going to be class balanced, so every class is going to be weighted by the inverse of its frequency in the dataset. You can also just put a dictionary saying "for this class I want this weight", so it's like in XGBoost: you can use the weights parameter for training the same way you would use it for XGBoost. What it does in the background is sample from the dataset more frequently according to the weights you ask for.

Okay, I think there are still many questions coming in. I think we don't have time for that; we're already a little bit over time. But thank you very much, Sebastien. pytorch-tabnet looks very interesting and I'm eager to give it a try on the next tabular dataset I encounter. Thank you for working on it and making it
available to everyone.

Well, thank you very much. It was a pleasure for me to present this and to be here with you. Thank you.

Thank you.
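The class-weighting behaviour described in the last answer (0 for no reweighting, 1 for inverse class frequency, or a dictionary of per-class weights, all applied by sampling) can be sketched in plain Python. This is only an illustration of the idea under those stated assumptions: the helper names `sampling_weights` and `draw_batch` are hypothetical and this is not the library's own code.

```python
import random
from collections import Counter


def sampling_weights(labels, weights=1):
    """Return one sampling weight per example (hypothetical helper).

    weights=0    -> uniform sampling, nothing is rebalanced
    weights=1    -> each example weighted by 1 / (count of its class)
    weights=dict -> explicit weight per class label
    """
    if weights == 0:
        return [1.0] * len(labels)
    if weights == 1:
        counts = Counter(labels)
        return [1.0 / counts[y] for y in labels]
    return [float(weights[y]) for y in labels]


def draw_batch(labels, batch_size, weights=1, seed=0):
    """Draw example indices with replacement, proportionally to their weights."""
    rng = random.Random(seed)
    w = sampling_weights(labels, weights)
    return rng.choices(range(len(labels)), weights=w, k=batch_size)


# Toy dataset: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
batch = draw_batch(labels, batch_size=1000, weights=1)
share_minority = sum(labels[i] for i in batch) / len(batch)
print(f"share of minority class in the sampled batch: {share_minority:.2f}")
```

With inverse-frequency weights, both classes carry the same total weight, so the minority class should make up roughly half of each sampled batch instead of one tenth.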

Original Description

Talks # 4
Speaker: Sebastien Fischman (https://www.linkedin.com/in/sebastienfischman/)
Title: Pytorch-tabnet: Beating XGBoost on tabular data with deep learning?
Abstract: #DeepLearning has set up new benchmarks for Computer Vision, NLP, Speech, and Reinforcement Learning in the past few years. However, tabular data competitions are still dominated by gradient boosted trees (GBTs) libraries like XGBoost, LightGBM and Catboost. Tabnet is a new promising deep learning architecture based on sequential attention transformers proposed by Arik & Pfister that aims to fill the gap between GBTs and neural networks. Pytorch-tabnet is an open source library that provides a scikit-like interface for training a TabNetClassifier or TabNetRegressor. Its ease of use allows any developer to quickly try a #TabNet architecture on any dataset, hopefully setting up new benchmarks.
Bio: Worked as a Data Scientist in France and Australia on very different topics:
- user segmentation based on their shopping habits for WoolWorth @Quantium
- real time bidding advertising @Tradelab
- stock market predictions based on sentiment analysis from social media @SESAMm
- auto ML platform with explainable AI @DreamQuark
- now working on early stage cancer detection on new OCT-3D images @DamaeMedical
To give a talk in Talks, fill out this form here: https://bit.ly/AbhishekTalks
----
Follow me on:
Twitter: https://twitter.com/abhi1thakur
LinkedIn: https://www.linkedin.com/in/abhi1thakur/
Kaggle: https://kaggle.com/abhishek


PyTorch-TabNet is an open-source PyTorch implementation of TabNet, a deep learning architecture for tabular data that aims to compete with gradient-boosted trees such as XGBoost. The video explains the architecture and its components, and demonstrates its implementation and usage for various tasks. The speaker discusses the use of attention transformers, feature selection, and explainability, and provides tips for tuning and optimizing PyTorch-TabNet models.

Key Takeaways
  1. Implement PyTorch-TabNet for tabular data
  2. Use attention transformers for feature selection
  3. Apply explainability techniques to deep learning models
  4. Fine-tune PyTorch-TabNet models for specific tasks
  5. Use pre-trained models as a starting point for fine-tuning
  6. Apply techniques for handling data imbalance and regularization
  7. Optimize hyperparameters for PyTorch-TabNet
  8. Use PyTorch-TabNet for classification, regression, and multitask regression tasks
💡 PyTorch-TabNet makes the TabNet architecture easy to try on tabular data, where it can compete with XGBoost and other traditional machine learning models; the speaker notes there is no official benchmark of the implementation yet.
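One of the hyperparameters discussed in the Q&A is the virtual batch size used by ghost batch normalization: a large batch is split into smaller "virtual" batches, and each one is normalized with its own statistics. A minimal sketch of that idea on a one-dimensional batch follows. It is a simplification under stated assumptions (no learnable scale or shift, no running statistics, plain Python lists instead of tensors), and the function name `ghost_batch_norm` is hypothetical.

```python
from statistics import fmean, pvariance


def ghost_batch_norm(batch, virtual_batch_size, eps=1e-5):
    """Normalize a 1-D batch chunk by chunk (illustrative sketch only).

    Each chunk of `virtual_batch_size` values is standardized with its own
    mean and variance, instead of using statistics of the whole batch.
    """
    normalized = []
    for start in range(0, len(batch), virtual_batch_size):
        chunk = batch[start:start + virtual_batch_size]
        mean = fmean(chunk)
        var = pvariance(chunk, mu=mean)
        normalized.extend((x - mean) / (var + eps) ** 0.5 for x in chunk)
    return normalized


# A "large" batch of 8 values, normalized in virtual batches of 4.
batch = [float(i) for i in range(8)]
normalized = ghost_batch_norm(batch, virtual_batch_size=4)
print(normalized)
```

Because each virtual batch is standardized separately, the two halves of this toy batch end up with the same normalized values even though their raw values differ.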
