Serverless NLP Model Training

Data Skeptic · Intermediate · 📐 ML Fundamentals · 6y ago

Skills: ML Pipelines 90%, Supervised Learning 60%

Key Takeaways

The video discusses building a serverless, scalable, generic machine learning pipeline with a focus on NLP model training, covering architectural solutions and design choices.

Full Transcript

[Music]

Hi, I'm Alex Reeves. I work at Data Skeptic as a data guy who wears a lot of hats: data scientist sometimes, data engineer other times, just whatever the situation calls for. I got my PhD in neuroscience from UCLA at the end of 2016 and knew that I wanted to dive into data science right after that. The story of how I got in touch with you, Kyle: right after I got my PhD I did work as a postdoc, which is what you do in academia after your PhD, and it was basically just finishing some projects that I began during my PhD while looking to get into data science, the industry side of things, you know, out of academia and into the industry side.

There are many listeners either currently in grad school or in a postdoc role like you were who are thinking about the same transition. Do you have any thoughts about what they should know going into that that they might not know now?

Oh yeah. So first, I think you just have to have some confidence in your abilities. It can be a little nerve-racking to jump from academia into data science because you're not sure how your skill set translates directly into data science. That's how I felt, but I just kept telling myself, you know, I figured out a lot of very difficult concepts during my PhD. I think that skill set in itself is a testament to my ability to learn things, and so if I just keep at it, things should work out. And so I just had faith that they would. Practically speaking, what really helped was things like translating my skill set into data science related skills. You know, taking my CV, like curriculum vitae, I'm not sure of the pronunciation on that, and turning it into a resume that's relevant for data science was a good exercise in identifying what I had and what I didn't have going forward. Taking the parts that I didn't have: I didn't have a lot of experience using GitHub and making commits on repositories, so I started building a portfolio of projects on GitHub where I could demonstrate, you know, my
ability to work with repos and also demonstrate programming ability, because I knew that was something that is hard to convey in a resume. And so I made an effort to have some projects that people could view on my GitHub to give them confidence that I was capable as a programmer. And I made sure the projects were not your standard Titanic. If you go on Kaggle, the data science competitions website, the first tutorial they have you go through is the Titanic dataset, where you do an exploratory data analysis. But that's been done so many times, so I made sure I didn't do a project like that. I did something a little more custom. One of them was to extract all the relationships between ingredients in a cookbook and show a network of which ingredients go best with each other. I did another one where I pulled a bunch of Reddit comments and figured out which ones had been removed by moderators and which hadn't, and turned it into a text classification project where I predicted whether a future comment would be removed. Those were custom enough that I felt like I had to do some thinking about data problems, and so I was proud to put that on GitHub and felt like that could show my ability to be creative and to program.

One interesting thing about your NLP project that you were just mentioning is that it was in what we might call the BB, the before-BERT days.

Yeah.

Which you've obviously got some exposure to. Can you maybe talk a little bit about how you approached the problem before such a tool was available, and maybe how innovative that was to the type of work that you've been doing?

What I was doing at that point was just reading through a lot of what people had done in similar text classification competitions on Kaggle. What that amounted to was constructing word frequency vectors, like constructing TF-IDFs for, let's say, a limited subset of 3,000 unique words in a dataset, and then using various algorithms on
the TF-IDF vectors to construct classifiers for many, many documents. Let's say each Reddit comment is a document, or a row in this dataset, and the Reddit comment has associated with it a relative frequency for each word in the vocabulary we're considering, like a 3,000-word vocabulary. And yeah, no, it's changed since BERT, though.

Yeah. So we're going to talk about one of the big projects you've delivered in the last year. To kick things off, can you give a summary of what we were trying to accomplish and sort of the high level of what we built?

I think we're talking about the serverless machine learning pipeline that we built for the chatbot that we're working on, right? I think we started talking about it back in May, and it all began with the excitement over the BERT language model, because it could just generate these great feature vectors automatically. The BERT language model could generate these fixed-length vectors that had a lot of great features just embedded inside of them, and these vectors were kind of domain agnostic. It was really exciting to think about the possibilities because it could handle so many different kinds of text.

Yes, so it's interesting to make a comparison there between the BERT vector values and the TF-IDF frequency vectors you had before.

It's like the TF-IDFs are kind of looking at single words at a time, whereas the embeddings are considering the context of the words: not just what word is in the sentence, but what other words are around that word in the sentence. And so that added layer of complexity is a much closer approximation, I feel, to the way that we use language, right? I don't know if you'd call it understanding, but it's got more of an understanding component to the embedding than a TF-IDF, where it's just kind of counting up the words in the sentence.

Yeah, you made a really interesting point, that we're kind of trading off interpretability for complexity. The TF-IDF
vectors, you could look at them and you know exactly how they were calculated and how to, to a certain degree, interpret them. Actually to a fairly good degree, I would say. Whereas with BERT vectors it's like, oh, these are just magic numbers that I trust because they work really well.

Yeah.

So naturally that model has to perform better, otherwise why would we take the trade-off? What are your thoughts on the approach to that trade-off? How much improvement did we see? I know we didn't do a strict A/B test in this project, but roughly speaking, since you've worked in both worlds, in terms of the user experience of someone interacting with a model, how much improvement would you notice between a TF-IDF vector based approach and a BERT-based approach?

I mean, there are a lot of factors to consider about how to make the comparison completely fair. But if I'm just going off of what the experience was like, the BERT-based approach took much, much less effort. There was no real grid searching happening. It was just the default settings for an XGBoost algorithm. Oh, so I should mention, the comparison that I did was to take a TF-IDF plus XGBoost approach to the Reddit comment dataset, where I'm trying to predict whether a Reddit comment was going to be removed or not, and similarly take BERT vectors generated from the Reddit comments and then construct an XGBoost model that uses the BERT vectors to classify, or infer, whether a Reddit comment would be removed. So the difference: constructing all the TF-IDF vectors and considering the different feature engineering approaches, like how to handle whitespace and formatting characters, took a lot of time with the TF-IDF stuff. The BERT-based approach took almost no consideration on my part other than to generate the vectors and load them into the XGBoost algorithm. It took much, much less time to get a
very similar result in the end. I think the AUCs were both around 0.8, and it took significant effort on my part to get the TF-IDF up to that, whereas with BERT it was basically first time.

Let's talk through the core implementation before the serverless component, just using BERT. I know we've been talking about it a lot on the show, but I'm sure not all listeners have gone and tried it out. If today's the day for somebody, what do they need to do? Where do they go, what do they do in a Jupyter notebook, and what can they expect?

I thought a very smooth introduction to using BERT was the bert-as-service repo. The readme for that is just excellent in terms of getting you set up. What the repo does is it creates a server-client arrangement that uses TensorFlow Serving in the background, I think, to serve inferences from the BERT model. You clone the repo, you download a copy of the BERT language model from Google, and you follow the instructions on the repo, and within 30 minutes you'll be serving BERT vectors from sample sentences. Then it's just a matter of figuring out how to transform the text that you have. If you've got a CSV of text records and some label that you want to apply to them, you can throw text records in, you know, 200 at a time into this bert-as-service and get back chunks, and build up a dataset of BERT vectors that way. Then you've got your CSV of BERT vectors and labels, and you can run a classifier algorithm on that dataset. It's not too bad if you use that bert-as-service repo.

And the Google language model, I think there are two versions. Can you talk a little bit about the motivation for which one to use and how that affects the development process?

I elected to try to get this all working on my local machine, just because I was at that stage in my development of technical skills, and so that meant using the regular-size BERT model, not the large model. Even with just the regular BERT model, I think that took about four gigs, six
gigs of memory, maybe even eight gigs, and I couldn't even load the large BERT model into memory. So I think I would have needed a GPU for that, and that means setting up, for example, an EC2 instance with a GPU in AWS, which we might experiment with, but at the time...

So through all this research you were doing, we uncovered that BERT was a great tool for some of our use cases, in particular for what is essentially a custom use case, or more of an on-demand use case. Contrary to a lot of machine learning problems, we didn't necessarily know our dataset to begin with, but we wanted to bring the power of machine learning to the table.

I think this is where the serverless part of the machine learning pipeline comes in, right? So the first part of the pipeline was to just be able to take a dataset that we're interested in, convert it into a text dataset, convert that into BERT vectors, and then get a model from that, but to do it serverlessly so that we can let lots of people do the same thing. We did it with AWS service components.

Yeah, so I think that's an interesting point about where we've gotten so far: we knew we wanted to, in a certain way, democratize machine learning and allow people to train their own intents. So it's really neat that we can allow those people to train up some models without too much machine learning background. What do they really need to know if they want to use this tool?

I think they need to know a little bit about what constitutes a good dataset. If they can put together a CSV where they've got two columns, where one column is a column of text and the other column is the label, if they can do that much, construct a dataset that looks like that, you know, starting with a couple hundred rows would be good. More data is better, depending on how many different labels there are. If they do that much, that's probably enough to get a model out of the system.

So historically, I don't know that there's a hard and fast rule on this, but there were a lot of
rules of thumb and heuristics. Not too many years ago people would say that to do any NLP problem that's non-trivial you needed about a hundred thousand examples. How come you're saying we only need a hundred now?

Oh yeah, that's the power of transfer learning. We've got this extremely powerful BERT language model that does a lot of the legwork in terms of just generally understanding how the intents work and what the key features are that differentiate different pieces of text. So because we generate great features automatically using BERT, the datasets can be smaller.

Thanks to this week's sponsor, brilliant.org. I assume most of you are already on it, checking out the problem of the day and stuff like that. So today, think about the people in your life who might not yet be on brilliant.org, and do yourself a favor: knock someone special right off your holiday to-do list. Give the gift of Brilliant by visiting brilliant.org/dataskeptic. Give the gift of a Brilliant premium subscription. For me personally, this would have topped any gift I ever got around the holidays. Brilliant is a fun way to nurture curiosity, build confidence, and develop problem-solving skills. Brilliant's thought-provoking content breaks up complexities into bite-size, understandable chunks that will lead you from curiosity to mastery. So head over to brilliant.org/dataskeptic. Help spark a lifelong love of learning at brilliant.org/dataskeptic.

So you took on this challenge of putting together an essentially no-code tool to build machine learning models, where people can provide some formatted file, like the CSV with the two columns, and say "here are all my examples" and essentially hit a button and say go. Let's get into the implementation of that. While the user might not have to worry too much about it, algorithmically, what's happening under the hood?

Say that they have the CSV and say it's located here. Then we run a series of checks to make sure that, okay, did they say where the object is located? Do we have access to it? You
know, just basic checks like that. And if all that checks out, we begin this batch job generating BERT vectors for each row. We're expecting each row to have a text column and a label column. We transform the text column into BERT vectors row by row, and once it's fully transformed, we throw that into an XGBoost algorithm. It could be another algorithm, but XGBoost has given us great results so far. We do five-fold cross-validation. Once that model is built, we pickle it and put it into an S3 subdirectory that we can retrieve for them later when they want to actually use the model.

So the training pipeline begins with the drop of the CSV, then there are some updates, and at the end of it out pops a nice machine learning model. What does that model then take as input, and what are its labels?

When they want to use the model, we'll load it into either a Lambda function or an instance within an ECS cluster and let them make inferences from the model. What they're going to give us are snippets of text, and then we'll have a way of transforming that text into BERT vectors in an online fashion and pass those online transformations, the vectors, into the pickled text classifier model that we built for them. Out comes a label, based on the labels that were given to us in the original CSV, of what it infers that piece of text should be labeled as.

So one of the challenges you had was that this had to be a general-purpose tool. You didn't know the intents in advance or anything like that. That works well, obviously, for anyone to do this kind of no-code real-time training. What are some of the ideas you're thinking about if we want to provide a little bit more complexity and customization? What are some of the levers and knobs that you might want to pull if you were trying to train a more specialized model for yourself?

A popular approach for improving performance on a dataset is to augment the dataset. Like with image
classification, a common approach to increase the size of a dataset almost artificially is to do shearing transformations on the images in the dataset. That just means stretching the images and kind of warping them. What you could do similarly would be to take a model like GPT-2, a generative model, where it doesn't give back vectors, like numbers, it gives back text, kind of a reply to your text. You could augment your dataset with the replies from a generative model like GPT-2 so that you have more instances of each class for your labels. That could kind of automatically improve the performance of your classifier model.

Yeah, very interesting idea. The other thing I want you to keep thinking about, and we'll probably have you back on the show when you've figured it out, is how can we make this even easier by taking away the need to provide the intents?

Hmm, yeah, that's interesting. Like some kind of clustering approach.

Exactly, yeah. Let's talk a little bit now about the implementation. I think anyone who is familiar with machine learning in any hands-on way whatsoever knows how to do a fit and then use a predict method. Let's start from there. What are the technology and implementation steps?

Getting out of Jupyter notebooks, I would say, is the first step, and running Python scripts instead of running it out of a Jupyter notebook. That goes a long way towards reproducibility. The next step for reproducibility, after you've got a bunch of Python scripts that can do what you need to do in terms of generating BERT vectors or a classifier model, is to get that working inside of a Docker container. It's super important to get things running inside of a Docker container, because that means you can expect to run that same process in the cloud in a reproducible way, or on other people's machines too. So once you've got a Docker container that can run the same machine learning pipeline that you're running in your
Jupyter notebook, you've won half the battle, in my opinion. The last half of the battle is being able to spin up that process on demand in the cloud, and for us it was using AWS Batch as the container orchestrator for figuring out how to manage all the requests for running that Docker container that you built with your machine learning pipeline inside of it.

So AWS Batch has been around for a little bit, but it's a relatively new and not super widely known tool. Can you describe what that is?

Yeah, so AWS Batch, the way I think about it, is basically a serverless system, in our case for building a model, but it could be for all kinds of Docker processes. It takes a bunch of requests, manages those requests inside of a queue, and then goes through that queue of requests to assemble the compute resources that you need for running that Docker container. For BERT, you need a bunch of RAM to load the model into memory. It can take a little bit of time to assemble those compute resources, but Amazon does that for you. You don't have to worry about managing that. Once it's assembled the compute resources, it loads your Docker container that has your machine learning pipeline process inside of it and runs it. Inside that process we're loading the result of the model building process, the pickled model, into S3 so that we have the results of the modeling. Oh, and also we're writing to a DynamoDB record throughout the whole process, so that if something goes wrong we know what happened, and when things go right we can report key metrics about the model inside of DynamoDB.

And just to summarize, AWS Batch is basically a service for spinning up EC2 instances, running a Docker container inside them, and spinning them back down. So that's a little weird in a sense. I mean, EC2 is a server. Why do we get to call this a serverless process?

Yeah, no, that's a good point. I don't want to say it's a misnomer, this whole serverless word, but you know, it doesn't mean
there's no server at the end of the day. When you first hear the word it could be confusing, kind of like how we use the term wireless, almost. There are servers, you just don't have to worry about managing them. At the end of the day there's a machine in a compute center, what are they called? Yeah, your code is running on a machine inside there. So there's a server. It's just serverless in the sense that you don't have to think about provisioning or managing that server.

And why was AWS Batch an interesting choice for you in this project?

It's able to assemble larger compute resources compared to something like AWS Lambda. Lambda is like the poster child of the whole serverless movement. There you're running functions as a service, but it's got these memory and execution time limits that make it ill-suited for machine learning model building processes, because those processes can take a while. Especially when you're dealing with these huge language models that we use for transfer learning, the memory limits become an issue quickly. We needed big compute resources to take advantage of these huge language models, and Batch is the offering from AWS that lets you request things like two virtual CPUs and ten gigs of RAM. That's why we went the Batch route versus trying to do it all with Lambdas. So AWS Batch can give you access to tons of compute if you need it. If there's a burst of activity, lots and lots of people interested in using this machine learning pipeline service, then AWS can scale to meet that demand. But likewise, they've managed things so that they can scale all the way down to zero too. So if there's no demand for whatever reason, like people are just toying around with the models that they just built and so there isn't much activity, they can scale these instances, these EC2s, all the way down to zero so that we're not being charged for it, which is great.

Yeah, that's a very exciting part of it for me, because especially
the usage, and I know I haven't shared some of this data with you, but the usage pattern we see is that a single user will come on, build a bunch of models, and then go away for a while, then come back and do that all over again. So we do see that bursty sort of aspect, and we're anticipating more of that. So that, and especially the spin-down if no one's using it in the middle of the night, made it a great solution for us, I think.

Yeah, for sure. I mean, the instances get pretty big when you load these language models into memory, so having them on all the time could run up the bill pretty quickly.

So, cool. To wrap up, maybe let's run down the toolchain we used here and just comment on where each piece fits in the pipeline. You'd already mentioned Jupyter. What role does it play?

Jupyter was great for the EDA side of things, like when we were first looking at how the BERT pipeline was going to compare to a TF-IDF approach, and visualizing the distribution of BERT vectors, things like that. Just the initial part of getting the project to work and looking at the initial data, Jupyter was great for that.

And I mean, it goes without saying, and we've covered AWS Batch. How about Docker?

Yeah, so Docker was great for being able to reproduce the machine learning pipeline on my machine on demand, on your machine on demand, and then to have AWS host the process and reproducibly begin the process in the cloud.

What about storage? Where does everything live? Do we have metadata on the models?

For the model storage, it's all S3. And then for recording what happened during the model building process, both while it's running and once the model's built, we have individual records for each model building process in DynamoDB. That was great for development, actually, because we had a nice system for recording errors. Sometimes visibility is a little bit difficult when you're working in the cloud and trying to get things running in the cloud.

Oh, that's a whole other show, man.

Yeah, but short story:
DynamoDB for all the record-keeping, for sure.

And last but not least, Terraform.

Oh yeah, we didn't even talk about Terraform, but that's been a fantastic tool for managing all these serverless components.

I don't know that everyone will know what it is. Can you give a quick definition before how you used it?

It's kind of like Docker, almost. With Docker you have these Dockerfiles where you write out all the commands that you would need to run in order to build your operating system and the applications you need for running the process. Similarly, with Terraform you've got these Terraform scripts, and they're basically these big configuration files that specify the AWS components that... or it doesn't have to be AWS, it can be...

Yeah, quick footnote on that. The whole show we've been talking about AWS, not because it's a commercial, but because that's what we happen to use. Pretty much all of this we could have done anywhere. We're vendor agnostic. But anyway, sorry to interrupt.

Yeah, so you list out all the resources that you want to build in your cloud provider, you run this Terraform script, and Terraform knows how to build that out in your cloud account. What's great is you can do that reproducibly. So I had this awesome experience where I worked really hard to get this Terraform configuration file ready for the machine learning pipeline, and I had it working on my development AWS account. But then we decided it was time to move it into production, and so I just sent you this Terraform script, the repo that we used for the pipeline, and the Dockerfile. Between those three things you had it in the production system working in about an hour, and that was just, I think, the time it took to build out all the resources in the cloud and get them loaded. So it was just this magical moment for me where I saw the power of the infrastructure, like the term is "infrastructure as code". I saw the power of having it all written out in code. At that moment it was pretty
cool.

Yeah, thanks to HashiCorp for helping change management engineers lose their jobs everywhere. No, but in all seriousness, it's standing on the shoulders of giants. That's a powerful tool.

Yeah.

Okay, well, to close it out, Alex: you've been spending all this time building and training models. Now that the project is pretty much complete, what are your thoughts on a good use case? What would you like to build for real this time, using the tool you've built?

Oh yeah, I haven't spent a ton of time thinking about this, but it would be fun to create a bot that, to go back to the Reddit comment moderation... just the first thing that comes to mind is to build a bot that can automatically moderate message board channels. It could be Reddit, it could be something else, but some kind of way to help manage and enforce the rules of a message board. I think that would be a lot of fun.

Very cool. Yeah, let's explore that and see what happens. Anyways, Alex, it's been a tremendous pleasure working with you on this project and many others. Thanks for coming on the show and sharing your experience.

Yeah, likewise, Kyle, it's been a pleasure. I expect there will be many more adventures to come, and thanks.

[Music]
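The front half of the pipeline described in the conversation (check the two-column CSV, then convert text to vectors a couple hundred rows at a time) could be sketched roughly as follows. This is an illustration only, not the code used on the project: the column names `text` and `label`, the helper names, and especially `embed_batch` are assumptions. `embed_batch` is a placeholder that derives fake vectors from a hash so the example runs without a BERT server; a real pipeline would call a sentence encoder there.

```python
import csv
import hashlib
import io

REQUIRED_COLUMNS = ("text", "label")  # assumed header names, not from the episode
BATCH_SIZE = 200  # the conversation mentions sending about 200 records at a time


def validate_rows(csv_text):
    """Parse CSV text and check that every row has a non-empty text and label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        text = (row["text"] or "").strip()
        label = (row["label"] or "").strip()
        if not text or not label:
            raise ValueError(f"row {line_no}: empty text or label")
        rows.append((text, label))
    if not rows:
        raise ValueError("no data rows")
    return rows


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_batch(texts, dim=8):
    """Hypothetical stand-in for a BERT encoder: hash-derived pseudo-vectors."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


def build_dataset(csv_text, batch_size=BATCH_SIZE):
    """Validate the CSV, then embed it batch by batch into (features, labels)."""
    rows = validate_rows(csv_text)
    features, labels = [], []
    for chunk in batches(rows, batch_size):
        features.extend(embed_batch([text for text, _ in chunk]))
        labels.extend(label for _, label in chunk)
    return features, labels
```

Calling `build_dataset(csv_text)` returns one vector per row plus the matching labels, which is the shape of dataset a classifier would then be trained on.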

Original Description

Alex Reeves joins us to discuss some of the challenges around building a serverless, scalable, generic machine learning pipeline. This is a technical deep dive on architecting solutions and a discussion of some of the design choices made.



This video provides a technical deep dive into building a serverless, scalable, generic machine learning pipeline for NLP model training, discussing architectural solutions and design choices. Viewers will learn how to architect a serverless ML pipeline and design a scalable NLP model training architecture. The video is relevant for intermediate learners in the field of machine learning.

Key Takeaways

Define the requirements for a serverless ML pipeline
Choose a suitable serverless platform
Design a scalable NLP model training architecture
Implement data preprocessing and feature engineering
Train and deploy the NLP model
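As a rough illustration of the last two steps, the sketch below runs five-fold cross-validation, serializes the fitted model with `pickle`, and reloads it for inference. Several things here are assumptions rather than what the episode describes: a nearest-centroid classifier stands in for XGBoost so the example needs only the standard library, accuracy stands in for AUC, the serialized bytes stand in for an object uploaded to S3, and all function names are hypothetical.

```python
import pickle


def fit_centroids(X, y):
    """Stand-in for model training: compute the mean vector of each label."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, X):
    """Assign each vector the label of its nearest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda label: sq_dist(vec, centroids[label]))
            for vec in X]


def cross_val_accuracy(X, y, k=5):
    """Plain k-fold cross-validation: every k-th row is held out in turn."""
    if len(X) < k:
        raise ValueError("need at least k rows")
    scores = []
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        test_idx = [i for i in range(len(X)) if i % k == fold]
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def train_and_serialize(X, y):
    """Return (pickled model bytes, cross-validated score)."""
    score = cross_val_accuracy(X, y)
    blob = pickle.dumps(fit_centroids(X, y))  # in the cloud this would be stored
    return blob, score


def predict_from_blob(blob, X):
    """Inference side: load the pickled model and label new vectors."""
    return predict(pickle.loads(blob), X)
```

Keeping training and inference as separate functions that communicate only through the serialized bytes mirrors the split in the episode between the batch training job and the service that later loads the model. Note that unpickling is only safe for data you produced yourself.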

💡 Building a serverless, scalable, generic machine learning pipeline requires careful consideration of architectural solutions and design choices to ensure efficient and effective NLP model training.
