Hugging Face Infinity Launch - 09/28

HuggingFace · Beginner ·🧠 Large Language Models ·4y ago

Key Takeaways

Introduces Hugging Face Infinity, a new inference product for Transformer models with 1ms latency

Full Transcript

Hey everyone, thanks for tuning in. I think we have over 200 people here in the house. We're going to leave a couple more minutes for people to show up and join us, so thank you very much for being early birds. While people join us, we can do a little round of introductions. I'll start: I'm Jeff Boudier from the product team here at Hugging Face. I'm happy to work with companies to accelerate their machine learning, and super excited to show Infinity to you today. Federico?

Well, I'm Federico Pascual, I'm from the marketing team, and I'm pretty excited to tell you more about our new inference product. Morgan?

Hi everyone, happy to be here. I'm a machine learning engineer working on optimization. And Philipp?

Hi everyone, I'm Philipp. I'm a tech lead at Hugging Face and also tech lead for Infinity, and happy to demonstrate what we have created today.

Awesome. I see we have people joining from all over the world. Don't hesitate to say hi in the chat and say where you're from. Super excited to see you all joining. While people continue to join, I'll give a few words of context about Hugging Face. I'm sure most of you are familiar with us, but to give you the big picture: our mission as a team is to democratize state-of-the-art machine learning. We do this through open science, through our open source contributions, and through our products and services. And I'm super excited to show Infinity to you today.

Hey Greg, hey Alexander, hey Peter. We've got New York, Germany, Bay Area, Bangalore, Spain, Germany again, Boston, Berkeley, Minneapolis, Amsterdam, Korea, Italy, Boulder, Munich, Turkey. Awesome, thank you so much, all, for joining.

All right, maybe one more minute. I'm sure you're familiar with Transformers. Today there are over 16,000 state-of-the-art Transformer models available on our website at huggingface.co, or hf.co for short, and 1,600
free and public datasets that you can access through our open source and through our machine learning platform, so feel free to check it all out. I'm going to start sharing my screen, and then Federico, you give me the go-ahead to start. All right, whenever you're ready, you let me know.

Yeah, let's go, let's start.

Sweet. All right, let's get this show on the road. Before I get too deep into it, I just want to restate what it is that we want to show you today. Infinity is a containerized solution that you can deploy in your own infrastructure, whatever cloud you're using, whatever production environment you have, so you can deploy fully optimized and accelerated inference pipelines for Transformer models. So now you know what it is all about. In this session, I'm first going to show you what Infinity is, why we built it and how it works, and then Philipp is going to walk you through a live demo of the service to show you how you can deploy Infinity in your own environment. Then we'll have a little time for Q&A.

All right, let's get to it. Let's start with a little bit of a guessing game. Here it is: what do Tesla, Gmail, Facebook and Bing all have in common? There are a lot of possible answers, but the one I'm interested in today is Transformers. It's through Transformers that Tesla drives the car, that Gmail completes your sentences, that Facebook automatically translates the posts from your friends, and that Bing answers your questions when you ask them in natural language. And there is another piece to this: all of those services run billions of predictions on Transformer models every day, and that's a huge technical and engineering feat that big tech has been able to achieve. What Infinity is all about is enabling this for the rest of us. We want every single company in the world to be
able to take advantage of Transformer models, of the accuracy of Transformer models, but in a way that they can deploy at scale, as efficiently as possible, to enable real-time use cases and high performance, as Facebook, Tesla, Google and Microsoft have been able to achieve.

What we've seen, discussing with hundreds of companies that use our open source every day, is that Transformers today are still at the research and development stage for a lot of companies out there: evaluating the models, building new use cases based on those models. There are still issues that prevent those models from being deployed in production, on customer data, for everyday real-time use. Those issues are the efficiency, meaning the time it takes to make predictions on those models, and the difficulty of building that into infrastructure that scales. So today, if you want to do that as a machine learning engineer, you can either use black-box solutions, where you cannot bring your own model, fine-tune the model or understand what's under the hood, or you can start from scratch and rely on the great open source projects and great cloud machine learning platforms out there. Then you have to cobble a solution together, and that's really, really hard to get right, and to get the latency down to that sweet spot of 20 to 50 milliseconds per prediction that really enables scale and real-time use cases.

I see three layers of engineering challenge in doing so. The first one is that Transformer models are large models. They're hard to deploy, with special memory requirements and special hardware requirements, so the deployment of the model is the first challenge to tackle. But then, if you want to get better performance, you have to actually edit your model. You have to do things like compressing the model, pruning, quantizing, and all the methods through which you can accelerate the performance and efficiency of your model directly.
But then, to squeeze out the best possible performance and get the accuracy benefits of Transformers while having a speed that is compatible with large-scale workloads, you have to go all the way down to the hardware and understand, for the actual CPU or GPU that you're going to be running the model on, what operators are available. That low-level optimization is what enables the performance that we're delivering.

When talking to companies, we observe that it typically takes two months for teams of very highly skilled engineers to deploy Transformer models and accelerate them to get that level of speed. These types of experience and skills are very rare within companies, so it's a big challenge for organizations. Because we at Hugging Face are at the center of the ecosystem for Transformer models, our users are looking to us for solutions. Whether it's the financial services industry, the healthcare industry or the consumer tech industry, they're all coming to us with requests to accelerate their models. And because we are at the center of the Transformers ecosystem, we also benefit from deep collaboration with leading open source organizations that enable us to tackle these challenges at the very lowest of levels: through our partnership with ONNX Runtime, and through our partnerships with Intel and with NVIDIA.

So now, how does it all work? This is what it looks like in a nutshell. Infinity is a plug-and-play solution that delivers Transformer accuracy at millisecond latency. What it looks like is a container, a Docker container, an Infinity container, that wraps all the optimized logic to deliver an end-to-end inference pipeline, including the pre-processing, the post-processing and all the model optimization around your model. It provides you with easy-to-use HTTP endpoints to run predictions and monitor performance. It also comes with what we call the Multiverse container, which is a self-serve
service for you to optimize your own model to make it compatible with the Infinity solution, so you can update and retrain your models and still work within your environment, without having to share your model with us and without having to send data out of your environment. Infinity runs in every environment where you can run a Docker container, whether it's AWS, Azure, GCP, even SageMaker, or your on-premise data center: anywhere a Docker container can run.

To sum it up, Infinity is an optimized solution delivering end-to-end inference pipelines, leveraging all the state-of-the-art techniques that we've pioneered through our science and open source contributions. It is fully deployable in any environment, in a single container that you can scale like any containerized solution. What's really unique about Infinity? First, it's the performance, right? That's the reason why we went out to build it. But it's also a solution that is completely flexible: you can bring your own model, and you can fine-tune your own model to work within Infinity. It runs within your own environment, so it's enterprise-ready and secure out of the box, and you're in control of it. You control your model, you control your data, you control your production environments.

So what can you use it for? We really wanted to focus on the largest-volume use cases that we saw through usage of our open source. Today that means focusing on the model architectures, and really working deep on those specific model architectures to be able to provide that acceleration. Semantic search is a huge use case, where we see BERT-like models being used to extract embeddings and do ranking, and then all the tasks that you can perform through BERT-like models, from text classification to entity recognition, sentiment analysis and other tasks.

So let's look at the results we're able to achieve. One millisecond: that is how low we're able to pull down the latency on a
BERT-like model on GPU. This is really breakthrough performance that will enable a whole new breadth of use cases for the industry, not just for big tech but for every single company out there. And that's not all: on CPU, we were able to achieve three milliseconds for BERT inference latency, which is amazing. That means that you can run huge workloads and have a great cost of operation for your infrastructure. I'm really excited for Philipp to demonstrate that to you in a few minutes.

But before that, let's talk about the amazing partners with whom we've worked over the past few months to really hone in on the Infinity solution. Otto, one of the largest e-commerce companies in the world, was looking to Hugging Face to build a new semantic-search-based experience. We were able to achieve two milliseconds for the latency of their model, and we have a little tape for you. Philipp, if you want to roll it.

Hi, we are Phillip and Jens from Otto. Since November 2020, we have been working as part of team Archie to explore new opportunities in information retrieval systems. We were able to reach around 2.3 milliseconds for the feature extraction task and around 2.4 milliseconds for the ranking task per data point. This is about 100 times faster than a non-optimized model on CPU and 10 times faster than on GPU.

Awesome, and thank you so much, Jens and Phillip, for working with us on this. There is a lot more of Otto's story for you to discover: if you go to our landing page at huggingface.co/infinity, you'll learn more about how Otto has been leveraging Hugging Face and Infinity to achieve those results.

Next, I want to give a big shout-out to Pete at Moneypenny. Hey Pete, I think you're over here. With Infinity, Moneypenny, the largest outsourced conversation service in the world, were able to automate call transcript classification using their BERT-like models, to the tune of four milliseconds per call. So Philipp, if you can roll the tape, let's hear what Pete had
to say.

Hi, my name is Pete Hanlon and I'm the CTO at Moneypenny. We are the number one outsourced communications provider in the world. We were able to process a year's worth of conversations in just over an hour, which is a massive improvement in performance.

Awesome, thanks Pete. And again, there is a lot more from Pete and Moneypenny if you go to our website at huggingface.co/infinity.

Thanks to our pilot partners, we were able to really hone in on what you can do when you're able to increase the performance and decrease the latency 10x like this. It's a great engineering feat, but if I'm a product manager, what does this do for me? The first thing: if you're managing infrastructure and you have large-scale workloads, maybe you use 20 GPUs on a constant basis to run that workload. Well, today with Infinity, chances are you only need two of those GPUs, and you can regain all that infrastructure, all that compute, to power new features. The second thing you can do with a 10x drop in latency is to increase the scale at which you're able to leverage Transformers. Maybe today you use them to classify customer tickets. Well, with Infinity, you're able to reach the scale that enables you to power many more features with Transformers, because you now have 10 times more throughput. And the last thing is speed. A 10x drop in latency means that experiences that were not possible before are becoming possible, and any customer-facing use case where latency is really important becomes possible through Transformers.

All right, now you know why we built Infinity and how we built it, and here's the question: how can you get your hands on it? So what's available today? We're launching Infinity in general availability with a wide set of model tasks and supported hardware. In terms of models, there are seven architectures that are available out of the box, for BERT, DistilBERT, RoBERTa and MiniLM downstream tasks. In terms of tasks, you can use Infinity today to do embedding
extraction and re-ranking, which are instrumental to power semantic search, and you can also use Infinity for sequence classification tasks. We are working on token classification; it will be available soon. Let us know if there is a task in there that you need.

And then the last thing is hardware. I mentioned our partnership with Intel and how we're able to really drill down to the silicon to extract as much performance and efficiency as possible. We have a great solution that is optimized specifically for the latest generation of Xeon CPUs, the Cascade Lake CPUs, so we recommend running your CPU-bound workloads on those. But Infinity also supports Skylake and earlier generations of Xeon CPUs, and we're working with Intel to enable a great solution with new performance gains on the Ice Lake generation. And then of course, on GPU, the Infinity solution is optimized to take advantage of the Tensor Core technology of NVIDIA GPUs.

And how do you get your hands on it? Well, you should request a trial today. The URL to do so is huggingface.co/infinity-trial. We will just need some technical information to understand which task you're targeting, which model architecture you want to put in production, what hardware you have available, and which production environment you plan on deploying Infinity in. The main thing here is that you should do so as soon as possible, because our sales team is going to get back to you to schedule technical conversations on a first-come, first-served basis. So yeah, request a trial today at huggingface.co/infinity-trial. And with that, I'm going to hand it off to Philipp, who's going to walk you through a live demo of how you can deploy Infinity in your own environment.

All right, thank you. Before we start, I would like to show you what an Infinity repository looks like. It's pretty similar to what we already have with all
the Transformer models. An Infinity repository consists of a config.json, which contains all the configuration we need to run the Infinity container on your dedicated hardware, an Infinity model.bin, which is basically the counterpart to the pytorch_model.bin, and the tokenizer. In this case it's a MiniLM model fine-tuned for SST-2 and already optimized for Infinity, which we are going to run on a GPU. For that I just need to switch the screen.

Okay, so I'm on an EC2 machine with an NVIDIA, or not with an NVIDIA. Okay, then let's start with the CPU. We will look into the GPU later, when my console is working again. With the CPU, we are also achieving, as Jeff said, around three milliseconds per task. In this case I already optimized the model in the temp directory, which contains our Infinity model.bin, our tokenizer.json and our config.json. To run Infinity, you can basically use docker run, and that's all you need. If you want to run it on a Kubernetes cluster, you just need the Kubernetes configuration for it, but on a normal virtual machine you can use docker run. Infinity provides options to use remote storage. In our case we are using a file system mounted into our container, but Infinity supports storage options for the Hugging Face Hub, Amazon S3 and Google Cloud Storage.

After a few seconds, our Infinity container will be started. For CPU, you can see there that we are also outputting the OMP_NUM_THREADS, to see whether the container is initialized properly, and as you can see, we are not running on the biggest CPU out there. In addition to this, we have created... maybe I can open another tab.

And by the way, keep the great questions coming. We're responding to as many as we can during the demo.

Okay, that's bad, I'm on the wrong machine. But I guess I can start it in detached mode and then we can see the request time. One second. Okay, now the
Infinity container is running, and what we have done is create a Python script to show you how you can run your requests against it. It's a simple Python script with an input of 16 tokens in length, just iterating over it until we are done. Or not. That's bad. Okay, let's go back to the GPU, and that should work. Sorry for that.

So we have a GPU here with Tensor Cores, where we can run our Infinity container with the model from the Hugging Face Hub, and we can connect to it from an additional machine with SSH, and we should see our Python script here as well. Perfect. Okay, just one second for the Infinity container to be started, and then I can show you the one millisecond for Infinity on GPU. It's running now, so we can execute our request script with python3. Here we can see our requests coming in: on the left side just logging, and on the right side, at the top, an example output with the probabilities for our task. We are running on SST-2, so it's sentiment classification, and we can see the 1 millisecond to 1.5 milliseconds for our input. Some of you might say, "Yeah, of course, it's just 16 tokens, that's why you're that fast." Therefore we prepared another script with a sequence length of 128. Here you can see the input, which is just a text about Hugging Face, and for a sequence length of 128 we are achieving 2.5 to 2.6 milliseconds on GPU. While I'm setting up the last one, I'll hand back to Jeff to answer a few questions.

Thanks so much, Philipp. Wow, 2.5 milliseconds on GPU, that's really amazing. Okay, so let me take some of the questions that I saw. First, I have a technical question that I'm going to hand over to Morgan, from "cdart": any limits on max token length?

Yeah, the token length is mostly ruled by the model itself. If you take any BERT model, for instance, they are mostly trained for 512 tokens. We will be looking at enabling some longer-range models in the future, but at the moment BERT is
always 512.

Nice, thanks. And then I have a question from Vamshi, also for Morgan: do you also support models with decoder blocks? What sort of performance gains should we expect for GPT and BART?

Yeah, so that's what we have on the roadmap at the moment. Currently, as I said, we are mostly looking at BERT-like architectures, so autoencoders. We will of course look at sequence-to-sequence models, and that requires a little bit more engineering work. If you take one inference call, you can expect something around the same order of magnitude that we have for BERT.

All right, one more for you, Morgan, a question from Brian: what CPU or GPU was used to achieve the one-millisecond and three-millisecond latency with BERT?

For the GPU part, we were using NVIDIA T4 GPUs, which are available on most cloud providers for inference. For the CPU part, I think we're using M5 instances on AWS, or C5, something like this, which provide all the required knobs to enable very fast inference.

Nice, and I'm going to take the easy one: does it support semantic search right now? Yes, absolutely. You can build your semantic search upon the feature extraction task that is implemented out of the box in Infinity, and the re-ranking task. So yeah, it's fully supported today.

So I'm back with the CPU. Sorry for the inconvenience, but somehow my terminal got disconnected. I already started the Infinity CPU container. Getting back to the question about which CPU we use: we are running on an m5.large instance using two vCPUs, with a Cascade Lake CPU under the hood. You can see the Infinity API has started properly for my fine-tuned SST-2 model, and again I can use the Python request script to achieve 2.4 to 2.5 milliseconds. The input is "This live event is great, I will sign up for Infinity." And to show you that the CPU performance is also not getting very bad when scaling up, we
have another example script, again with our longer sequence. There you can see we are achieving under 10 milliseconds on CPU, with two CPU cores, for a sequence length of 128. Okay, I guess, Jeff, you can take over again.

Yes, I think I need to address the pricing question, because I see many people asking about it. We want to make it super simple, and we think the best way to do that is to price based on the number of containers that will be needed to run the workload that you're planning for Infinity. Through the trial (and sign up for that trial at infinity-trial), we'll be able to work with you to provide you with a container that works specifically for your task. Through that trial, you'll be able to estimate how many containers you will need, considering that one container means one model running on one machine, so it's basically capacity. We'll set that up, and our sales team will build that into a yearly contract, so you can use Infinity in your own environment, on your own tasks. We don't have public pricing and I'm not going to get into that, but just to set orders of magnitude, the minimum amount for a contract will be around 20,000 a year for one container, one CPU.

Any other questions? I have a question about how Infinity semantic search compares to Jina. Actually, these solutions can work in full symbiosis, right? You can use Infinity to build the vector representation of your text and index that within a search engine, be it Jina, Elastic, or whatever you're planning to use. Through Infinity, you're able to build this solution upon Transformer-based models at unprecedented speed.

I had some questions around tasks: do we plan to support image Transformers? Do we plan to support machine translation? So yes, as you know, Transformers have conquered NLP but are fast expanding to other modalities, showing state-of-the-art performance in
speech and computer vision tasks like image classification and object detection. So yes, absolutely, we plan to support every downstream task that the Transformer models implemented in the Transformers open source library can support. We won't get there instantly, it will take us some time, but we plan to add every single task that is supported by the library to the Infinity solution.

I see a question about fine-tuning: is model fine-tuning done in a similar way to how we do it with the Transformers library? I think the short answer is yes, but Philipp, do you want to expand?

Yeah, so basically you can still fine-tune your model as you normally would. You can use Transformers and the open source libraries to fine-tune your model, and afterwards you can use the Infinity Multiverse container to optimize and convert the model for the Infinity container, and then plug and play. So you don't need to change your fine-tuning pipelines and fine-tuning scripts to use Infinity at the end.

All right. Is there any other question that you guys want to take on? I see some great questions over there. Thanks, everybody, for joining. I hope that you are as excited as we are about the potential for Infinity to democratize the use of Transformer models in production at companies. We're excited to take on that adventure with you. If you want us to get back to you and set up a trial, please sign up as soon as possible. We're going to be taking in those requests on a first-come, first-served basis, so the earlier you do, the earlier we'll be able to accommodate you and set up that technical discussion. Thanks, Federico, for sharing that link, so it's just one click away.

And I will take one more question: is there going to be a video of this presentation? Yes, there will be a video of this presentation. To request it, you can hit us up at infinity@huggingface.co. All right, and so with this, I think we can wrap it up, and
thanks, everybody, for joining.

Thanks, everyone. Bye!
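
The request script used in the demo is not shown in the transcript. As a rough illustration of what such a latency check could look like, here is a minimal sketch in Python. It makes several assumptions that do not come from the talk: the endpoint URL, the `{"inputs": ...}` payload shape, and the helper names `post_json` and `measure_latency_ms` are all hypothetical, and the real Infinity API may differ. The sketch only shows the general technique: warm up, time repeated requests with a monotonic clock, and report the distribution in milliseconds.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable, Dict, List


def post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """Send one JSON POST request and return the decoded JSON response.

    The URL and the payload layout are assumptions for illustration only.
    """
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def measure_latency_ms(
    send: Callable[[dict], object],
    payload: dict,
    runs: int = 100,
    warmup: int = 10,
) -> Dict[str, float]:
    """Call `send(payload)` repeatedly and summarize latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls are not timed, so connection setup does not skew the numbers.
    for _ in range(warmup):
        send(payload)

    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        send(payload)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "p95": timings[p95_index],
        "max": timings[-1],
    }


# Example usage against a locally running container (hypothetical URL and payload):
#
#   url = "http://localhost:8080/predict"
#   payload = {"inputs": "This live event is great, I will sign up for Infinity"}
#   print(measure_latency_ms(lambda p: post_json(url, p), payload))
```

Numbers measured this way include the HTTP round trip, so they are only comparable to the figures quoted in the demo when the client runs close to the container, as it did there.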

Original Description

On this live event, we shared for the first time in public details about our new inference product: 🤗 Infinity. It achieves 1ms latency on Transformer models 🏎 and you can deploy it in your own infrastructure 🚀 If you'd like to see what 🤗 Infinity can do for your business, you can request a trial today to test Infinity in your own infrastructure 🚀 We will set up trials on a first-come, first-served basis, so the earlier the better! Learn more about Infinity: https://hf.co/infinity Request a trial: https://hf.co/infinity-trial

