LLMOps for eval-driven development at scale

Weights & Biases · Intermediate · 🏭 MLOps & LLMOps · 11mo ago

Key Takeaways

The video covers LLMOps for eval-driven development at scale, drawing on Mercari's experience with LLM apps. It contrasts model-centric, data-centric, and evaluation-centric approaches and argues that the evaluation-centric approach is what gets LLM apps into production. The apps discussed include content moderation, metadata extraction, and AI listing, built on LLM APIs, with a Vertex AI API used for grounding checks and tools such as Weave and Lens used for evaluation.

Full Transcript

Nice to meet you all. This is Shang Min. As I said, I'm based in Tokyo. I work at Mercari, on a team called the LLM team, as a software engineer, but I do a lot of things, from ML to DevOps and other stuff. Today I'm going to talk about LLM apps and LLMOps. But this is not a talk that says "here are the best practices you should follow," because this topic changes a lot. What's true today may not be true tomorrow, right? So I'll just humbly share what we learned from our journey, and you can pick up something from it.

I guess many of you have never heard of Mercari before. We do have a US branch, but we're mostly based in Japan. We started as a C2C e-commerce platform, and now it's the largest C2C e-commerce platform in Japan, but we're rapidly expanding our business horizontally into fintech, crypto, and so on. Our monthly active users are about 23 million, which is 98% of all e-commerce users in Japan. So if you're interested in the Japanese stock market, buy our stock so I can sell my shares.

So let's get to the topic: LLM app development. There are several ways to look at it, but we primarily define three major components. First is the model: of course, if you want to build an LLM app, you need a model. Then the data: user input, pre-processing, post-processing, prompt engineering, and of course RAG over some knowledge base, and these days a lot of MCPs. And lastly, evaluation: before you productionize your app, you need to evaluate it to know whether it's working or not.

With that being said, we can take different approaches. The first is model-centric: you focus on the model, which means you retrain or fine-tune it for your purpose, which makes sense. The second is data-centric.
For example, you embed all your knowledge bases into a vector database and use it for RAG, or you set up a lot of MCPs, or you do a lot of pre-processing and post-processing of user inputs. And the last is eval-centric. This is the point I'm going to make throughout all the slides: in our experience, LLM apps actually require an eval-centric approach.

For example, I think you've heard that LLM apps are easy to demo but hard to productionize. Something like a customer-facing chatbot is very ambitious, not only because of the LLM's performance but also because we don't know what's good and bad, which means it's hard to evaluate. So LLM apps absolutely require an eval-centric approach. What do we mean by building an eval? It means you build something to help you build something. Evals are enablers, so don't neglect building them.

And when you go eval-centric, you need LLMOps. But don't get me wrong. When people talk about LLMOps, they talk about very complex stuff: you need a feature store, a retraining pipeline triggered by performance degradation or data decay, analytics, resource monitoring, and so on and so forth. That's also correct. But what's the point of LLMOps, MLOps, FinOps, DevOps, all the ops? It's faster iteration. You need a certain kind of process to iterate really fast.

So, just like in software engineering, we follow this kind of process when building LLM apps. First, evaluating quality, like the unit tests you write in software engineering. Second, debugging issues by using logs or inspecting data. And last, changing behavior: writing code, fine-tuning, and prompt engineering. Many people focus on number three, but that's actually what prevents developers from improving their LLM products beyond the demo.
Once you have a very good evaluation pipeline, everything else comes along. Up next, I'm going to talk a little bit about the LLM apps we have up and running at Mercari, and also about our team.

As I said, Mercari is a C2C e-commerce company. There are many core teams, but the two major ones are seller experience and buyer experience. At the same time, since it's C2C e-commerce, there's a high probability that a user lists an item that is illegal or suspicious, so we pay a lot of attention to T&S, trust and safety. We also provide fintech and payment services. Beyond C2C items, we also allow business users to list their items, to provide a seamless buying experience to users. And there's cross-border, which means you can buy an item from outside Japan. Alongside these core teams, my team, the LLM team, works closely with them to maximize their business impact, mainly using GenAI.

Here are some LLM apps that are up and running right now. The first one is content moderation. Traditionally we've used rule-based models and ML models for content moderation, but we're trying to combine LLM apps with these ML models, which I'll talk about later.

For seller experience, we have two major features: the metadata extractor and AI listing. What they do is basically the same, but they differ in timing. The metadata extractor is asynchronous: we extract some metadata based on the images and then suggest that users change their metadata if the item hasn't sold for a week or so. AI listing happens when you list an item: you upload an image, and based on that image the AI listing feature detects some context and writes the item title, description, pricing, and metadata. So, as you can see, the metadata extractor is asynchronous, which means we can control the traffic flexibly.
So there we were able to use a fine-tuned LLM. But for AI listing we didn't use a fine-tuned LLM, because when you productionize an LLM, performance isn't everything. You need to consider infra cost, latency, and other things. We don't have many GPUs, and we found that the performance gain from fine-tuning, considering the infra cost and latency, is not that high compared with just using LLM APIs. So we don't use a fine-tuned LLM for AI listing.

For buyer experience, we also have chatbot-driven recommendation, which is a customer-facing chatbot, but to be honest it didn't go to production, because we had trouble building evaluations for this kind of chatbot. We also have a lot of agents for internal use, such as "davosis" [name unclear], which embeds all the internal documents in the company and provides best practices to software engineers to increase their productivity, alongside other productivity tools such as Cursor and Windsurf. We're a Japanese company, but most of our engineers are non-Japanese speakers, so we have a lot of translation agents, and an HR agent, an agent for self-evaluation.

We can see these in two different ways. Problems already solved the traditional way, with software engineering and ML, are easy, not because they're technically easy but because we know what's good and bad: we have very strong ground-truth data. The others are problems we have never solved before, which are difficult because we don't know what's good and bad, which means we need more evals.

For example, let's take a look at content moderation. This is a very simplified version of the content moderation pipeline. When a user lists an item, the ML models pick up the event (there are separate models per category) and decide whether the item is suspicious based on its features. If it's detected as suspicious, an operator makes the final check, which amounts to labeling, and based on the labeled data we retrain the ML models through our MLOps. When operators label those items, they use not only their experience but also a knowledge base, because sometimes it's related to Japanese law. But the ML models don't know about this knowledge base. So we deployed an LLM app that has no fine-tuning, only a system prompt plus RAG over the internal documents. By doing so, it can use the same knowledge base as the operators. It's up and running quite well, and the overall cost has decreased quite a bit, so we expect to expand this LLM app to other categories too. This is a very successful case for us.

Another example is buyer experience, the customer-facing chatbot. As I said, this is quite difficult, just because it's hard to build evals for it. So it didn't go to production; it was validated only at the PoC level.

Another example is AI listing. This is up and running quite well, and it's now one of the core parts of our business. The concept is very simple. One of the obstacles that prevents users from listing an item is that they need to write all the details: item title, description, pricing, and metadata. Instead, when a user uploads an image and selects a category, the system writes the item title, description, pricing, and metadata.

When we built this app, we came up with this pipeline. First, create a prompt. This is true for every application. It doesn't have to be very complicated: you can just write a simple script that consists of for loops over a combination of several models and prompts, which these days is called a vibe check. Then you go to the next level, which is automatic evals within the team.
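A minimal sketch of the kind of for-loop vibe check described above. The model names, prompts, sample items, and the `call_model` stub are all hypothetical placeholders, not Mercari's actual setup; a real script would replace the stub with a call to an LLM provider.

```python
from itertools import product

# Placeholder model names, prompts, and samples; none of these come from the talk.
MODELS = ["model-a", "model-b"]
PROMPTS = {
    "short": "Write a Japanese title and description for this item.",
    "detailed": "You are a listing assistant. Write a Japanese title, "
                "description, and item attributes for this item.",
}
SAMPLES = [
    {"image": "shoes.jpg", "category": "fashion"},
    {"image": "card.jpg", "category": "trading cards"},
]


def call_model(model: str, system_prompt: str, item: dict) -> str:
    """Hypothetical stand-in for a real LLM API call; swap in your own client."""
    return f"[{model}] draft listing for {item['image']} ({item['category']})"


def vibe_check(models, prompts, samples):
    """Run every model/prompt combination over every sample and collect rows."""
    rows = []
    for model, (prompt_name, prompt) in product(models, prompts.items()):
        for item in samples:
            rows.append({
                "model": model,
                "prompt": prompt_name,
                "image": item["image"],
                "output": call_model(model, prompt, item),
            })
    return rows


if __name__ == "__main__":
    for row in vibe_check(MODELS, PROMPTS, SAMPLES):
        print(row)  # read the outputs side by side and judge by eye
```

Keeping each result as a plain dictionary makes it easy to dump the rows to a file later, once eyeballing the outputs is no longer enough.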
We set up some very simple evals, launch the evaluation, and pick the best one out of it. Then we go to domain expert feedback, because these apps tend to focus on a very specific task, which requires domain-specific knowledge. So we have a lot of chances to collaborate with domain experts. After that, we iterate on this process several times, and once it gets to a certain level, we deploy the model and monitor it.

For example, for the model and prompt: this is a real prompt we're using right now, though of course we use a different prompt per category. The input is the image; the output is the title, description, and item attributes. Of course, it involves some function calls.

After that, once we had some vibes, we moved on to heuristic metrics. We sampled some good listings from Mercari and came up with heuristic metrics, along with traditional metrics such as BLEU and ROUGE. We also have binary classification evaluations, such as a Japanese detector that checks whether the output is written in Japanese or not. Even though we strongly instruct the model to write in Japanese, sometimes it tries to write in English. This isn't everything, but it's a very good starting point.

Then we move on to LLM-as-a-judge. I don't know how many of you are familiar with this term, so let me explain briefly. It's very simple: you let an LLM evaluate your LLM app. There are a lot of open-source eval tools, such as G-Eval and Ragas [names uncertain in the recording], but I'm not going to get into the details; you can refer to papers like this one. We also came up with custom prompts for scoring and binary classification of the LLM app's performance.

Throughout this process, we found that some categories were suffering from hallucination. That's because of a lack of training data for certain categories, such as trading cards.
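A minimal sketch of a binary "is this written in Japanese?" heuristic like the Japanese detector mentioned above. The talk does not say how Mercari's detector works; this version simply measures the share of hiragana, katakana, and kanji characters, and the 0.5 threshold is an arbitrary assumption.

```python
import re

# Hiragana (U+3040-309F), Katakana (U+30A0-30FF), CJK Unified Ideographs (U+4E00-9FFF).
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JAPANESE_CHAR.match(c)) / len(chars)


def is_japanese(text: str, min_ratio: float = 0.5) -> bool:
    """Binary check; min_ratio is an assumed threshold, tune it on real listings.

    Limitation: kanji-only text cannot be told apart from Chinese this way.
    """
    return japanese_ratio(text) >= min_ratio


if __name__ == "__main__":
    for title in ["ナイキ スニーカー 27cm 美品", "Nike sneakers in good condition"]:
        print(title, is_japanese(title))
```

A check like this costs nothing per call, which is what makes it useful as a first filter before any model-based evaluation.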
So we needed to add another evaluation called grounding, which is basically fact-checking. For that one we didn't build a tool; we just use an API from Vertex AI. After going through all the evaluations, the final part is always the domain expert check, which always needs human feedback. We actually built an in-house tool, but I'm not allowed to show it at this conference, so I picked one of our partners' products, which is Lens. As you can see, it's very straightforward and intuitive: it provides a UI in which a domain expert can evaluate the performance of LLM apps.

So here are the key learnings. You don't have to be too ambitious at first. Start with a small dataset for vibe checks, then set a threshold, and then do tiered evals. As I said, LLM-as-a-judge means using an LLM to evaluate an LLM, which means it costs double. So first filter down to the best models and best prompts with heuristic evaluation, and only then move on to the next level. During evaluation, if you think something is not going well, evaluate your evaluation. And while you evaluate, you need to use a lot of tools. Use tools if they exist; if not, make one. It's now easier than ever to build a tool rather than learn one, right? So attention is not all you need; sometimes evals are all you need.

Lastly, I'd like to offer some action items. As I said, LLM apps tend to focus on a very specific subject, which requires feedback from subject matter experts, so as engineers we need to collaborate with them. And if you want to scale your LLM apps, you need two things: evaluation and tools.
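The tiered evals from the key learnings above can be sketched roughly as follows: a cheap heuristic runs on every candidate, and the expensive judge runs only on candidates that clear a threshold. The function names, the toy length heuristic, and the stub judge are assumptions for illustration; a real judge would call an LLM.

```python
from typing import Callable


def length_ok(output: str) -> float:
    """Toy heuristic (assumed): 1.0 if the text is non-empty and at most 40 chars."""
    return 1.0 if 0 < len(output) <= 40 else 0.0


def stub_judge(output: str) -> float:
    """Hypothetical placeholder for an LLM-as-a-judge call returning a score."""
    return 0.8


def tiered_eval(
    candidates: dict[str, list[str]],
    cheap_metric: Callable[[str], float],
    judge: Callable[[str], float],
    threshold: float,
) -> dict[str, float]:
    """Tier 1: mean heuristic score per candidate. Tier 2: judge survivors only."""
    results = {}
    for name, outputs in candidates.items():
        cheap_score = sum(cheap_metric(o) for o in outputs) / len(outputs)
        if cheap_score < threshold:
            continue  # filtered out before any judge call is paid for
        results[name] = sum(judge(o) for o in outputs) / len(outputs)
    return results


if __name__ == "__main__":
    candidates = {
        "prompt-v1": ["", "x" * 50],                      # fails the heuristic
        "prompt-v2": ["Nike sneakers", "Trading card"],  # passes the heuristic
    }
    print(tiered_eval(candidates, length_ok, stub_judge, threshold=0.9))
```

Only `prompt-v2` reaches the judge here, so the judge runs twice instead of four times; with a paid LLM judge that difference is the cost saving the tiering is meant to deliver.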
Evaluation comes first, of course, and for evaluation you actually need to use a lot of tools; by combining them you can scale your application. Some people say LLMOps is quite complicated, but we didn't go through all those tools. You don't need them at first. Maybe you'll need them someday, but start simple: just write a few scripts that test your models and system prompts, and then grow with the needs of your problem and your team. For example, in our case we started with [unclear] and JSON blobs, even just for loops for the vibe checks, and the eval dataset was very small at first. Then we came up with a simple UI for human feedback. As the needs grew and we needed more tools, we started using external tools such as Weave.

This is the workflow we followed for LLMOps from the developer side. You validate an MVP first, starting with a system prompt. Then you integrate observability and reproducibility into your app using your tools. After that, you start adding more evals.

Now I'm going to visualize how we went through this process. This is not the actual data; it's just for visualization. One of our engineers started the MVP with different models and prompts. Through Weave, we can keep track of all the models and system prompts, and you can also check the diff of the input data and the output data. There's prompt management, of course, and code versioning if you're using function calls. Also, throughout the MVP you're not going to use only one dataset; you're going to use several. And Weave provides dataset versioning too.
You can check the details of each dataset, which is very helpful. And evaluation, the most important thing: you can simply visualize it in tables, and the other things we added are a scoring system, binary classification, and so on and so forth.

With the observability and reproducibility these tools give us, we can iterate quickly. In our case, one engineer was in charge of this task, and he ran about 20K runs within a week. For example, in one day he came up with the MVP and got team confirmation from the EM and PdM by providing this visualized data and evaluation. He got feedback within a few minutes thanks to the visualization, and the very same day he was able to reproduce everything thanks to Weave. Then he got the final confirmation within the team and moved to the next level, which is the domain experts.

For SMEs, you can start with built-in metrics. Built-in metrics means that if you use a tool, it provides some metrics out of the box, based on experience. If there are no built-in metrics, you can provide the SMEs with baselines. Then they start to create human feedback based on their knowledge of the specific subject, and you move on to custom metrics using prompts. After this, you achieve alignment between the metrics and the human feedback, then iterate quickly to improve quality, and so on and so forth. Once you reach a certain level, you're okay to deploy and productionize.
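One way to check the alignment between a metric and human feedback described above is to compare their pass/fail labels on the same items. This is a sketch under assumptions: the talk does not say which statistic Mercari uses, so raw agreement and Cohen's kappa are swapped in here as common choices, and the labels are made-up sample data.

```python
def agreement(metric: list[int], human: list[int]) -> float:
    """Share of items where the automatic metric and the human label match."""
    if len(metric) != len(human) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(m == h for m, h in zip(metric, human)) / len(human)


def cohens_kappa(metric: list[int], human: list[int]) -> float:
    """Agreement corrected for chance, for binary (0/1) labels."""
    n = len(human)
    observed = agreement(metric, human)
    p_metric = sum(metric) / n  # rate at which the metric says "pass"
    p_human = sum(human) / n    # rate at which the human says "pass"
    expected = p_metric * p_human + (1 - p_metric) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters are constant and identical
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up labels: 1 = pass, 0 = fail, one entry per evaluated listing.
    metric_labels = [1, 1, 0, 0, 1, 0, 1, 1]
    human_labels = [1, 1, 0, 1, 1, 0, 0, 1]
    print(agreement(metric_labels, human_labels))     # 0.75
    print(cohens_kappa(metric_labels, human_labels))  # about 0.467
```

A low kappa despite decent raw agreement suggests the custom metric's prompt or threshold still needs work before it can stand in for the experts.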
As I said, I'm not allowed to share our internal tools, so instead I brought one of our partners' tools, which is called Citadel AI's Lens. As you can see, they provide some built-in metrics for LLM apps, they also provide custom templates for coming up with your own custom metrics, and this is the screen for human feedback, which is very intuitive.

So here are the takeaways. Don't be ambitious from the start. Start with a strong foundation, whatever it is, and then add observability and reproducibility, with whatever tools you have, for quick iteration, because quick iteration leads to success. And really, all I want to say is about evals. There are a lot of good open-source projects and good papers you can refer to, and sometimes, in our experience, good evals can beat RAG and fine-tuning. Here are some of the open-source projects and papers we referenced throughout our LLM app development.

Then what's next? It's no secret that even base model developers, frontier companies like OpenAI, suffer from a lack of evals, which means we need better evals now more than ever. Think about how many job interviews you went through to get the job you have right now, and then think about how LLM apps get productionized without evals.

These slides were made possible by my wonderful teammates, my previous team at Tio, and other community leaders. Here are some resources. That's all from my side. Thanks for listening.

Original Description

Mercari has invested heavily in DevOps, MLOps, and, recently, LLMOps with significant payoff to developer speed and quality, both in terms of software quality and quality of life. From prompt management, evaluation, and LLM application observability, this talk dives into major LLMOps focus areas and open-source software that were key to the success of our most ambitious projects, which have already delivered unparalleled customer value to over 23 million users of Japan's largest C2C e-commerce marketplace.

The video explains how Mercari applies LLMOps to eval-driven development at scale, and where fine-tuning and prompt engineering fit alongside an evaluation-centric approach. It shows LLM APIs, a Vertex AI API for grounding checks, and Weave for tracking and evaluation in use. The steps below summarize the workflow for building and deploying LLM apps with eval-driven development.

Key Takeaways
  1. Create a prompt
  2. Set up evals
  3. Launch evaluation
  4. Pick the best model
  5. Get domain expert feedback
  6. Start with a small data set and threshold
  7. Use tiered evaluation
  8. Collaborate with subject matter experts
  9. Use tools for evaluation and feedback
  10. Integrate observability and reproducibility into your app
💡 The evaluation-centric approach is crucial for productionizing LLM apps; observability and reproducibility make the quick iteration behind it possible, and good evals can sometimes beat RAG and fine-tuning.
