LLM Result Analysis Explained: Chapter 23

Weights & Biases · Intermediate ·🧠 Large Language Models ·2y ago

Key Takeaways

The video demonstrates how to analyze LLM evaluation results in the Weights & Biases dashboard, focusing on where the model makes errors and which improvements could address them.

Full Transcript

[Music]

The evaluation run has finished, and we can now analyze the evaluation results in the Weights & Biases dashboard. If you remember our script, we logged the evaluation results for interactive analysis in a Weights & Biases table, and we can see this table in our dashboard. To make it a bit more readable, I'll open it in full screen and limit the number of rows displayed at one time. I can see all of the results here, including the correct and incorrect ones.

Let's focus on the examples where the model is making errors. We'll filter by the model score and exclude the examples that the model judges to be correct. That will focus on the model errors, at least based on the model-based evaluation.

Let's look at some of these examples. The question is "How do I start a sweep with Weights & Biases?" The ideal answer is quite comprehensive, while the model answer is very concise and probably does not include all of the relevant information from the ideal answer, and that's why the model is judging it as incorrect. In this case, maybe we should look at the prompt: maybe the prompt asks the model to be concise, and that might not be helping. There may be many different reasons why this model answer is not so comprehensive; we can explore this, try to fix it, and then see how that looks in the next evaluation run.

Let's take a look at some other examples. Again, in this case the question is about wandb.watch; the ideal answer is pretty comprehensive, and the model answer is very succinct.

Another question: "How can I modify a Weights & Biases table after it has been logged with new data?" The ideal answer tells us that generally this is not possible, but that there are two options for adding new data to an existing table. However, the model answer does not say this is not possible and just gives one method, which is not necessarily the correct answer to this question.

We can proceed in this way: analyze where the model is making errors, get an intuition into which improvements could help improve the score, implement them, run our evaluation again, and compare the results. Ideally, as you implement this process, you will be able to improve the metric that you care about (in this case it could be the model accuracy), and your application will work better for your end users.
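The filtering step described in the transcript can also be reproduced offline. The following is a minimal sketch in plain Python, not the course's script: the column names (`question`, `ideal_answer`, `model_answer`, `model_score`), the `CORRECT`/`INCORRECT` labels, and the placeholder rows are all assumptions made for illustration. It keeps only the rows the judge did not mark as correct, then sorts them by how short the model answer is relative to the ideal answer, since overly concise answers were the pattern spotted in the video.

```python
# Placeholder evaluation rows. The column names and the CORRECT/INCORRECT
# labels are assumptions for this sketch, not the course's exact schema.
rows = [
    {
        "question": "starting a sweep",
        "ideal_answer": "one two three four five six seven eight nine ten",
        "model_answer": "one two",
        "model_score": "INCORRECT",
    },
    {
        "question": "wandb.watch",
        "ideal_answer": "one two three four five six seven eight",
        "model_answer": "one two three four",
        "model_score": "INCORRECT",
    },
    {
        "question": "logging a table",
        "ideal_answer": "one two three four",
        "model_answer": "one two three four",
        "model_score": "CORRECT",
    },
]


def filter_errors(rows, correct_label="CORRECT"):
    """Keep only the rows that the judge did not mark as correct."""
    return [r for r in rows if r["model_score"].strip().upper() != correct_label]


def length_ratio(row):
    """Word count of the model answer divided by that of the ideal answer."""
    ideal_words = len(row["ideal_answer"].split())
    model_words = len(row["model_answer"].split())
    return model_words / max(ideal_words, 1)


errors = filter_errors(rows)

# Shortest answers (relative to the ideal answer) come first.
for row in sorted(errors, key=length_ratio):
    print(f"{length_ratio(row):.2f}  {row['question']}")
```

A low ratio is only a hint that an answer may be too terse; it does not replace reading the ideal and model answers side by side, as the video does.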

Original Description

🤖 Unlock LLM Performance Insights! In Chapter 23 we dive into analyzing LLM evaluation results.

🧑🏾‍🎓 Full course with certification and class materials available free at http://wandb.me/building-llm-powered-apps
🏆 Daily swag draw and grand prize Airpods draw from Dec 1 and 31, 2023. Details at http://wandb.me/llm-apps-contest
🗣️ Join the course conversation on our Discord channel at http://wandb.me/course-discord

*Episode Description*
In this chapter of our "Building LLM-Powered Apps" course, brought to you by Weights & Biases, Darek Kleczek, W&B Machine Learning Engineer, takes you through the process of analyzing evaluation results for LLM applications. Learn how to interpret and utilize data from the Weights & Biases dashboard to identify and rectify errors in your LLM application.

🌟 Chapter Highlights
Deep Dive into Evaluation Results: Explore the steps of analyzing evaluation results in the Weights & Biases dashboard.
Interactive Analysis Techniques: Discover how to use filters and interactive tables to focus on areas where the model underperforms.
Error Identification and Insights: Understand how to identify specific errors and gain insights into potential improvements for your LLM application.
Continuous Improvement Cycle: Learn the iterative process of analyzing, implementing fixes, and re-evaluating to enhance the overall performance of LLM applications.

🎓 Enroll for Free: Join us on this educational journey to master the art of building LLM-powered applications. Enroll at http://wandb.me/building-llm-powered-apps.

👉 Next Chapter Sneak Peek: Don't miss our final chapter, where we wrap up the course with a recap and explore the next steps in LLM application development.

This video teaches how to analyze LLM evaluation results, identify model errors, and improve model accuracy using the Weights & Biases dashboard. By following the steps outlined in the video, viewers can see where their LLM application falls short and make data-driven decisions to improve its accuracy.

Key Takeaways
  1. Load evaluation results into Weights & Biases dashboard
  2. Filter results to focus on model errors
  3. Analyze model errors and identify potential improvements
  4. Implement improvements and re-run evaluation
  5. Compare results and refine improvements
💡 Analyzing model errors and identifying potential improvements can help increase model accuracy and overall LLM performance.
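Steps 4 and 5 above (re-run the evaluation, then compare) can be sketched as a small comparison between two runs. This is an illustrative example only: the per-question score dictionaries, the question keys, and the `CORRECT`/`INCORRECT` labels are hypothetical, and the helper names `accuracy` and `compare_runs` are not from the course.

```python
def accuracy(scores, correct_label="CORRECT"):
    """Fraction of questions the judge marked as correct."""
    if not scores:
        return 0.0
    return sum(s == correct_label for s in scores.values()) / len(scores)


def compare_runs(before, after, correct_label="CORRECT"):
    """List questions whose verdict changed between two evaluation runs."""
    shared = sorted(q for q in before if q in after)
    fixed = [q for q in shared
             if before[q] != correct_label and after[q] == correct_label]
    regressed = [q for q in shared
                 if before[q] == correct_label and after[q] != correct_label]
    return {"fixed": fixed, "regressed": regressed}


# Hypothetical per-question verdicts from two evaluation runs.
baseline = {"q1": "INCORRECT", "q2": "INCORRECT", "q3": "CORRECT"}
after_prompt_change = {"q1": "CORRECT", "q2": "INCORRECT", "q3": "CORRECT"}

print(f"accuracy before: {accuracy(baseline):.2f}")
print(f"accuracy after:  {accuracy(after_prompt_change):.2f}")
print(compare_runs(baseline, after_prompt_change))
```

Tracking regressions alongside fixes is a deliberate choice here: a change that fixes terse answers could make other answers worse, and the overall accuracy alone would hide that.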
