Behavioral Testing of ML Models (Unit tests for machine learning)

Jay Alammar · Beginner ·🧠 Large Language Models ·5y ago


Key Takeaways

The video discusses applying software engineering techniques such as unit testing to machine learning models through behavioral testing. It introduces the CheckList open source library and walks through failure rates for five sentiment analysis models: BERT and RoBERTa models trained on sentiment analysis, plus commercial models from Microsoft, Google, and AWS. It covers minimum functionality tests, invariance tests, and directional expectation tests as ways to evaluate model behavior and support quality assurance.

Full Transcript

Hello and welcome back. I'm really excited about this video because it covers a recent idea in NLP that takes unit testing from software engineering and applies it to machine learning. That enables us to do a few really cool things, like comparing models better and having a sense of quality assurance for models as we continue to train them. The idea is called behavioral testing.

Let's look at a comparison of two models. Say we have model one and model two. You can compare them by accuracy, and one clearly has a higher accuracy than the other. But what if we look at them in a different way? What if we break down their capabilities, so to speak? The one on the right, with the lower accuracy, maybe does better in two capabilities, and maybe in our use case we care more about those two than about the third one. In this scenario the one on the right may be the better choice despite having a lower accuracy, or F1, or any sort of single-number metric, because one number does not give us enough of a picture of how the model really behaves. Using our topic today, behavioral testing, we'll be able to get scorecards that score models on multiple axes.

The paper we'll be discussing is "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" by Ribeiro, Wu, Guestrin, and Singh. I believe it was the best paper at ACL last year. The basic idea is this. Let's say we have a model, say a small robot, and we have an expected behavior: we expect that this tiny robot will go straight and then take a left. How do we assure the quality, so to speak, of that behavior, that the robot works how we expect it to? We can borrow the idea of unit testing from software engineering and say: okay, let's have a test here to make sure that the
model reaches its end destination, let's have a test here to make sure that the model does not go beyond where it's supposed to take a left turn, and let's have another test here to make sure that it actually did turn left and not right. So you can write these software tests to ensure that the behavior of your robot, or your model, is as you expect, and you can run these tests whenever you update it or whenever you get a new model. Then you can group these tests into capabilities. In this case we can ask: how well does this robot go in a straight line? Maybe we ran ten different tests for going in straight lines and it succeeded in nine of the ten, so we have a score for this capability. How well does it do at turning, at stopping, and so on?

This is a miniature example, so let's look at how that can be done for NLP models, which is the topic of the paper. This is an example of a scorecard taken directly from the paper. Look at the first column on the left: instead of turning and stopping, we have actual linguistic capabilities that models are expected to have, such as how large their vocabulary is. Negation is the one we're going to go into a little bit more. We'll focus on the second column first, the minimum functionality test. Don't worry about the other two for now, we'll get to them.

Negation is important. The main example of NLP models we'll talk about here, and the one they discuss in the paper, is sentiment analysis: given a sentence, is it saying something negative or something positive? A sentence like "I like this product" is clearly positive sentiment. But what about "I don't like this product"? It's identical except for one word that does negation, and whether a sentiment analysis model handles negation well affects its final goal and its performance. So let's see
how we can test negation using something like CheckList. Basically, the unit test in the machine learning domain would be a small dataset that we carved out to target this behavior. We have examples of inputs and a label associated with each. Here, for example, we have "I don't like this product", which is a negative sentiment example. "The food is not poor" means it's not bad, so it's either positive or neutral, depending on whether the model has a neutral class or is just positive/negative. Then you have something like "The aircraft is not private": the sentence itself was neutral, and negating a neutral statement should still be neutral.

So how do we run the test? We feed the model the texts and have it make predictions, and then we compare those predictions with the labels. This one got three wrong and one correct, so that's a 75 percent failure rate. This is a functionality test that tells us how well a model does on negation. And you're not limited to four examples. You should probably have 50 or 100 or a lot more examples to test things.

This is an example from the paper. Here you can see MFTs, which are minimum functionality tests. These are the tests that look like small datasets. This is, let's say, test number one: a negated negative should be positive or neutral. They have examples, so "The food is not poor" should be positive or neutral. They evaluate five different models: I think this is BERT and this is RoBERTa, both trained on sentiment analysis, and then these are commercial models, so these are sentiment analysis models from Microsoft, Google, and AWS, and these are their failure rates. Under this test there are, let's say, 20 or 30 or 50 examples, and this is the failure rate, so the lower these numbers are, the better the model. And you can see tricky examples here, like negation
of a negative at the end, which should be positive: "I thought the plane would be awful, but it wasn't." All the models except RoBERTa find this very difficult, so they fail the majority of the time. So this is an interesting way to evaluate models and quality-assure them as part of, let's say, a CI/CD setup.

The second type of test is an invariance test. The example here: say we have a neutral or a positive sentence like "turtles have shells". If we change the example in a way that does not change sentiment, so if we say "feet in front" instead of "shells", the prediction of the model should not change. The class should not flip. So this is one type of test.

The third type is called a directional expectation test, and it goes like this. Say we have a sentence, we run it through the model, and the model says this is 50 percent positive, so it's between positive and negative. Now let's say we add a negative portion to the end of the sentence. How would the model react? Would the score increase or decrease? If it decreased, the test would pass. If the model says that the sentence including the word "awful" is actually more positive than the original, that's a test failure. So the idea is that if we perturb the input in a way that we know should take the prediction in one direction, the model should not go the other way. It can go up a little, maybe 10 percent. That is up to you, but this is the default they work with, to account for the general behavior of models. So it can go up a little bit, but not too much.

Here is a scorecard in more detail, comparing the various capabilities. Here you have the vocabulary and part-of-speech capability, and then what the minimum functionality test looks like: short sentences with neutral adjectives and nouns. A lot of the models get them correct, except
this BERT fails at these for some reason. Then here you have invariance tests, so replacing neutral words with other neutral words. Say you have this, maybe it's a tweet: "should I be concerned when I'm about to fly". Originally it was "should I be concerned that I'm about to fly", and that has a certain prediction. If we switch "that" to "when", the prediction shouldn't change. There's a low failure rate across the models, but there's still room to update them to have even lower failure rates.

Then you have the directional expectation test. If you add a positive phrase, make sure that the score does not go down, that the sentence is not judged to be more negative, so to speak. They added "You are extraordinary", so the score should go up and not down. That's this kind of test.

To recap: the first kind of test is the minimum functionality test. These look like unit tests. They are small datasets that test specific capabilities. Then we have the invariance tests, where we perturb the inputs in a way that we know should not affect the output, and we measure whether it does. The CheckList open source library that they provide with the paper includes templates that allow you to generate these kinds of perturbations. Then we have the directional expectation test, where if we make a perturbation to the input that is expected to shift the output one way or another, we make sure that it goes the way we're expecting and not the other way. So adding a negative word shouldn't make the prediction more positive.

So this has been your intro to behavioral testing of NLP models with CheckList. I invite you to read the paper and check out the GitHub library. There are links down in the description. I can see several scenarios where this can be used. We talked about comparing models, but I also
am excited about having this running if you have a model in production. It's good to make sure that the next update, or the next time you retrain, has not degraded one of the core capabilities the model should have, or that if it did degrade, you know about it. This carries over the practices that make software, and software organizations, more robust. I've seen companies struggle because of not investing enough in automated testing, software tests, and unit tests, so a lot of the engineers spend a lot of their time bug fixing, and I've seen CEOs frustrated by the speed of development. A lot of these software practices are able to help an organization that is growing, and the more we adopt them in machine learning practice, I think the better organizations, practices, and operations we'll have. Thank you for watching, and see you in the next video.
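The minimum functionality test described in the transcript (a small labeled dataset, model predictions, and a failure rate) can be sketched in plain Python. This is only an illustration, not the CheckList API: `toy_model` is a hypothetical keyword counter that ignores negation, and the case list and helper names are assumptions made for this example.

```python
import re
from typing import Callable, List, Set, Tuple

# Each case pairs an input text with the set of labels that count as correct.
# The sentences come from the transcript; the data structure is an assumption.
NEGATION_CASES: List[Tuple[str, Set[str]]] = [
    ("I don't like this product", {"negative"}),
    ("The food is not poor", {"positive", "neutral"}),
    ("The aircraft is not private", {"neutral"}),
    ("I thought the plane would be awful, but it wasn't", {"positive"}),
]

POSITIVE_WORDS = {"like", "good", "great"}
NEGATIVE_WORDS = {"poor", "awful", "bad"}


def toy_model(text: str) -> str:
    """Hypothetical stand-in classifier: counts sentiment words, ignores negation."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def failure_rate(model: Callable[[str], str], cases: List[Tuple[str, Set[str]]]) -> float:
    """Fraction of cases where the prediction is not among the accepted labels."""
    failures = sum(model(text) not in accepted for text, accepted in cases)
    return failures / len(cases)


if __name__ == "__main__":
    print(f"negation MFT failure rate: {failure_rate(toy_model, NEGATION_CASES):.0%}")
```

Because the toy model ignores negation, it gets three of the four cases wrong, which reproduces the three-wrong, one-correct situation from the walkthrough. A real test set would hold far more than four examples.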

Original Description

How can we empower machine learning models with powerful software engineering techniques like unit testing? Evaluating ML models using a single metric (like accuracy or F1-score) produces a low-resolution picture of model performance. Behavioral tests can give us a much higher resolution evaluation of a model's capabilities. By creating tests (which are small targeted test sets), we can better compare models or observe how model performance changes after re-training a model (or fine-tuning it). We discuss the paper 'Beyond Accuracy: Behavioral Testing of NLP Models with CheckList', which was selected as the ACL 2020 Best Paper.

Introduction (0:00)
Comparing models using capabilities (0:33)
Behavioral test of NLP models (3:06)
Test Type 1: Minimum Functionality Tests (4:22)
Test Type 2: Invariance Tests (7:04)
Test Type 3: Directional Expectation Tests (7:32)
Summary and Conclusion (10:00)
------
Paper: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList https://www.aclweb.org/anthology/2020.acl-main.442/
Code: https://github.com/marcotcr/checklist
------
Twitter: https://twitter.com/JayAlammar
Blog: https://jalammar.github.io/
Mailing List: https://jayalammar.substack.com/

More videos by Jay:
Language Processing with BERT: The 3 Minute Intro (Deep learning for NLP) https://youtu.be/ioGry-89gqE
Explainable AI Cheat Sheet - Five Key Categories https://www.youtube.com/watch?v=Yg3q5x7yDeM
The Narrated Transformer Language Model https://youtu.be/-QH8fRhqFHM
Jay's Visual Intro to AI https://www.youtube.com/watch?v=mSTCzNgDJy4
How GPT-3 Works - Easily Explained with Animations https://www.youtube.com/watch?v=MQnJZuBGmSQ



The video teaches how to apply behavioral testing to machine learning models, borrowing unit testing from software engineering to check model quality. It introduces the CheckList open source library and uses negation in sentiment analysis as its main example. Running such tests whenever a model is updated or retrained can reveal whether a core capability has degraded.
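The transcript mentions that the CheckList library ships templates for generating test inputs. The idea behind templating can be sketched with the standard library alone. The function `expand_template` and the slot names below are hypothetical and are not the library's API.

```python
from itertools import product
from typing import Dict, List


def expand_template(template: str, slots: Dict[str, List[str]]) -> List[str]:
    """Fill every combination of slot values into a str.format-style template."""
    names = list(slots)
    combos = product(*(slots[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


# Negated-negative sentences: each should be classified as positive or neutral.
negated_negatives = expand_template(
    "The {noun} is not {adjective}.",
    {"noun": ["food", "service"], "adjective": ["poor", "bad", "awful"]},
)

if __name__ == "__main__":
    for sentence in negated_negatives:
        print(sentence)
```

Two nouns and three adjectives yield six test sentences that all share one expected label, so a small vocabulary can grow into the dozens of examples the talk recommends for each test.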

Key Takeaways

Apply unit testing to machine learning models
Use the CheckList library to write behavioral tests for NLP models
Run minimum functionality tests
Run invariance tests
Use directional expectation tests to check model behavior
Feed model texts and compare predictions with labels
Use the templates in the CheckList open source library to generate input perturbations
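The invariance and directional expectation tests from the list above can be sketched as follows. This is a plain-Python illustration, not the CheckList API. `toy_score` is a hypothetical lexicon-based scorer, the label thresholds are arbitrary, and the tolerance of 0.1 is an assumption modeled on the "maybe 10 percent" mentioned in the transcript.

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (original text, perturbed text)

POSITIVE = {"great", "extraordinary", "love"}
NEGATIVE = {"awful", "terrible"}


def toy_score(text: str) -> float:
    """Hypothetical stand-in model: returns a 'positive' score in [0, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    raw = 0.5 + 0.2 * sum(w in POSITIVE for w in words) - 0.2 * sum(w in NEGATIVE for w in words)
    return min(1.0, max(0.0, raw))


def label(score: float) -> str:
    """Map a score to a class; the thresholds are arbitrary choices for this sketch."""
    if score > 0.6:
        return "positive"
    if score < 0.4:
        return "negative"
    return "neutral"


def invariance_failure_rate(score: Callable[[str], float], pairs: List[Pair]) -> float:
    """A pair fails when the perturbation flips the predicted class."""
    failures = sum(label(score(a)) != label(score(b)) for a, b in pairs)
    return failures / len(pairs)


def directional_failure_rate(score: Callable[[str], float], pairs: List[Pair],
                             direction: int, tolerance: float = 0.1) -> float:
    """direction=+1 expects the score not to drop, -1 expects it not to rise.

    A pair fails only when the score moves the wrong way by more than `tolerance`.
    """
    failures = sum(direction * (score(b) - score(a)) < -tolerance for a, b in pairs)
    return failures / len(pairs)


INV_PAIRS = [("Should I be concerned that I'm about to fly",
              "Should I be concerned when I'm about to fly")]
ADD_NEGATIVE = [("The flight was on time.", "The flight was on time. The food was awful.")]
ADD_POSITIVE = [("Thanks for the flight.", "Thanks for the flight. You are extraordinary.")]

if __name__ == "__main__":
    print("INV failure rate:", invariance_failure_rate(toy_score, INV_PAIRS))
    print("DIR (add negative):", directional_failure_rate(toy_score, ADD_NEGATIVE, -1))
    print("DIR (add positive):", directional_failure_rate(toy_score, ADD_POSITIVE, +1))
```

Neither test type needs a gold label: the invariance test compares the model with itself before and after the edit, and the directional test only checks which way the score moved.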

💡 Behavioral testing can provide a higher resolution evaluation of model performance, beyond a single metric like accuracy or F1-score, by checking for minimum functionality, invariance, and directional expectation.
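To make the "higher resolution" point concrete, here is a small sketch of a per-capability scorecard, plus a check for capabilities that got worse after a retrain. All test results below are made up for illustration, and the function names and the 0.05 tolerance are assumptions rather than anything taken from the paper or the library.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Result = Tuple[str, bool]  # (capability name, did the test case pass?)


def accuracy(results: List[Result]) -> float:
    """Single-number view: fraction of all test cases that passed."""
    return sum(passed for _, passed in results) / len(results)


def scorecard(results: Iterable[Result]) -> Dict[str, float]:
    """Multi-axis view: failure rate per capability."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        if not passed:
            failures[capability] += 1
    return {name: failures[name] / totals[name] for name in totals}


def degraded(before: Dict[str, float], after: Dict[str, float],
             tolerance: float = 0.05) -> List[str]:
    """Capabilities whose failure rate rose by more than `tolerance`.

    A capability missing from `after` is treated as fully failing.
    """
    return sorted(name for name in before if after.get(name, 1.0) - before[name] > tolerance)


# Made-up results for two versions of a hypothetical model.
old_results = [("negation", False)] * 3 + [("negation", True)] + [("vocabulary", True)] * 4
new_results = ([("negation", True)] * 3 + [("negation", False)]
               + [("vocabulary", True)] * 2 + [("vocabulary", False)] * 2)

if __name__ == "__main__":
    print("accuracy:", accuracy(old_results), "->", accuracy(new_results))
    print("old scorecard:", scorecard(old_results))
    print("new scorecard:", scorecard(new_results))
    print("degraded:", degraded(scorecard(old_results), scorecard(new_results)))
```

In this made-up data both versions pass five of eight cases, so the single number is identical, yet the scorecard shows negation improving while vocabulary gets worse. A check like `degraded` could run in a CI job after each retrain.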
