Behavioral Testing of ML Models (Unit tests for machine learning)
Key Takeaways
The video discusses applying software engineering techniques like unit testing to machine learning models, specifically behavioral testing, and introduces the CheckList open-source library. Its examples compare sentiment analysis models: BERT and RoBERTa models trained for sentiment analysis, and commercial models from Microsoft, Google, and AWS. It covers three test types (minimum functionality tests, invariance tests, and directional expectation tests) that evaluate a model's behavior per capability and support quality assurance.
Full Transcript
Hello and welcome back. I'm really excited about this video because I'll be discussing a recent idea in NLP that takes unit testing from software engineering and applies it to machine learning. That enables us to do a few really cool things, like being able to compare models better and having a sense of quality assurance for models as we continue to train them. The idea is called behavioral testing.

Let's look at a comparison of two models, for example. Say we have model one and model two. You can compare them by accuracy, and one clearly has a higher accuracy than the other. But what if we look at them in a different way? What if we break down their capabilities, so to speak? The one on the right, with the lower accuracy, maybe does better in two capabilities, and maybe in our use case we care more about these two than about the third one. In this scenario the one on the right may be the better choice despite having a lower accuracy, or F1, or any sort of single-number metric; such a metric does not give us enough of a picture of how the model really behaves. Using our topic today, behavioral testing, we'll be able to get these scorecards that score models on multiple axes.

The paper we'll be discussing is "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" by Ribeiro, Wu, Guestrin, and Singh. It was the best paper, I believe, at ACL last year.

The basic idea is this. Let's say we have a model, a small robot, and we have an expected behavior: we expect that this tiny robot will go straight and then take a left. How do we assure the quality, so to speak, of the behavior, that the robot works how we expect it to? We can borrow this idea of unit testing from software engineering and say: okay, let's have a test here to make sure that the
model reaches its end destination. Let's have a test here to make sure that the model does not go beyond where it's supposed to take a left turn. Let's have another test here to make sure that it actually did turn left and not right. So you can just write these software tests to ensure that the behavior of your robot, or your model, is as you expect it, and you can run these tests numerous times: whenever you update it, whenever you get a new model.

Then you can group these tests into capabilities. In this case we can ask: how well does this robot go in a straight line? Maybe we ran ten different tests for going in straight lines and it succeeded in nine of the ten, so we have a score for this capability. How well does it do in turning and stopping? So on and so forth.

This is a miniature example, but let's look at how that can be done in NLP; that's the topic of the paper. This is an example of a scorecard taken directly from the paper. Look at the first column on the left: instead of turning and stopping, we have actual linguistic capabilities that models are expected to have, such as how large their vocabulary is. Negation is the one we're going to go into a little bit more. We'll focus on the second column first, the minimum functionality test; don't worry about the other two for now, we'll get to them.

Negation is important. The main NLP task we'll talk about here, and the one they discuss in the paper, is sentiment analysis: given a sentence, is it saying something negative or something positive? If you have a sentence like "I like this product," this is a positive sentiment; this is clear. But what about "I don't like this product"? It's identical except for one word that does negation, and whether a sentiment analysis model handles negation well or not affects its final goal and its performance. So let's see
how we can test negation using something like CheckList. Basically, the unit test in the machine learning domain would be a small dataset that we carved out that we know exercises this behavior. We have examples of inputs and a label associated with each. Here, for example, we have "I don't like this product," so this would be a negative sentiment example. Then "The food is not poor": so it's not bad, so it's either positive or neutral, depending on whether the model has a neutral class or is just positive/negative. And then you have something like "The aircraft is not private": the sentence itself was neutral, then you negated the neutrality, and so it should still be neutral.

How do we run the test? We feed the model the texts and have it make predictions, and then we just compare them with the labels. This one got three wrong and one correct, so that's a 75 percent failure rate. This is a functionality test that tells us how well a model does on negation. And you're not limited to only four examples; you should probably have 50 or 100 or a lot more.

This is an example from the paper. Here you can see MFTs, minimum functionality tests; these are the tests that look like small datasets. This is, let's say, test number one: a negated negative should be positive or neutral. Here they have examples, so "The food is not poor" should be positive or neutral. And here they evaluate five different models. I think this is a BERT and this is RoBERTa, both trained on sentiment analysis, and then these are commercial models: sentiment analysis models from Microsoft, from Google, and from AWS. These are their failure rates. Under this test there are, let's say, 20 or 30 or 50 examples, and this is the failure rate, so the lower these numbers are, the better the model. And then you can see tricky examples here, like negation
of a negative at the end, which should be positive: "I thought the plane would be awful, but it wasn't." All the models except for RoBERTa find this very difficult, so they fail the majority of the time. This is an interesting way to evaluate models and quality-assure them as part of, let's say, a CI/CD setup.

The second type of test is an invariance test. The example here is: let's say we have a neutral or a positive sentence, like, let's see, "turtles have shells." If we change the example in a way that does not change sentiment, so if we say "feet in front" instead of "shells," the prediction of the model should not change; the class should not flip. So this is the second type of test.

The third type of test is called a directional expectation test, and it goes like this. Let's say we have a sentence, we run it through the model, and the model says it's 50 percent positive, so it's between positive and negative. Now let's say we add a negative portion to the end of the sentence. How would the model react? Would this score increase or decrease? If it decreased, the test would pass. If the model says that this sentence, including the word "awful," is actually more positive than the original, then that's a test failure. So the idea here is that if we perturb the input in a way that we know should take the prediction in one direction, the model does not go the other way. It can go up maybe 10 percent; this is up to you, but this is the default that they work with, just to account for the general behavior of models. So it can go up a little bit, but not too much.

Here is a scorecard in more detail, comparing the various capabilities. Here you have the part-of-speech and vocabulary capability, and then the minimum functionality test and what it looks like: short sentences with neutral adjectives and nouns. A lot of the models get these correct, except
this one fails at these for some reason.

Then here you have invariance tests: replacing neutral words with other neutral words. Maybe this is a tweet: "Should I be concerned when I'm about to fly?" Originally it was "Should I be concerned that I'm about to fly?" and that has a certain prediction; if we switch "that" to "when," the prediction shouldn't change. There's a low failure rate across the models, but there's still room to update them to have even lower failure rates.

Then you have the directional expectation test here. If you add a positive phrase, make sure that the score does not go down, that the text is not indicated to be more negative, so to speak. They added "You are extraordinary," so the score should actually go up and not down. That's this kind of test.

To recap: the first kind of test is the minimum functionality test. These look like unit tests; they are small datasets that test specific capabilities. Then we have the invariance tests, where we perturb the inputs in a way that we know should not affect the output, and we measure whether it does. The CheckList open-source library that they provide with this paper includes templates that allow you to generate these kinds of perturbations for the input. Then we have the directional expectation test, where we make a perturbation to the input that is expected to shift the output one way or another, and it's for us to make sure that it goes the way we're expecting and not the other way. So adding a negative word shouldn't make the prediction more positive.

This has been your intro to behavioral testing of NLP models with CheckList. I invite you to read the paper and check out the GitHub library; there are links down in the description. I can see several scenarios where this can be used. We talked about comparing models, but I also
am excited about having this running when you have a model in production. It's good to make sure, the next time you update or retrain it, that none of the core capabilities the model should have has degraded, or that if one did degrade, you know about it. It's about carrying over these practices that make software, and software organizations, more robust. I've seen companies that struggle because of not investing enough in automated testing, in software tests and unit tests, so a lot of the engineers spend a lot of their time bug fixing, and I've seen CEOs frustrated by the speed of development. A lot of these software practices are able to help an organization that is growing, and the more we adopt them in machine learning practices, I think that'll lead to better organizations and better practices and operations.

Thank you for watching, and see you in the next video.
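The three test types recapped in the transcript can be sketched in plain Python. This is a minimal illustration, not the CheckList library's API: `positive_score` is a hypothetical toy scorer standing in for a real sentiment model, the 0.4 / 0.6 label cut-offs are arbitrary assumptions, and the 0.1 tolerance is only meant to mirror the "can go up maybe 10 percent" remark. The negation and "concerned that / when" sentences come from the video; the directional sentences are made up for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical word lists for a toy scorer; a real test would wrap an actual model.
POSITIVE = {"like", "good", "great", "extraordinary"}
NEGATIVE = {"poor", "bad", "awful"}
NEGATORS = {"not", "don't", "wasn't"}


def positive_score(text: str) -> float:
    """Toy stand-in for a sentiment model: score in [0, 1], where 0.5 is neutral."""
    score, negate = 0.5, False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATORS:
            negate = True  # naive: flips the next sentiment word, however far away
        elif word in POSITIVE or word in NEGATIVE:
            delta = 0.3 if word in POSITIVE else -0.3
            score += -delta if negate else delta
            negate = False
    return min(1.0, max(0.0, score))


def label(text: str) -> str:
    """Map the score to a class; the 0.4 / 0.6 cut-offs are arbitrary assumptions."""
    score = positive_score(text)
    return "positive" if score > 0.6 else "negative" if score < 0.4 else "neutral"


def failure_rate(passed: List[bool]) -> float:
    """Fraction of test cases that did not pass."""
    return sum(not ok for ok in passed) / len(passed)


def run_mft(cases: List[Tuple[str, Set[str]]]) -> float:
    """Minimum functionality test: the predicted label must be an accepted one."""
    return failure_rate([label(text) in accepted for text, accepted in cases])


def run_inv(pairs: List[Tuple[str, str]]) -> float:
    """Invariance test: a sentiment-preserving edit must not flip the label."""
    return failure_rate([label(a) == label(b) for a, b in pairs])


def run_dir(pairs: List[Tuple[str, str]], tolerance: float = 0.1) -> float:
    """Directional test: added negative text may raise the score by at most `tolerance`."""
    return failure_rate(
        [positive_score(after) - positive_score(before) <= tolerance
         for before, after in pairs]
    )


NEGATION_MFT = [
    ("I don't like this product", {"negative"}),
    ("The food is not poor", {"positive", "neutral"}),
    ("The aircraft is not private", {"neutral"}),
    ("I thought the plane would be awful, but it wasn't", {"positive", "neutral"}),
]
NEUTRAL_SWAP_INV = [
    ("Should I be concerned that I'm about to fly?",
     "Should I be concerned when I'm about to fly?"),
]
ADD_NEGATIVE_DIR = [  # illustrative sentences, not taken from the paper
    ("The flight was on time.", "The flight was on time. The food was awful."),
    ("I wasn't there.", "I wasn't there. It was awful."),
]

print("MFT negation failure rate:", run_mft(NEGATION_MFT))
print("INV neutral swap failure rate:", run_inv(NEUTRAL_SWAP_INV))
print("DIR add negative failure rate:", run_dir(ADD_NEGATIVE_DIR))
```

With this toy scorer the "but it wasn't" case fails the negation test, because the negator comes after the word it negates. The second directional case fails as well, since the naive negation flag carries over into the added sentence. In a real setup the three `run_*` functions would call the model under test, and the case lists would hold dozens of examples or more.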
Original Description
How can we empower machine learning models with powerful software engineering techniques like unit testing?
Evaluating ML models using a single metric (like accuracy or F1-score) produces a low-resolution picture of model performance. Behavioral tests can give us a much higher resolution evaluation of a model's capabilities. By creating tests (which are small targeted test sets), we can better compare models or observe how model performance changes after re-training a model (or fine-tuning it). We discuss the paper 'Beyond Accuracy: Behavioral Testing of NLP Models with CheckList', which was selected as the ACL 2020 Best Paper.
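As a rough sketch of how such targeted test sets could track a model across re-training, the snippet below builds a per-capability scorecard of failure rates and flags capabilities that got worse. Everything in it is an assumption for illustration: the capability names, the pass/fail outcomes, and the `max_increase` threshold are made up and are not results from the paper.

```python
from typing import Dict, List

# Hypothetical pass/fail outcomes per capability (True = the test case passed).
# These values are invented for illustration only.
BEFORE_RETRAIN: Dict[str, List[bool]] = {
    "negation": [True] * 6 + [False] * 4,
    "vocabulary": [True] * 9 + [False] * 1,
}
AFTER_RETRAIN: Dict[str, List[bool]] = {
    "negation": [True] * 8 + [False] * 2,
    "vocabulary": [True] * 7 + [False] * 3,
}


def scorecard(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Failure rate per capability (lower is better)."""
    return {
        capability: sum(not ok for ok in outcomes) / len(outcomes)
        for capability, outcomes in results.items()
    }


def regressions(
    old: Dict[str, float], new: Dict[str, float], max_increase: float = 0.05
) -> List[str]:
    """Capabilities whose failure rate rose by more than `max_increase`."""
    return [
        capability
        for capability in old
        if capability in new and new[capability] - old[capability] > max_increase
    ]


old_card = scorecard(BEFORE_RETRAIN)
new_card = scorecard(AFTER_RETRAIN)
for capability in old_card:
    print(f"{capability}: {old_card[capability]:.0%} -> {new_card[capability]:.0%}")
print("Regressed capabilities:", regressions(old_card, new_card))
```

A check like `regressions` could run in an automated pipeline after each re-training, so that a drop in one capability is visible even when a single overall metric improves.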
Introduction (0:00)
Comparing models using capabilities (0:33)
Behavioral test of NLP models (3:06)
Test Type 1: Minimum Functionality Tests (4:22)
Test Type 2: Invariance Tests (7:04)
Test Type 3: Directional Expectation Tests (7:32)
Summary and Conclusion (10:00)
------
Paper: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
https://www.aclweb.org/anthology/2020.acl-main.442/
Code:
https://github.com/marcotcr/checklist
------
Twitter: https://twitter.com/JayAlammar
Blog: https://jalammar.github.io/
Mailing List: https://jayalammar.substack.com/
More videos by Jay:
Language Processing with BERT: The 3 Minute Intro (Deep learning for NLP)
https://youtu.be/ioGry-89gqE
Explainable AI Cheat Sheet - Five Key Categories
https://www.youtube.com/watch?v=Yg3q5x7yDeM
The Narrated Transformer Language Model
https://youtu.be/-QH8fRhqFHM
Jay's Visual Intro to AI
https://www.youtube.com/watch?v=mSTCzNgDJy4
How GPT-3 Works - Easily Explained with Animations
https://www.youtube.com/watch?v=MQnJZuBGmSQ