[Read a paper (with code)] Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Key Takeaways
The video demonstrates the use of CheckList to generate and test different scenarios for NLP models. CheckList tests models along two dimensions, capabilities and test type, and the demo applies it to a Hugging Face Transformers question answering pipeline with a custom model.
Full Transcript
Hello everyone. Today I'd like to talk about a paper that I really like: "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList". I like this paper so much because it applies the unit-testing mindset to behavioral testing of NLP models, and I think the same mindset and framework can potentially be applied to any other model. In this video I'm going to talk about the paper briefly and show you how to use the checklist package in a Jupyter notebook.
There are a lot of behavioral testing papers out there. One of them, on a question answering task, is quite interesting. Here's a question and here's the context. The paper adds the phrase "why how because to kill american people" at the end of the context, and the model then predicts "to kill american people" for those changed passages. This is quite disturbing and really concerning to look at, and there are a lot of examples like this.
CheckList provides a framework to automatically generate and test those different scenarios and examples to see where our model fails. It tests models along two dimensions. One is capabilities, which is the list of things the tests can cover. The other is test type, which includes the minimum functionality test (MFT), the invariance test (INV), and the directional test.
Here are three examples for sentiment analysis. For the minimum functionality test, we generate simple test cases from a template and see if the model's predictions match the expected labels. For the invariance test, we make some small data augmentations and expect the labels to stay the same. In this example we change the location from Chicago to Dallas, and the location change should not change the sentiment of the sentence. The third case is the directional test: in this example the authors add a negative-sentiment sentence at the end of each sentence and expect the sentiment to go down, not up.
What's interesting is that this paper tested state-of-the-art commercial models. All of those models are said to perform at a human level, but when we look at these simple test cases they have a really high failure rate. The first table is for sentiment analysis, there's another table for duplicate question detection, and finally we have machine comprehension, or question answering. For the machine comprehension task in particular, the failure rate is really, really high.
One other thing: there are a lot of videos on this paper, and this is one of them. I will link the resources and materials in the description of this video.
So how can we use this checklist package? This paper is amazing in that it not only has a lot of video tutorials but also has a checklist package to help us implement this with our own models. This is their GitHub page. You can pip install checklist and the needed notebook extensions. In their notebooks section there are a bunch of notebooks, and I went through some of them. For the question answering task, I was checking out this notebook and trying to create a test suite and do some testing. You can find the original code here, but in my example I changed the data, the model, and some other small functions to be able to run my example smoothly. I will link all my notebooks in the description below so you can take a look.
So how does this actually work? First of all, we need to import all the needed packages and modules. Here I'm using a Hugging Face model. This is how Hugging Face works: we create a transformers pipeline for question answering. If we don't provide a model, it falls back to a default model and loads it. Then I give it a context and a question, and the model gives us an answer with a confidence score. If you have your own model, you can load it directly. In this case I have a trained model, so we can load the model and the tokenizer from this trained model. You can see it gives us the same answer with a slightly better confidence score.
Now coming to CheckList. CheckList has this large, I guess, word bank that gives us lists of named entities or words: for example, lists of first names, last names, locations, whatever. For things that CheckList doesn't provide, we can use the editor.suggest function with the mask tag, and CheckList will use a RoBERTa model to automatically generate the adjectives we need.
With all those words, we can use the editor.template function to automatically generate examples for our test cases. For example, here we generated the example "Alice is richer than Joseph". The question is "Who is richer?" and we expect the label to be Alice, but the model predicted Joseph. Luckily, for this test the failure rate is fairly low, only three percent, so our model is basically able to understand this context and question and give us the correct answer. However, if we change the question to "Who is less ...?", all of a sudden the failure rate increases to 99 percent, which means our model cannot understand those sentences.
This first example is a minimum functionality test, MFT. The second type of test is the invariance test, INV. In this example we give our question a typo, so this part of the question becomes a typo, basically, and then the invariance function expects the predictions for those two questions to be the same. In this case the predictions are not the same, so it's one of the failure cases. The failure rate in this case is 17 percent.
Another interesting thing I want to show you is that CheckList can be a tool to test fairness, to see if a model has any gender bias, regional bias, and so on. This example tests gender bias. It generates examples of the form "(male) is not a (profession), (female) is" and the reverse, and asks who has this profession, where these are the name of the female and the name of the male. Then we see whether the failure rates for male and female are different or similar across different professions.
This notebook has a lot of test cases, as you can see here. One nice thing the authors did is that we are able to add all those test cases to a test suite and save it to a pickle file. Then in another notebook, as I'm showing here, we can simply load the model, load the test suite, and run the suite for all the tests. It provides a nice summary for all the tests, and we can visualize this summary really nicely. Here are the two examples we were just talking about, one at zero percent and one at 100 percent, and we can see failure examples here, which is quite nice. Now we can see the question typos with examples. We also have all the other examples: we added a random sentence to the context, and we see that 10 percent of the time the model can produce the same result.
So this is how CheckList works: we generate tests automatically based on templates, or we do invariance-type tests, we create a test suite and save it to a pickle file, and then we can run the test suite all together and visualize the results. It's all great.
Then I was wondering: what if we train our model on those test cases? So I combined the Stanford Question Answering Dataset with all the test cases from my test suite for training, and then I ran the test on the same test suite but with different words and adjectives and all that. Surprisingly, the failure rate dropped to zero percent. This means the model can be trained quite easily if we find its issues, but finding the issues can be tricky and hard. This is where CheckList provides value: it helps guide researchers to find where the problems are and helps improve the model.
So yeah, I will link all the notebooks in the description so you can test it out, and I encourage you to check out this paper and also its repository. This is quite amazing, I really like this paper. Hope you enjoy it. Thank you!
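The workflow described in the transcript can be sketched in plain Python. This is only an illustration under stated assumptions, not the checklist package's API and not the code from the notebooks: `generate_cases`, `toy_model`, `mft_failure_rate`, `add_typo`, and `inv_failure_rate` are hypothetical helpers, the names and adjectives are made up, and the "model" is a deliberately naive stand-in that always answers with the first word of the context. The suite is pickled to in-memory bytes here, where the video saves it to a file.

```python
import itertools
import pickle

# Hypothetical word lists; the real package ships its own lexicons.
NAMES = ["Alice", "Joseph", "Maria"]
ADJECTIVES = [("richer", "rich"), ("taller", "tall")]


def generate_cases(question_template, answer_is_first):
    """Fill a comparison template with every ordered pair of names and every adjective."""
    cases = []
    for first, second in itertools.permutations(NAMES, 2):
        for comparative, base in ADJECTIVES:
            cases.append({
                "context": f"{first} is {comparative} than {second}.",
                "question": question_template.format(comparative=comparative, base=base),
                "expected": first if answer_is_first else second,
            })
    return cases


def toy_model(context, question):
    """Naive stand-in for a QA model: ignores the question, returns the first word."""
    return context.split()[0]


def mft_failure_rate(model, cases):
    """Minimum functionality test: fraction of cases where the prediction differs from the label."""
    failures = sum(model(c["context"], c["question"]) != c["expected"] for c in cases)
    return failures / len(cases)


def add_typo(text, position):
    """Swap the characters at position and position + 1."""
    chars = list(text)
    chars[position], chars[position + 1] = chars[position + 1], chars[position]
    return "".join(chars)


def inv_failure_rate(model, cases, perturb):
    """Invariance test: fraction of cases where perturbing the question changes the prediction."""
    failures = sum(
        model(c["context"], c["question"]) != model(c["context"], perturb(c["question"]))
        for c in cases
    )
    return failures / len(cases)


more_cases = generate_cases("Who is {comparative}?", answer_is_first=True)
less_cases = generate_cases("Who is less {base}?", answer_is_first=False)

# Group the tests into a suite and round-trip it through pickle.
suite = {"who is more": more_cases, "who is less": less_cases}
restored = pickle.loads(pickle.dumps(suite))

for name, cases in restored.items():
    print(f"{name}: {mft_failure_rate(toy_model, cases):.0%} failure rate")
# who is more: 0% failure rate
# who is less: 100% failure rate

typo_rate = inv_failure_rate(toy_model, less_cases, lambda q: add_typo(q, 7))
print(f"question typo: {typo_rate:.0%} failure rate")
# question typo: 0% failure rate
```

The toy model passes the typo test only because it never reads the question, which is also why it fails every "less" case. Looking at several test types together is what exposes that kind of shortcut.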
Original Description
Notebooks: https://github.com/sophiamyang/NLP_testing
Paper: Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. Association for Computational Linguistics (ACL), 2020
Learning Resources:
https://github.com/marcotcr/checklist
https://slideslive.com/38929272/beyond-accuracy-behavioral-testing-of-nlp-models-with-checklist
Video by Sophia Yang