[Read a paper (with code)] Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Sophia Yang · Beginner ·📄 Research Papers Explained ·4y ago

Key Takeaways

The video demonstrates how to use CheckList to generate and run test scenarios for NLP models. Tests are organized along two dimensions, capability and test type, and the demo uses the Hugging Face Transformers pipeline for question answering with a custom model.

Full Transcript

Hello everyone. Today I'd like to talk about a paper that I really like: "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList." I like this paper so much because it applies the unit-testing mindset to behavioral testing of NLP models, and I think the same mindset and framework can potentially be applied to any other kind of model. In this video I'm going to talk about the paper briefly, and then I'll show you how to use the checklist package in a Jupyter notebook.

There are a lot of behavioral testing papers out there. One of them is quite interesting; it looks at question answering tasks. Here's a question and here's the context. This paper appends the sentence "why how because to kill american people" to the end of the context, and the model then predicts "to kill american people" for those changed inputs. This is quite disturbing and really concerning to look at, and there are a lot of examples like this.

CheckList provides a framework to automatically generate and test those different scenarios and examples to see where our model fails. It tests models along two dimensions. One is capability, which is the list of behaviors the tests can cover. The other dimension is test type, which includes the minimum functionality test, the invariance test, and the directional test.

Here are three examples for sentiment analysis. For the minimum functionality test, we generate simple test cases from a template and see whether the model's predictions match the expected labels. For the invariance test, we apply small perturbations to the data and expect the labels to stay the same. In this example we change the location from Chicago to Dallas, and the location change should not change the sentiment of the sentence. The third case is the directional test. In this example the authors added a negative-sentiment sentence at the end of each input and expect the sentiment to go down, not up.

What's interesting is that this paper tested state-of-the-art and commercial models. All of those models perform at roughly human level, but on these simple test cases they have really high failure rates. The first table is for sentiment analysis, there's another table for duplicate question detection, and finally we have machine comprehension, or question answering. For the machine comprehension task in particular, the failure rate is really, really high.

One other thing: there are a lot of videos on this paper. This is one of them. I will link the resources and materials in the description of this video.

So how can we use the checklist package? This paper is amazing in that not only does it have a lot of video tutorials, it also has a checklist package to help us implement this with our own models. This is their GitHub page. You can pip install checklist and the needed notebook extensions. In their notebooks section there are a bunch of notebooks. I went through some of them, and for the question answering task I was checking out this notebook to create a test suite and do some testing. You can find the original code here, but in my example I changed the data, the model, and some other small functions to be able to run my example smoothly. I will link all my notebooks in the description below so you can take a look.

So how does this actually work? First of all, we need to import all the needed packages and modules. Here I'm using a Hugging Face model. This is how Hugging Face works: we call a Transformers pipeline for question answering. If we don't provide a model, it defaults to a DistilBERT model and loads it. Then I give it a context and a question, and the model gives us an answer with a confidence score. If you have your own model, you can load it directly. In this case I have a trained model, so we can load the model and the tokenizer from this trained model. You can see it gives us the same answer with a slightly better confidence score.

Now, coming to CheckList. CheckList has a large word bank that gives us lists of named entities and other words: for example, lists of first names, last names, locations, and so on. For things that CheckList doesn't provide, we can use the editor.suggest function with the mask tag, and CheckList will use a RoBERTa model to automatically generate the adjectives we need.

With all those words, we can use the editor.template function to automatically generate examples for our test cases. For example, here we generated the example "Alice is richer than Joseph." The question is "Who is richer?" We expect the label to be Alice, but the model predicted Joseph. Luckily, in this test the failure rate is fairly low, only three percent, so our model is basically able to understand this context and question and give us the correct answer. However, if we change the question to "Who is less [something]?", all of a sudden the failure rate increases to 99 percent, which means our model cannot understand those sentences.

This first example is a minimum functionality test, or MFT. The second type of test is the invariance test. In this example we give our question a typo, so the question becomes this one, and this part of it is the typo. The invariance test expects the predictions for those two questions to be the same. In this case the predictions are not the same, so it's one of the failure cases. The failure rate in this case is 17 percent.

Another interesting thing I want to show you is that CheckList can be a tool to test fairness, to see whether a model has any gender bias, regional bias, and so on. This example tests gender bias. It generates examples of the form "[male name] is not a [profession], [female name] is" and the reverse, with the question "Who is a [profession]?" Then we see whether the failure rates for male and female names are different or similar across different professions.

This notebook has a lot of test cases, as you can see here. One of the things the authors did is let us save all those test cases: we add them to a test suite and save the suite to a pickle file. In another notebook, as I'm showing here, we can simply load the model, load the test suite, and run the suite for all the tests. It then provides a nice summary for all the tests, and we can visualize this summary really nicely. Here are the two examples we were just talking about, one at zero percent and one at 100 percent. We can see failure examples here; it's quite nice. We can also see the question typos with examples, and all the other tests. For example, we added a random sentence to the context, and we see that 10 percent of the time the model can produce the same result.

So this is how CheckList works: we generate tests automatically based on templates, or with the various test types; we create a test suite and save it to a pickle file; and then we can run the test suite all together and visualize the results. It's all great.

Then I was wondering: what if we train our model on those test cases? So I combined the Stanford Question Answering Dataset with all the test cases from my test suite, and then I ran the same test suite but with different words and adjectives. Surprisingly, the failure rate dropped to zero percent. This means the model can be fixed quite easily once we find its issues, but finding the issues can be tricky and hard. This is where CheckList provides value: it helps guide researchers to where the problems are and helps improve the model.

I will link all the notebooks in the description so you can test it out, and I encourage you to check out this paper and its repository. This is quite amazing; I really like this paper. Hope you enjoy it. Thank you!
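
As a rough illustration of the template idea described above, here is a minimal sketch in plain Python. It does not use the checklist package; the name list, the adjective list, and the stand-in model `first_name_model` are all assumptions made up for this example. It expands a comparative template into question answering cases and computes a failure rate the way a minimum functionality test would.

```python
from itertools import permutations

# Hypothetical fill-in values; CheckList ships far larger word lists.
NAMES = ["Alice", "Joseph", "Maria", "David"]
ADJECTIVES = [("richer", "rich"), ("taller", "tall"), ("older", "old")]


def expand_templates():
    """Yield (context, question, expected_answer) for every slot combination."""
    for first, second in permutations(NAMES, 2):
        for comparative, base in ADJECTIVES:
            context = f"{first} is {comparative} than {second}."
            yield context, f"Who is {comparative}?", first
            yield context, f"Who is less {base}?", second


def first_name_model(context, question):
    """Stand-in QA model (an assumption): always answers with the first word."""
    return context.split()[0]


def failure_rate(model, cases):
    """Fraction of cases where the prediction differs from the expected label."""
    cases = list(cases)
    failures = sum(model(c, q) != expected for c, q, expected in cases)
    return failures / len(cases)


all_cases = list(expand_templates())
more_cases = [case for case in all_cases if " less " not in case[1]]
less_cases = [case for case in all_cases if " less " in case[1]]

print(f"generated {len(all_cases)} cases")
print(f"'more' questions: {failure_rate(first_name_model, more_cases):.0%} failures")
print(f"'less' questions: {failure_rate(first_name_model, less_cases):.0%} failures")
```

With this stand-in, every "more" question passes and every "less" question fails, which mirrors in exaggerated form the gap between the two templates reported in the video. A real model would be plugged in where `first_name_model` is.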

Original Description

Notebooks: https://github.com/sophiamyang/NLP_testing
Paper: Beyond Accuracy: Behavioral Testing of NLP models with CheckList. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. Association for Computational Linguistics (ACL), 2020.
Learning Resources:
https://github.com/marcotcr/checklist
https://slideslive.com/38929272/beyond-accuracy-behavioral-testing-of-nlp-models-with-checklist


This video teaches viewers how to use CheckList to generate and test different scenarios for NLP models, and how to leverage Hugging Face Transformers pipeline for question answering with a custom model. The video also covers the importance of behavioral testing and evaluation of NLP models.

Key Takeaways
  1. Import needed packages and modules
  2. Use CheckList to generate and test different scenarios
  3. Test models along two dimensions: capability and test type
  4. Run minimum functionality test, invariance test, and directional test
  5. Load a custom model into the Hugging Face pipeline
  6. Use CheckList's built-in word lists (names, locations, and so on) and editor.suggest to fill template slots
  7. Generate testing cases using CheckList templates
  8. Run invariance tests to check model robustness
  9. Use CheckList to test fairness and bias in models
💡 CheckList helps guide researchers to find model problems and improves the robustness and fairness of NLP models
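
The invariance and directional checks in the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the CheckList API: `toy_sentiment` is a made-up word-counting scorer standing in for a real model, and `swap_adjacent` is a hypothetical helper that imitates a typo by swapping two neighboring characters.

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "hate"}


def toy_sentiment(text):
    """Stand-in model (an assumption): positive-word count minus negative-word count."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def swap_adjacent(text, index):
    """Imitate a typo by swapping the characters at index and index + 1."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def invariance_failures(model, pairs):
    """Return the (original, perturbed) pairs whose predictions differ."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


def directional_failures(model, pairs):
    """Return the pairs where the score rose after negative text was appended."""
    return [(a, b) for a, b in pairs if model(b) > model(a)]


sentences = ["The flight to Chicago was great.", "I love the crew on this flight."]

# Invariance: neither a location swap nor a typo should change the prediction.
location_pairs = [(s, s.replace("Chicago", "Dallas")) for s in sentences]
typo_pairs = [
    (s, swap_adjacent(s, s.index(word)))
    for s, word in zip(sentences, ["great", "love"])
]

# Directional: appending a negative sentence should not raise the score.
negative_pairs = [(s, s + " I hate it.") for s in sentences]

print("location failures:", len(invariance_failures(toy_sentiment, location_pairs)))
print("typo failures:", len(invariance_failures(toy_sentiment, typo_pairs)))
print("directional failures:", len(directional_failures(toy_sentiment, negative_pairs)))
```

The toy scorer is robust to the location change but breaks on typos inside sentiment words, which is the kind of gap an invariance test is meant to surface.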
