[Read a paper (with code)] Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Sophia Yang · Beginner · 📄 Research Papers Explained · 4y ago

Key Takeaways

The video demonstrates how to use CheckList to generate and run behavioral tests for NLP models along two dimensions, capability and test type, using a Hugging Face Transformers question answering pipeline with a custom model.

Full Transcript

Hello everyone. Today I'd like to talk about a paper that I really like: "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList". I like this paper so much because it applies the unit testing mindset to behavioral testing of NLP models, and I think the same mindset and framework can potentially be applied to other kinds of models. In this video I'm going to talk about the paper briefly, and then I'll show you how to use the checklist package in a Jupyter notebook.

There are a lot of behavioral testing papers out there. One of them is quite interesting; it uses a question answering task. Here's a question and here's the context. This paper adds the phrase "why how because to kill american people" at the end of the context, and the model then predicts "to kill american people" for those changed examples. This is quite disturbing and really concerning to look at, and there are a lot of examples like this.

CheckList provides a framework to automatically generate and test those different scenarios and examples to see where our model fails. It tests models along two dimensions. One is capability, which is the list of things we can test. The other dimension is test type, which includes the minimum functionality test, the invariance test, and the directional test.

Here are three examples for sentiment analysis. For the minimum functionality test, we generate simple test cases from a template and see whether the model predicts the expected labels. For the invariance test, we make some small data augmentations and expect the labels to stay the same. In this example we change the location from Chicago to Dallas, and the location change should not change the sentiment of the sentence. The third case is the directional test. In this example the authors added a negative-sentiment sentence at the end of each sentence and expect the sentiment to go down, not up.

What's interesting is that this paper tested state-of-the-art and commercial models. All of those models perform close to human level, but when we look at those simple test cases they have a really high failure rate. The first table is for sentiment analysis, there's another table for duplicate question detection, and finally we have machine comprehension, or question answering. For the machine comprehension task in particular, the failure rate is really, really high.

One other thing: there are a lot of videos on this paper, and this is one of them. I will link the resources and materials in the description of this video.

So how can we use the checklist package? This paper is amazing in that it not only has a lot of video tutorials but also comes with the checklist package to help us apply this to our own models. This is their GitHub page. You can pip install checklist and the needed notebook extensions. In their notebooks section there are a bunch of notebooks, and I went through some of them. For the question answering task, I was checking out this notebook and trying to create a test suite and do some testing. You can find the original code here, but in my example I changed the data, the model, and some other small functions to be able to run my example smoothly. I will link all my notebooks in the description below so you can take a look.

So how does this actually work? First of all we need to import all the needed packages and modules. Here I'm using a Hugging Face model. This is how Hugging Face works: we create a Transformers pipeline for question answering. If we don't provide a model, it falls back to a default model and loads it. Then I give it a context and a question, and the model gives us an answer with a confidence score. If you have your own model, you can load it directly. In this case I have a trained model, so we can load the model and the tokenizer from this trained model. You can see it gives us the same answer with a slightly better confidence score.

Now, coming to CheckList. CheckList has this large word bank, I guess, that will give us lists of named entities or words: for example, lists of first names, last names, locations, and so on. For things that CheckList doesn't provide, we can use the editor.suggest function with the mask tag, and CheckList will use a RoBERTa model to automatically generate the adjectives we need. With all those words, we can use the editor.template function to automatically generate examples for our test cases.

For example, here we generated an example: "Alice is richer than Joseph." The question is "Who is richer?" We expect the label to be Alice, but the model predicted Joseph. Luckily, in this test the failure rate is fairly low, only 3 percent, so our model is basically able to understand this context and this question and give us the correct answer. However, if we change the question to "Who is less [adjective]?", all of a sudden the failure rate increases to 99 percent. That means our model cannot understand those sentences.

This first example is a minimum functionality test, or MFT. The second type of test is the invariance test. In this example we give our question a typo, so the question becomes this one, where this part is the typo. The invariance test expects the predictions for those two questions to be the same. In this case the predictions are not the same, so it's one of the failure cases. The failure rate in this case is 17 percent.

Another interesting thing I want to show you is that CheckList can be a tool to test fairness, to see if a model has any gender bias, regional bias, and so on. This example tests gender bias. It generates examples of the form "[male name] is not a [profession], [female name] is" and the reverse, and then asks who has this profession. Then we see whether the failure rates for male and female names are different or similar across different professions.

This notebook has a lot of test cases, as you can see here. One thing the authors did was let us add all those test cases to a test suite and save it to a pickle file. So in another notebook, as I'm showing here, we can simply load the model, load the test suite, and run the test suite for all the tests. It provides a nice summary for all the tests, and we can visualize this summary really nicely. Here are the two examples we were just talking about, one at about zero percent and one at about 100 percent. We can see failure examples here, which is quite nice. Now we can see the question typos with examples. We also have all the other examples: we added a random sentence to the context, and we see that 10 percent of the time the model can produce the same result.

So this is how CheckList works. We generate tests automatically based on templates, or we write invariance-type tests, create a test suite, save it to a pickle file, and then run the whole test suite together and visualize the results. It's all great.

Then I was wondering: what if we train our model on those test cases? So I combined the Stanford Question Answering Dataset with all the test cases from my test suite for training, and then I ran the test on the same test suite but with different words and adjectives. Surprisingly, the failure rate dropped to zero percent. This means the model can be fixed quite easily once we find its issues, but finding the issues can be tricky and hard. This is where CheckList provides value: it helps guide researchers to find where the problems are and improve the model.

I will link all the notebooks in the description so you can test it out. I encourage you to check out this paper and its repository. This is quite amazing. I really like this paper and hope you enjoy it. Thank you.
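To make the template idea from the transcript concrete, here is a minimal sketch in plain Python. It does not use the checklist package: `fill_template`, `failure_rate`, and `first_name_model` are hypothetical names invented for this illustration, and the toy model simply answers with the first word of the context. The name and adjective lists are also made up, so the failure rates it prints say nothing about the numbers reported in the video.

```python
import itertools
import random


def fill_template(context_tpl, question_tpl, label_tpl, names, adjectives,
                  nsamples=20, seed=0):
    """Expand templates into (context, question, expected_answer) tuples."""
    combos = [
        {"first": a, "second": b, "base": base, "comp": comp}
        for a, b in itertools.permutations(names, 2)  # a and b are always different
        for base, comp in adjectives
    ]
    random.Random(seed).shuffle(combos)  # fixed seed keeps the sample reproducible
    return [
        (context_tpl.format(**s), question_tpl.format(**s), label_tpl.format(**s))
        for s in combos[:nsamples]
    ]


def failure_rate(model, cases):
    """Return (fraction of failing cases, list of failing cases)."""
    failures = [(c, q, expected) for c, q, expected in cases
                if model(c, q) != expected]
    return len(failures) / len(cases), failures


def first_name_model(context, question):
    """Toy stand-in for a QA model: always answers with the first word."""
    return context.split()[0]


# Illustrative slot values (assumptions, not taken from the video).
NAMES = ["Alice", "Joseph", "Maria", "Daniel"]
ADJECTIVES = [("rich", "richer"), ("tall", "taller"), ("old", "older")]
CONTEXT = "{first} is {comp} than {second}."

more_cases = fill_template(CONTEXT, "Who is {comp}?", "{first}", NAMES, ADJECTIVES)
less_cases = fill_template(CONTEXT, "Who is less {base}?", "{second}", NAMES, ADJECTIVES)

more_rate, _ = failure_rate(first_name_model, more_cases)
less_rate, less_failures = failure_rate(first_name_model, less_cases)

print(f"'Who is richer?' style failure rate: {more_rate:.0%}")
print(f"'Who is less rich?' style failure rate: {less_rate:.0%}")
print("example failure:", less_failures[0])
```

A model that always picks the first name passes every "Who is richer?" case and fails every "Who is less rich?" case, which is the kind of gap a minimum functionality test is meant to expose.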

Original Description

Notebooks: https://github.com/sophiamyang/NLP_testing

Paper: Beyond Accuracy: Behavioral Testing of NLP models with CheckList. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. Association for Computational Linguistics (ACL), 2020

Learning Resources:
https://github.com/marcotcr/checklist
https://slideslive.com/38929272/beyond-accuracy-behavioral-testing-of-nlp-models-with-checklist

This video teaches viewers how to use CheckList to generate and test different scenarios for NLP models, and how to leverage Hugging Face Transformers pipeline for question answering with a custom model. The video also covers the importance of behavioral testing and evaluation of NLP models.
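The pipeline step can be sketched roughly as follows, assuming the transformers library is installed. `load_qa_pipeline` and `answer` are hypothetical helper names, and `model_dir` stands in for wherever a fine-tuned model was saved; the video's own notebook may load its model differently.

```python
def load_qa_pipeline(model_dir=None):
    """Build a question answering pipeline.

    With model_dir=None, transformers falls back to its default QA checkpoint.
    Otherwise model_dir is assumed to be a local path or Hub id of a
    fine-tuned question answering model.
    """
    # Imported lazily so that answer() below can be used without transformers.
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

    if model_dir is None:
        return pipeline("question-answering")
    model = AutoModelForQuestionAnswering.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return pipeline("question-answering", model=model, tokenizer=tokenizer)


def answer(qa, context, question):
    """Return (answer_text, confidence_score) from a QA pipeline-like callable."""
    result = qa(question=question, context=context)
    return result["answer"], result["score"]


# Usage (the first call downloads model weights):
#   qa = load_qa_pipeline()                      # default model
#   qa = load_qa_pipeline("path/to/my_model")    # hypothetical custom model path
#   print(answer(qa, "Alice is richer than Joseph.", "Who is richer?"))
```

Keeping `answer` separate from model loading means the same wrapper can be handed to a test harness regardless of which model sits behind it.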

Key Takeaways

Import needed packages and modules
Use CheckList to generate and test different scenarios
Test models along two dimensions: capability and test type
Run minimum functionality test, invariance test, and directional test
Load a custom model into the Hugging Face pipeline
Use CheckList's built-in word lists and editor.suggest (backed by a RoBERTa model) to fill template slots
Generate testing cases using CheckList templates
Run invariance tests to check model robustness
Use CheckList to test fairness and bias in models

💡 CheckList helps guide researchers to find model problems and improves the robustness and fairness of NLP models
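
The three test types listed above can be illustrated with a small self-contained sketch. This is not the CheckList API: the word-counting sentiment model, the helper names `mft`, `inv`, and `dir_not_up`, and the example sentences are all assumptions made up for illustration.

```python
POSITIVE = {"good", "great", "love", "amazing"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}


def sentiment_score(text):
    """Toy model: number of positive words minus number of negative words."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_label(text):
    return "positive" if sentiment_score(text) > 0 else "negative"


def mft(model_label, cases):
    """Minimum functionality test: prediction must equal the expected label."""
    return [text for text, expected in cases if model_label(text) != expected]


def inv(model_label, pairs):
    """Invariance test: a label-preserving edit must not change the prediction."""
    return [(a, b) for a, b in pairs if model_label(a) != model_label(b)]


def dir_not_up(model_score, pairs):
    """Directional test: the edited text must not score higher than the original."""
    return [(a, b) for a, b in pairs if model_score(b) > model_score(a)]


mft_cases = [
    ("The flight was great.", "positive"),
    ("The food was terrible.", "negative"),
    ("I love this airline.", "positive"),
    ("The flight was not good.", "negative"),  # negation trips up the toy model
]
# Changing the city should not change the sentiment.
inv_pairs = [(f"The flight to Chicago was {w}.", f"The flight to Dallas was {w}.")
             for w in ("great", "awful")]
# Appending a negative sentence should not raise the score.
dir_pairs = [(text, text + " I hate it.") for text, _ in mft_cases]

mft_failures = mft(sentiment_label, mft_cases)
inv_failures = inv(sentiment_label, inv_pairs)
dir_failures = dir_not_up(sentiment_score, dir_pairs)

print(f"MFT failure rate: {len(mft_failures) / len(mft_cases):.0%}", mft_failures)
print(f"INV failure rate: {len(inv_failures) / len(inv_pairs):.0%}")
print(f"DIR failure rate: {len(dir_failures) / len(dir_pairs):.0%}")
```

Here the only failure is the negated sentence in the minimum functionality test, which shows how a simple case can surface a weakness that an aggregate accuracy number would hide.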
