OpenAI releases CriticGPT to correct GPT-4's mistakes | Read the paper with me
Key Takeaways
The video discusses OpenAI's CriticGPT, a GPT-4-based model for improving the reliability of AI-generated content by identifying errors in code generated by ChatGPT. It covers the autoregressive critic policy and how it is trained with RLHF (Reinforcement Learning from Human Feedback).
Full Transcript
Hey folks, welcome to this new series called Reading Papers with Harshit, where we simply grab a cup of tea or coffee, whatever you like, and I walk you through the important details and concepts explained in research papers. So get your cup ready, as today we're going to talk about the paper that OpenAI released last week, called "LLM Critics Help Catch LLM Bugs", where they introduced their new model called CriticGPT. Now, what is CriticGPT? Why is it important? What methods did they follow? What kind of results have they seen? We're going to talk about everything, so let's dive in.
So CriticGPT represents a novel approach to enhancing the reliability of AI-generated content. This innovative model, which is part of the GPT-4 family, is specifically designed to help human reviewers detect and critique errors in the code produced by ChatGPT. Here they mention that on code containing naturally occurring LLM errors, the model-written critiques are preferred over human critiques in 63% of cases. Further, you can see that human evaluation finds that models catch more bugs than human contractors who are paid for code review. So this provides a method that addresses the growing challenge of evaluating increasingly sophisticated AI outputs, particularly as large language models become more and more complex.
Heading over to the introduction section: what makes the most capable AI systems effective today? We know that they are all trained with reinforcement learning from human feedback, RLHF. This method leverages the fact that evaluating AI output is usually faster and easier for humans than demonstrating the perfect output themselves. But as AI models become more and more advanced, even seasoned experts are struggling to reliably assess their outputs. This limitation of human evaluation is a fundamental issue with RLHF, and the field of scalable oversight aims to solve this by training models to help humans evaluate AI output more effectively.
Previous research has shown that methods like debate can help humans better assess answers to reading comprehension questions, but now it's time to assess these models in more realistic settings. So how can scalable oversight help humans assess model-written solutions to real-world tasks? For the first time, the research demonstrates that scalable oversight can help humans more comprehensively assess these solutions, particularly in writing code.
The core idea here is simple: they trained an autoregressive policy to take a question and answer pair and output a text critique pointing out errors, using RLHF on challenging real-world data. So they developed a GPT-4-based critic model called CriticGPT, which outperforms humans at detecting bugs. Here in this figure they show that LLMs catch significantly more inserted bugs than humans, and that model critiques are preferred over human critiques more than 80% of the time. And human-machine teams, that is, humans combined with CriticGPT, write more comprehensive critiques and avoid nitpicks and hallucinations better than the models alone.
The contributions of this research include demonstrating a scalable oversight method for real-world RLHF data, showcasing CriticGPT's superior bug detection and critique preference, highlighting the effectiveness of human-machine teams, and introducing a technique called Force Sampling Beam Search (FSBS) to balance real and spurious issues in critiques, which we'll see in a bit.
Now let's talk about the methods used to train this model. The LLM critics are autoregressive Transformer models, similar to ChatGPT. They take a question-answer pair as input and generate a text critique that highlights potential problems. The critiques include comments attached to specific quotes from the answer.
Now to the evaluation methods and the steps they took. First, the critique attributes. Contractors are hired to evaluate these critiques based on comprehensiveness, inclusion of a specific bug (CBI), presence of hallucinated bugs or nitpicks, and overall subjective helpfulness, and they rate them on a 1 to 7 scale, where 1 is the least helpful and 7 is the most helpful critique.
How are critiques compared and rated? Contractors compare four critiques of the same problem without knowing their sources and score them on the attributes we just talked about. This comparison gives a clearer estimate of model performance. The scores are then summarized using Elo scores, which estimate the win rate of one model's critique over another's, and confidence intervals for these scores are provided using a non-parametric bootstrap method.
Let's talk about the data distribution. How did they prepare the data for this model? Almost all training and evaluation data comes from the OpenAI RLHF pipeline, consisting of question and answer pairs. They focused on cases where the model's response was at least 50% Python by line count, using the largest code block as the answer and discarding the rest. They worked with two sets of data containing known bugs: first, human-inserted bugs, where contractors were paid to insert bugs into the answers; and second, human-detected bugs, which are programming tasks flagged by the original raters in the ChatGPT labeling pipeline. Human-detected bugs are more natural and usually easier for humans to spot, since they were already caught once. Inserted bugs, on the other hand, are tougher and more severe.
All right, now the meaty part of this paper: how are LLM critics trained using RLHF? The process follows a pipeline similar to ChatGPT. First, several critiques are sampled for each question and answer pair. Contractors then rate these critiques on various attributes, including overall quality. A reward model is trained to predict these quality rankings, and a policy is optimized against this reward model using PPO. Finally, an inference-time sampling strategy called FSBS, or Force Sampling Beam Search, is applied.
Additionally, there's an optional first step called tampering, and a large fraction of the data actually comes from this step. Here the contractors introduce subtle bugs into the model-written answers and explain the problems they introduced. This adversarial approach makes the bugs harder to detect: contractors also verify that an LLM critic misses these bugs at least once in three samples. To validate this, discriminators are trained to predict whether the code has been tampered with or not. The tampering procedure provides the majority of the training data, which makes the training data more similar to human-inserted bugs than to human-detected bugs.
Now, what is this Force Sampling Beam Search, FSBS, approach that they introduced, and how did they use it in addition to RLHF? FSBS combines a reward model with search to generate longer and more comprehensive critiques while reducing hallucinations and nitpicks, because whenever these models critique a long code snippet, they tend to hallucinate and make things up. We can break FSBS down into three parts.
First is the sampling. During critique generation, FSBS forces the model to produce highlighted sections of the input text that are quoted in the critique. This is done through constrained sampling, where the model is encouraged to select text spans or code snippets that are likely to contain bugs or issues.
Then we have beam search. The model uses beam search to explore multiple candidate critiques in parallel, which allows it to generate longer, more comprehensive critiques by considering a wider range of possible highlighted sections and the comments written alongside them.
Finally, we have the scoring. This is basically a scoring function that balances two terms. One is the RM score, the reward model score, which indicates how well the critique points out bugs or issues. The second is a length modifier multiplied by the number of highlights, so this is a length penalty that discourages the model from generating overly long critiques with excessive hallucinations or nitpicks. So FSBS manages the tradeoff between catching more issues and producing hallucinations and nitpicks, and this was a new approach introduced in this paper.
Now let's look at the kind of results this whole process has produced. LLM critiques, especially those by CriticGPT, are often preferred over human critiques, and they catch more inserted bugs; that's what they're claiming. Contractors significantly favor CriticGPT's critiques over those from prompted ChatGPT and over human-written critiques. The tamper-plus-RLHF pipeline has improved the bug detection rate, with LLM critics catching more bugs than human annotators. When contractors didn't have LLM assistance, they made fewer comments, possibly due to the need for specific domain knowledge and the difficulty of some tasks. These factors help explain why LLMs outperform human contractors.
Additionally, CriticGPT's reinforcement learning training enhances model performance across different scales. They evaluated a family of GPT-4 models and found that increasing the model size improves bug detection. To match CriticGPT's performance without extra data or modifying the ChatGPT pipeline, about 30 times more pre-training compute would be needed.
CriticGPT helps humans write more comprehensive critiques, especially in human plus CriticGPT teams. It outperforms ChatGPT on both human-detected and inserted bugs. While models have higher rates of nitpicks and hallucinations, the human plus CriticGPT teams strike a balance when they work together. While ChatGPT was trained with more data and compute, CriticGPT was tested with a similar setup for a fair comparison, and CriticGPT showed higher precision and recall on human-detected bugs, proving more effective for code critique. Models without tamper data underperformed, likely due to lower agreement rates and a less effective reward model as well.
Do LLM critics generalize beyond code? Yes, they can. CriticGPT was tested on general assistant tasks from ChatGPT training data that had been marked as flawless. In 24% of cases, CriticGPT identified problems that lowered the rating significantly, compared to only 6% without critiques. Using critique reward models to prioritize tasks also improved problem detection and reduced hallucinations.
So what's the key takeaway? Large language models have become so advanced that typical humans can't consistently evaluate their output without help, and this highlights the growing need for scalable oversight methods. Whether RLHF remains the primary post-training method or not, we must ensure that model outputs are trustworthy. The approach here is straightforward: train models to help humans evaluate other models. These LLM critics are already successful at catching bugs in real-world data, and even accessible models like ChatGPT can significantly assist human annotators and contractors. As LLM intelligence continues to improve, finding scalable methods to reward the right behaviors in AI systems is going to be crucial, and LLM critics show promise as a starting point.
All right, so did you guys like it? If you did, give it a thumbs up. Let me know in the comments what other variations or details I should add and how we can make it more interactive. I would love your feedback on the series, and do tell me the kind of papers we should read together. If you want to give suggestions, you can also join my Discord community, where we keep sharing these sorts of resources. But yeah, that's it for this time. I'll catch you guys in the next one. Until then, keep learning, keep building.
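Editor's sketch of the FSBS scoring step described in the transcript: each candidate critique gets a score equal to its reward-model score plus a length modifier multiplied by its number of highlights, and the highest-scoring candidate is kept. This is a minimal illustration under stated assumptions, not OpenAI's implementation. The names `Critique`, `fsbs_score`, and `pick_best`, and all numeric values, are hypothetical. The sketch also assumes that a negative length modifier acts as the length penalty the video describes, while a positive one would favor critiques with more highlights.

```python
from dataclasses import dataclass


@dataclass
class Critique:
    text: str
    rm_score: float      # hypothetical reward-model score for this critique
    num_highlights: int  # number of quoted spans the critique comments on


def fsbs_score(critique: Critique, length_modifier: float) -> float:
    """Reward-model score plus length modifier times number of highlights."""
    return critique.rm_score + length_modifier * critique.num_highlights


def pick_best(candidates: list[Critique], length_modifier: float) -> Critique:
    """Return the candidate critique with the highest combined score."""
    return max(candidates, key=lambda c: fsbs_score(c, length_modifier))


# Two made-up candidates: a short, focused critique and a longer one.
candidates = [
    Critique("Flags one off-by-one error.", rm_score=2.0, num_highlights=1),
    Critique("Flags four separate spans.", rm_score=1.5, num_highlights=4),
]

# A negative modifier penalizes extra highlights; a positive one rewards them.
precise = pick_best(candidates, length_modifier=-0.5)
comprehensive = pick_best(candidates, length_modifier=0.5)
print(precise.text)
print(comprehensive.text)
```

Sweeping the length modifier is one way to picture the tradeoff mentioned in the video: the same pool of candidates yields a shorter critique at one setting and a more comprehensive one at another.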
Original Description
OpenAI has unveiled CriticGPT, a new AI model based on GPT-4 designed to identify errors in code generated by ChatGPT, marking a significant step towards improving the accuracy and reliability of AI-generated outputs.
Link to the paper: https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf
---
## Want to get more development-oriented insights? Subscribe to my Newsletter to stay up to date on such updates in the world of AI
- High Signal AI Newsletter: https://highsignalai.substack.com/
- High Signal AI Instagram: https://www.instagram.com/highsignal_ai/
## AI Engineer Roadmap
- Roadmap video: https://youtu.be/br8u4JwXMBU
- Roadmap GitHub (don't forget to leave a star): https://github.com/dswh/ai-engineer-roadmap
## Social Media & Discord Server Invitation
Follow me for more AI Engineering resources, tutorials, and reviews:
- LinkedIn: https://www.linkedin.com/in/tyagiharshit/
- X / Twitter: https://twitter.com/dswharshit
- Join the Discord community for ideas, discussion, reviews, and more: https://discord.gg/rssxJV2Xkz
## Chapters
0:00 Intro and tea
00:17 📄 Critic GPT overview and significance
01:35 🧠 Challenges in evaluating AI outputs
04:23 🛠️ Methods for training LLM critics
06:51 🔄 Training with RLHF and FSBS approach
10:08 📊 Results and implications of Critic GPT
13:50 What should we read next!