OpenAI releases CriticGPT to correct GPT-4's mistakes | Read the paper with me

Harshit Tyagi · Beginner · 🧠 Large Language Models · 1y ago

Skills: LLM Engineering 80% · Fine-tuning LLMs 70% · RAG Basics 60% · RAG Evaluation 50%

Key Takeaways

The video walks through OpenAI's paper on CriticGPT, a GPT-4-based model trained to identify errors in code generated by ChatGPT. It covers the autoregressive critic policy, its training with RLHF (Reinforcement Learning from Human Feedback), and the results reported in the paper.

Full Transcript

Hey folks, welcome to this new series called Reading Papers with Harshit, where we simply grab a cup of tea or coffee, whatever you like, and I walk you through the important details and concepts explained in research papers. So get your cup ready, as today we're going to talk about the paper that was released by OpenAI last week called "LLM Critics Help Catch LLM Bugs", where they released their new model called CriticGPT. Now what is CriticGPT? Why is it important? What methods have they followed? What kind of results have they seen? We're going to talk about everything, so let's dive in.

CriticGPT represents a novel approach to enhancing the reliability of AI-generated content. This model, which is part of the GPT-4 family, is specifically designed to help human reviewers detect and critique errors in code produced by ChatGPT. Here they mention that on code containing naturally occurring LLM errors, the model-written critiques are preferred over human critiques in 63% of cases. Further, you can see that human evaluation finds that models catch more bugs than human contractors who are paid for code reviews. So this provides a method that addresses the growing challenge of evaluating increasingly sophisticated AI outputs, particularly as large language models become more and more complex.

Heading over to the introduction section: what makes the most capable AI systems effective today? We know that they are all trained with reinforcement learning from human feedback (RLHF). This method leverages the fact that evaluating AI output is usually faster and easier for humans than demonstrating the perfect output themselves. But as AI models become more and more advanced, even seasoned experts are struggling to reliably assess their outputs. This limitation of human evaluation is a fundamental issue with RLHF, and the field of scalable oversight aims to solve this by training models to help humans evaluate AI output more effectively.

Previous research has shown that methods like debate can help humans better assess answers to reading comprehension questions, but now it's time to assess these models in more realistic settings. So how can scalable oversight help humans assess model-written solutions to real-world tasks? For the first time, the research demonstrates that scalable oversight can help humans more comprehensively assess these solutions, particularly in writing code.

The core idea here is simple: they trained an autoregressive policy to take a question-and-answer pair and then output a text critique pointing out errors, using RLHF on challenging real-world data. So they developed a GPT-4-based critic model, called CriticGPT, which outperforms humans at detecting bugs. In this figure they show that LLMs catch significantly more inserted bugs than humans and that model critiques are preferred over human critiques more than 80% of the time. And human-machine teams, combining humans with CriticGPT, write more comprehensive critiques and avoid nitpicks and hallucinations better than the models alone.

The contributions of this research include demonstrating a scalable oversight method for real-world RLHF data, showcasing CriticGPT's superior bug detection and critique preference, and highlighting the effectiveness of human-machine teams. It further introduces a technique called Force Sampling Beam Search (FSBS) to balance real and spurious issues in critiques, which we'll see in a bit.

Now let's talk about the methods used to train this model. The LLM critics are autoregressive Transformer models, similar to ChatGPT. They take a question-answer pair as input and generate a text critique that highlights the potential problems. The critiques include comments attached to specific quotes from the answer.

On evaluation, let's first talk about the critique attributes. Contractors are hired to evaluate these critiques based on comprehensiveness, inclusion of specific bugs (which is CBI), presence of hallucinated bugs or nitpicks, and overall subjective helpfulness. They rate them on a 1-to-7 scale, where 1 is the least helpful and 7 is the most helpful critique.

Now how are critiques compared and rated? Contractors compare four critiques of the same problem without knowing their sources and score them on the attributes we just talked about. This comparison gives a clearer estimate of model performance. The scores are then summarized using Elo scores, which estimate the win rate of one model's critiques over another's, and confidence intervals for these scores are computed using a non-parametric bootstrap method.

Let's talk about the data distribution. How did they prepare the data for this model? Almost all training and evaluation data comes from the OpenAI RLHF pipeline, consisting of question-and-answer pairs. They focused on cases where the model's response was at least 50% Python by line count, using the largest code block as the answer and discarding the rest. They worked with two sets of data containing known bugs: first, human-inserted bugs, where contractors were paid to insert bugs into the answers; and second, human-detected bugs, which are programming tasks flagged by the original raters in the ChatGPT labeling pipeline. Human-detected bugs are more natural and usually easier for humans to spot, since they were already caught once. Inserted bugs, on the other hand, are tougher and more severe.

All right, now the meaty part of this paper: how are LLM critics trained using RLHF? The process follows a pipeline similar to ChatGPT's. First, several critiques are sampled for each question-and-answer pair. Contractors then rate these critiques on various attributes, including overall quality. A reward model is trained to predict these quality rankings, and a policy is optimized against this reward model using PPO. Finally, an inference-time sampling strategy called FSBS, or Force Sampling Beam Search, is applied.

Additionally, there's an optional first step called tampering, and a large fraction of the data actually comes from this step. Here the contractors introduce subtle bugs into the model-written answers and explain the problems they introduced. This adversarial approach makes the bugs harder to detect: contractors also verify that an LLM critic misses these bugs at least once in three samples. To validate this, discriminators are trained to predict whether the code has been tampered with or not. Because the tampering procedure provides the majority of the training data, the training distribution is more similar to human-inserted bugs than to human-detected bugs.

Now what is this Force Sampling Beam Search (FSBS) approach that they have introduced, and how is it used in addition to RLHF? FSBS combines a reward model with search to generate longer and more comprehensive critiques while reducing hallucinations and nitpicks, because whenever these models critique a long code snippet, they tend to hallucinate and make up answers. We can break FSBS down into three parts.

First is the sampling. During critique generation, FSBS forces the model to produce highlighted sections of the input text that are quoted in the critique. This is done through constrained sampling, where the model is encouraged to select text spans or code snippets that are likely to contain bugs or issues.

Then we have beam search. The model uses beam search to explore multiple candidate critiques in parallel, which allows it to generate longer, more comprehensive critiques by considering a wider range of possible highlighted sections and the comments written alongside them.

Finally, we have the scoring model. This is basically just a scoring function that balances two terms. One is the RM score, the reward model's score, which indicates how well the critique points out bugs or issues. The second is a length modifier multiplied by the number of highlights, a length penalty that discourages the model from generating overly long critiques with excessive hallucinations or nitpicks. So FSBS provides a balance, managing the tradeoff with hallucinations and nitpicks, and this is a new approach that they have introduced in this paper.

Now let's look at the results this whole process has produced. LLM critiques, especially those by CriticGPT, are often preferred over human critiques, and they catch more inserted bugs; that's what they're claiming. Contractors significantly favor CriticGPT's critiques over those from prompted ChatGPT and over human-written critiques. The tamper-plus-RLHF pipeline has improved the bug detection rate, with LLM critics catching more bugs than human annotators. When contractors didn't have LLM assistance, they made fewer comments, possibly due to the need for specific domain knowledge and the difficulty of some tasks. These factors explain why LLMs outperform human contractors.

Additionally, CriticGPT's reinforcement learning training enhances model performance across different scales. They evaluated a family of GPT-4 models and found that increasing the model size improves bug detection. To match CriticGPT's performance without extra data or modifying the ChatGPT pipeline, about 30 times more pre-training compute would be needed.

CriticGPT helps humans create more comprehensive critiques, especially in human-plus-CriticGPT teams, and it outperforms ChatGPT on both human-detected and inserted bugs. While models have higher rates of nitpicks and hallucinations, the human-plus-CriticGPT teams strike a balance when they work together. Now, while ChatGPT was trained with more data and compute, CriticGPT was tested with a similar setup for a fair comparison, and CriticGPT showed higher precision and recall on human-detected bugs, proving more effective for code critique. Models without tamper data underperformed, likely due to lower agreement rates and a less effective reward model.

Do LLM critics generalize beyond code? Yes, they can. CriticGPT was tested on general assistant tasks from ChatGPT training data that had been marked as flawless. In 24% of cases, CriticGPT identified problems that lowered the rating significantly, compared to only 6% without critiques. Using critique reward models to prioritize tasks also improved problem detection and reduced hallucinations.

So what's the key takeaway about large language models? It's that they have become so advanced that typical humans can't consistently evaluate their output without help, and this highlights the growing need for scalable oversight methods. Whether RLHF remains the primary post-training method or not, we must ensure that model outputs are trustworthy. The approach here is straightforward: train models to help humans evaluate other models. These LLM critics are already successful at catching bugs in real-world data, and even accessible models like ChatGPT can significantly assist human annotators and contractors. As LLM intelligence continues to improve, finding scalable methods to reward the right behaviors in AI systems is going to be crucial, and LLM critics show promise as a starting point.

All right, so did you guys like it? If you did, give it a thumbs up. Let me know in the comments what other variations or details I should add and how we can make it more interactive. I would love your feedback on the series, and do tell me the kind of papers we should read together. If you want to give suggestions, you can also join my Discord community, where we keep sharing these sorts of resources. But yeah, that's it for this time. I'll catch you guys in the next one. Until then, keep learning, keep building.
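
The transcript mentions that critique comparisons are summarized with Elo scores and that confidence intervals come from a non-parametric bootstrap. A minimal sketch of both ideas follows. The 400-point logistic scale is the conventional Elo formula and is an assumption here, since the transcript does not say which scale the paper uses; the function names and the toy outcome list are hypothetical.

```python
import random


def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected win rate of A over B under the conventional Elo formula.

    Assumption: the standard 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a win rate.

    outcomes: list of 1 (critique A preferred) or 0 (critique B preferred).
    Resamples the comparisons with replacement and reads the interval
    off the sorted resampled means.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data (hypothetical): 100 pairwise comparisons, 63 won by critique A.
outcomes = [1] * 63 + [0] * 37
low, high = bootstrap_win_rate_ci(outcomes)
print(f"win rate 0.63, 95% bootstrap CI roughly [{low:.2f}, {high:.2f}]")
print(f"Elo win probability at +100 rating: {elo_win_probability(1600, 1500):.2f}")
```

Equal ratings give a win probability of exactly 0.5, and the bootstrap makes no distributional assumption beyond treating the comparisons as exchangeable.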

Original Description

OpenAI has unveiled CriticGPT, a new AI model based on GPT-4 designed to identify errors in code generated by ChatGPT, marking a significant step towards improving the accuracy and reliability of AI-generated outputs.

Link to the paper: https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf

---

## Want to get more development-oriented insights?

Subscribe to my Newsletter to stay up to date on such updates in the world of AI

- High Signal AI Newsletter: https://highsignalai.substack.com/
- High Signal AI Instagram: https://www.instagram.com/highsignal_ai/

## AI Engineer Roadmap

- Roadmap video: https://youtu.be/br8u4JwXMBU
- Roadmap GitHub (don't forget to leave a star): https://github.com/dswh/ai-engineer-roadmap

## Social Media & Discord Server Invitation

Follow me for more AI Engineering resources, tutorials, and reviews:

- LinkedIn: https://www.linkedin.com/in/tyagiharshit/
- X / Twitter: https://twitter.com/dswharshit
- Join the Discord community for ideas, discussion, reviews, and more: https://discord.gg/rssxJV2Xkz

## Chapters

0:00 Intro and tea
00:17 📄 Critic GPT overview and significance
01:35 🧠 Challenges in evaluating AI outputs
04:23 🛠️ Methods for training LLM critics
06:51 🔄 Training with RLHF and FSBS approach
10:08 📊 Results and implications of Critic GPT
13:50 What should we read next!


The video discusses CriticGPT, a novel approach to enhancing the reliability of AI-generated content, and its ability to identify errors in code generated by ChatGPT. The model is trained with RLHF and uses Force Sampling Beam Search (FSBS) to balance real and spurious issues in critiques. By watching this video, viewers can learn how CriticGPT was trained and evaluated and why LLM critics matter for scalable oversight.

Key Takeaways

Sample several critiques for each question and answer pair
Rate these critiques on various attributes including overall quality
Train a reward model to predict quality rankings
Optimize a policy against this model using PPO
Apply an inference-time sampling strategy, Force Sampling Beam Search (FSBS)
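
The rating and reward-model steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the transcript says a reward model is trained to predict quality rankings but does not give the loss, so the pairwise logistic (Bradley-Terry style) loss below is the commonly used choice in RLHF, not a confirmed detail of the paper. The helper names and the toy ratings are hypothetical.

```python
import math
from itertools import combinations


def preference_pairs(ratings):
    """Convert {critique_id: contractor_rating} into (preferred, rejected) pairs.

    Ties carry no ranking signal, so they are skipped.
    """
    pairs = []
    for a, b in combinations(ratings, 2):
        if ratings[a] > ratings[b]:
            pairs.append((a, b))
        elif ratings[b] > ratings[a]:
            pairs.append((b, a))
    return pairs


def pairwise_loss(reward_preferred, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_preferred - r_rejected)).

    Assumption: a common reward-model loss in RLHF, used here for illustration.
    """
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))


# Toy ratings (hypothetical) for four sampled critiques of one answer.
ratings = {"c1": 6, "c2": 3, "c3": 6, "c4": 1}
pairs = preference_pairs(ratings)
print(pairs)

# The loss shrinks as the reward model scores the preferred critique higher.
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

A policy would then be optimized against the trained reward model with PPO, which is beyond the scope of this sketch.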

💡 CriticGPT helps humans create more comprehensive critiques, and its Force Sampling Beam Search step (constrained sampling, beam search, and a scoring function) manages the tradeoff between comprehensiveness and hallucinations or nitpicks.
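
As described in the transcript, the FSBS scoring function combines the reward model (RM) score with a length modifier multiplied by the number of highlights. A small sketch of that selection step follows. The `Candidate` class, the scores, and the modifier values are hypothetical, and treating the modifier as a tunable signed value (negative penalizes extra highlights, positive rewards them) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    rm_score: float      # reward-model score (made-up values below)
    num_highlights: int  # number of quoted sections the critique comments on


def fsbs_score(candidate: Candidate, length_modifier: float) -> float:
    """Score = RM score + length modifier * number of highlights."""
    return candidate.rm_score + length_modifier * candidate.num_highlights


def pick_best(candidates, length_modifier):
    """Return the candidate critique with the highest combined score."""
    return max(candidates, key=lambda c: fsbs_score(c, length_modifier))


candidates = [
    Candidate("short critique", rm_score=1.0, num_highlights=1),
    Candidate("long critique", rm_score=1.2, num_highlights=4),
]

# A negative modifier penalizes extra highlights and selects the short critique.
print(pick_best(candidates, length_modifier=-0.1).text)
# A positive modifier rewards extra highlights and selects the long critique.
print(pick_best(candidates, length_modifier=0.1).text)
```

Varying the modifier is what lets the search trade longer, more comprehensive critiques against the risk of hallucinated or nitpicky comments.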


Chapters (7)

0:00 Intro and tea

0:17 📄 Critic GPT overview and significance

1:35 🧠 Challenges in evaluating AI outputs

4:23 🛠️ Methods for training LLM critics

6:51 🔄 Training with RLHF and FSBS approach

10:08 📊 Results and implications of Critic GPT

13:50 What should we read next!
