S* for AI CODE Generation: Plus 100%

Discover AI · Advanced ·🧠 Large Language Models ·1y ago

Skills: LLM Engineering 90% · Fine-tuning LLMs 80% · Multimodal LLMs 70% · Prompt Craft 60% · Advanced Prompting 50%

Key Takeaways

The video discusses S*, a hybrid test-time scaling (TTS) framework that improves the coverage and selection accuracy of LLM-generated code. It walks through the two stages of the method: parallel sampling with iterative debugging, followed by selection of the best sample via adaptive input synthesis.

Full Transcript

Hello community! Today we want to increase AI coding performance by at least 100%. We call it S*, so let's have a look.

You have your local computer with a GPU, let's say 16 GB of VRAM, and you want to use a typical code LLM to help you code, locally on your machine. Imagine you have a Qwen Coder that has a normal performance of about 20%, and you say: hey, can I at least double my performance? Yes: with the S* methodology you can jump right up to the normal performance of a Coder with 14 billion free trainable parameters. So you go from a 3B to a 14B model and you have almost the same performance. This is really something, because if you are limited on your local compute, this is the way to get better performance without going to the cloud. So what we do: we run a Coder 3B model with S*, and we have the performance of a Coder 14B model.

Now how do we achieve this? This is easy. You remember we have test-time scaling, TTS, and now we apply it to code generation. As I said when we opened the video, we have S*, and this is the first hybrid test-time scaling framework to improve the accuracy of AI-generated code. The principles are identical to what we had in mathematical reasoning, because, you remember, test-time scaling was used extensively (we discussed it in multiple videos) to improve causal reasoning and mathematical reasoning. We had more or less three ideas that we implemented: parallel sampling, which increases the solution coverage; sequential refinement, which improves the individual samples through rethinking and revising; and, my last focal point, reward models that guide the search through the search space much more efficiently.

Talking about my last videos: we had "three insane TTS models" that I showed you, even with a new Transformer architecture, to have this for a language model. In another video we discussed it for vision-language models, so anything in robotics, or anything with computer use, where the AI simply has to understand your computer screen and click on the right button. There we had a look at the latest reward vision-language model; it was called ARMAP, and in 48 minutes I explain everything about this methodology in detail. And now comes the next step: we move to the third area, and this third area is code.

So what is our spark of genius here? This is easy: it is a publication from UC Berkeley, great team, beautiful idea, published February 20th, 2025. They tell us: S*, test-time scaling, now for AI code generation. They tell us that this extends the existing parallel scaling paradigm with sequential scaling to push the performance boundaries of AI code generation. So finally we have test-time scaling also for our code optimization.

Here you have a visualization, not a complete one, of the performance gains you can achieve. You see, the higher you go, and when we have models that are already inherently reasoning models, like o1-mini, the gains are not that massive. But for the Qwen Coder we started from a standard performance of roughly 20% and got more than double the performance. If you go to the rather huge, unfortunately proprietary models like o1 and o3, you see that the increase is not 25% but around 10%, which is also impressive for a model that already has such a high performance. So you can apply this to the non-reasoning models, the normal models that we have, then to the R1 reasoning models, to the QwQ-32B-Preview reasoning model, to the o1 and o3 reasoning models, and so on. This is a methodology you can apply broadly, and this is just great.

So let's come to the core idea, and the core idea is simple. If you remember what we have done for the language models and for the vision-language models for robotics, guess what: we do more or less the same. We have a two-stage, two-phase approach.

In stage one, we have a problem description. Let's have a look: given a positive integer number represented as a string, return the integer number without trailing zeros, as a string. So if the input has some trailing zeros, just get rid of them and you have the output. And we have some public tests, as you see. What S* does in stage one is generate parallel samples, and S* extends these parallel samples with iterative debugging. This is the beauty of code: with a debugger we immediately see where the mistake is. Each sample is tested using the public test cases, executed via an interpreter, with the outputs and error messages used to guide the next round of sample generation. Couldn't be easier. The iterative debugging goes through multiple rounds; you define the maximum number of rounds you want to spend your time budget on.

Then stage two. S* selects the best sample by prompting an LLM to generate inputs that differentiate between paired samples, and then leverages the actual execution results to inform the LLM's choice of the optimal sample. You see, with code this is so easy: we have iterative debugging, then our interpreter, and we immediately know whether it is working, yes or no, and whether we achieve what we set out to do. I give you the complete code in a minute; it's ready for you, and you can use it immediately.

Let's talk about the facts. From the study, I just want to show you three benchmark results. First, Qwen2.5 7B Instruct plus this new S* code improvement: the 7B now outperforms a 32B Instruct model on LiveCodeBench by quite an interesting amount. Second, GPT-4o mini with S* surpasses o1-preview. Never mind that nobody uses o1-preview anymore. Third, it also lets open reasoning models achieve performance competitive with state-of-the-art proprietary closed models: DeepSeek-R1-Distill-Qwen-32B with S* comes close to the state of the art, the OpenAI o1 (high) model. So the achievements are there and they are really impressive. Especially if you want to work locally, but even if you are in the cloud: if you get a 10% or 20% boost in accuracy for your code generation, this is something you should use.

So this is it, more or less. At the end, I just want to show you some ablation studies. They looked at, never mind which model it is, the S* generation and then the selection, and asked: what is the most important step? I told you this is a two-stage process. From the first stage you see we got a 6.7% improvement, and from the second 13%. Interesting: the second one matters so much more. And here, for the R1-Distill 14B, we have plus 3% and plus 16%. So this second stage, the adaptive input synthesis, is what you could call the performance core of S*.

What does it mean in detail? For each pair of samples, an LLM is prompted to generate distinguishing test inputs. These inputs are then executed, in Python, and the outputs are provided to ground the LLM when it selects the best sample. So it's rather easy: this adaptive, execution-grounded approach ensures a robust identification of the correct solution. For code, this is a really important step.

As I told you, we do have a GitHub page: NovaSky, SkyThought. Remember, this is the same team from UC Berkeley that gave us Sky-T1, which was similar to the o1 model, the o1-preview model. Here you have the GitHub repo, and as you see I'm really early, because just one hour ago SkyThought updated the code for the S* approach. So you have everything available there. And if you wonder what NovaSky or SkyThought is, different teams from UC Berkeley, I show them to you in a minute. Here you have the release of S*, February 21st, 2025, code and paper: a simple and extensible test-time scaling framework for code generation. It really works with quite a lot of models; I would be surprised if you find a model where it doesn't work at all. Python notebook, Jupyter notebook, everything is there.

As for the team, I don't know if you noticed: UC Berkeley Sky Computing Lab. Really nice. You have all the people; this is just the first third of them: the core faculty, professors and associate professors, and further faculty, and then you could scroll down two or three times to see all the students. Amazing team, great code. And if you look back, where is it, Apache 2 license. So it is there for you; why not use it? I just recommend you use this to upgrade your local code LLM, or maybe, in a short time, some code vision-language model, to get a better understanding of the physical surroundings of a robotic system. But this would be the content of our next video. If you like this kind of video, hey, why not subscribe, and I see you in my next video.
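The stage-one loop described in the transcript (sample, run the public tests, feed the error back, retry up to a maximum number of rounds) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_llm` is a hypothetical stand-in for a real model call, the `solve` entry point and the test values are made up for the trailing-zeros example, and `exec` is used without any sandboxing, which would not be acceptable for untrusted generated code.

```python
from typing import Callable, List, Optional, Tuple

PROBLEM = "Given a positive integer as a string, return it without trailing zeros, as a string."
# Hypothetical public tests for the example problem: (input, expected output).
PUBLIC_TESTS: List[Tuple[str, str]] = [("51230100", "512301"), ("123", "123")]

# Assumed signature of the model call: (problem, previous code, feedback) -> new code.
LLM = Callable[[str, Optional[str], Optional[str]], str]


def run_public_tests(source: str, tests: List[Tuple[str, str]]) -> Optional[str]:
    """Execute a candidate program; return None if all tests pass, else an error message."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # no sandboxing: acceptable only for this toy example
        solve = namespace["solve"]
        for arg, expected in tests:
            got = solve(arg)
            if got != expected:
                return f"solve({arg!r}) returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def toy_llm(problem: str, previous_code: Optional[str], feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: the first draft is buggy, the revision is fixed."""
    if feedback is None:
        return "def solve(num):\n    return num.replace('0', '')\n"  # removes ALL zeros
    return "def solve(num):\n    return num.rstrip('0')\n"


def iterative_debug(problem: str, llm: LLM, tests: List[Tuple[str, str]],
                    max_rounds: int = 3) -> Tuple[str, bool]:
    """Refine one sample: test it, and on failure ask the model to revise using the feedback."""
    code = llm(problem, None, None)
    for _ in range(max_rounds):
        feedback = run_public_tests(code, tests)
        if feedback is None:
            return code, True
        code = llm(problem, code, feedback)
    return code, run_public_tests(code, tests) is None


def stage_one(problem: str, llm: LLM, tests: List[Tuple[str, str]],
              n_samples: int = 4, max_rounds: int = 3) -> List[Tuple[str, bool]]:
    """Generate several samples and debug each one independently."""
    return [iterative_debug(problem, llm, tests, max_rounds) for _ in range(n_samples)]


samples = stage_one(PROBLEM, toy_llm, PUBLIC_TESTS, n_samples=2)
print(sum(ok for _, ok in samples), "of", len(samples), "samples pass the public tests")
```

With a real model the samples would differ from one another, and the ones that survive the public tests would be passed on to the selection stage.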

Original Description

S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of LLM-generated code. Also for deep reasoning models.

All rights w/ authors:
S*: Test Time Scaling for Code Generation
Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
University of California, @UCBerkeley
Work done w/ support from https://lambdalabs.com

#airesearch #codegeneration #aicoding #berkeley

The video explains how the S* framework applies test-time scaling (TTS) to code generation in order to improve performance. It covers the two-stage approach: generating parallel samples that are refined through iterative debugging, then selecting the best sample with adaptive input synthesis grounded in actual execution results.

Key Takeaways

Run a 3B Coder model with S*
Apply test-time scaling (TTS) to code generation
Extend the parallel scaling paradigm with sequential scaling
Generate parallel samples and refine them through iterative debugging on the public tests
Select the best sample using adaptive input synthesis
Prompt an LLM to generate test inputs that distinguish between paired samples
Execute the test inputs in Python
Provide the execution outputs to the LLM so it can select the best sample
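The selection steps listed above can be sketched as a small pairwise comparison. This is only an illustrative sketch under assumptions: `propose_inputs` and `judge` are hypothetical stand-ins for the two LLM prompts (input synthesis and execution-grounded choice), the two candidates are hand-written for the trailing-zeros example, and the simple win-counting scheme is a simplification rather than the exact procedure from the paper.

```python
from typing import Callable, List

Candidate = Callable[[str], str]
PROBLEM = "Given a positive integer as a string, return it without trailing zeros, as a string."


def cand_a(num: str) -> str:
    return num.rstrip("0")


def cand_b(num: str) -> str:
    # Subtly wrong: removes at most one trailing zero.
    return num[:-1] if num.endswith("0") else num


def propose_inputs(problem: str) -> List[str]:
    """Hypothetical stand-in for prompting an LLM to synthesize distinguishing inputs."""
    return ["10", "1200", "7", "1000000"]


def looks_correct(test_input: str, output: str) -> bool:
    """Hand-written check for this toy task: output is the input minus only trailing zeros."""
    return (not output.endswith("0")
            and test_input.startswith(output)
            and set(test_input[len(output):]) <= {"0"})


def judge(problem: str, test_input: str, out_a: str, out_b: str) -> int:
    """Hypothetical stand-in for the LLM judge that sees real execution outputs.
    Returns 0 to prefer the first output, 1 to prefer the second."""
    return 0 if looks_correct(test_input, out_a) else 1


def run(candidate: Candidate, test_input: str) -> str:
    """Execute a candidate, turning exceptions into a visible output string."""
    try:
        return candidate(test_input)
    except Exception as exc:
        return f"<error: {type(exc).__name__}>"


def select_best(problem: str, candidates: List[Candidate]) -> int:
    """Compare every pair on the synthesized inputs; return the index with the most wins."""
    wins = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            for test_input in propose_inputs(problem):
                out_i, out_j = run(candidates[i], test_input), run(candidates[j], test_input)
                if out_i == out_j:
                    continue  # this input does not distinguish the pair
                winner = (i, j)[judge(problem, test_input, out_i, out_j)]
                wins[winner] += 1
    return max(range(len(candidates)), key=wins.__getitem__)


print("selected candidate index:", select_best(PROBLEM, [cand_b, cand_a]))
```

Note that inputs such as "10" or "7" produce identical outputs for both candidates and are skipped; only inputs that expose a difference give the judge something to decide on, which is the point of synthesizing them per pair.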

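💡 Per the video, the S* framework can roughly double the code generation performance of a small model (a 3B Coder model reaching about the level of a 14B one). It does so with a two-stage approach: parallel sampling with iterative debugging, followed by selection via adaptive input synthesis.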
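To see why parallel sampling raises coverage, and why a selection stage is still needed, a back-of-the-envelope calculation can help. The sketch below assumes an idealized model that is not taken from the video or the paper: every sample is correct independently with the same probability `p`, so the chance that at least one of `n` samples is correct is 1 - (1 - p)^n. The function name `coverage` and the value 0.2 are illustrative only.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    assuming each sample is correct with the same probability p (idealized)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Illustrative only: a per-sample success rate of 0.2 and growing sample counts.
for n in (1, 2, 4, 8, 16):
    print(n, round(coverage(0.2, n), 3))
```

Coverage only says that a correct sample exists somewhere in the set; the final accuracy also depends on picking it, which is the job of the selection stage.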
