TEST TIME Optimized AI REASONING (MIT)

Discover AI · Advanced ·🧠 Large Language Models ·1y ago

Key Takeaways

Optimized Test-Time Training (TTT) with a Leave-One-Out (LOO) strategy for large language models (LLMs) is introduced and applied specifically to the Abstraction and Reasoning Corpus (ARC). The approach, which the presenter calls "TTT star" and compares to OpenAI o1, improves LLM reasoning through instance-specific training and test-time augmentation.

Full Transcript

TTT simply stands for test-time training, and you're not going to believe what MIT found: if you optimize this old methodology, it could play a really important role in the advancement of the next generation of our AI systems. So why not have a look?

Hello community, we are back, and we look at a publication from MIT about test-time training for abstract reasoning. MIT optimized the hell out of this system. They investigated every aspect of it, and there are some surprising facts about how we can increase the complexity and the intelligence of AI systems. With this new TTT star, as I call it, with its significant improvements, we can now solve evaluation tasks up to six times better, and this is unbelievable, because on the ARC test we are now at 53% accuracy.

You might say, what the hell is ARC? This is it: the publication from November 5th, 2019, from Google, "On the Measure of Intelligence". This is a new benchmark called the Abstraction and Reasoning Corpus, built on an explicit set of priors designed to be as close as possible to innate human priors. So we have a measure of how intelligent our AI systems are compared to humans in a real-world test.

I would like to show you one of those tests. Normally we have a grid of up to 30 × 30, and we have in-context examples. Let's say we are given this input and this is the result, and another example here and this is the result. Then we have the test input, and now the machine has to predict the output, given the findings, the logic, the insight, the intelligence after analyzing these examples. If you just fine-tune your LLM, you get something that does not discover the hidden rule. But if you use the new TTT star, then you are able to solve this, and the performance is 53%.

So MIT investigated TTT in particular, and I would like to stress, only for the ARC benchmark. This was the only focus; you can't really use this one-to-one for any
other system. This is only the ARC solution, but this solution has some beautiful insights. Let's have a look.

There are three crucial ingredients, MIT tells us. First, you have to fine-tune your LLM on similar tasks before you do anything with TTT, so you have to be in the domain and you have to understand the reasoning path that you are going to expect in your test. Therefore you have a fine-tuning process exactly on the domain, on the task. Then you have to construct a test-time dataset augmentation methodology; they tested several, and I'm going to show you the best one. And then we come to the core of TTT star: an instance-specific training that happens during inference. During the inference run the system will now take a minute, two minutes, five minutes to do this instance-specific training, and it does not do it on the complete system.

So TTT star allows the model, the LLM, to adapt to the specific structure of each task during the inference run. You might say, wow, so it's not general knowledge that the model has? No, the model adjusts itself during the given task. This is why it seems so similar to o1, because you know OpenAI o1 also takes a minute or two to think about it. Now, during this thinking, we have a fine-tuning happening on the system. But if you do fine-tuning, we need test-time data, and so we need a test-time augmentation, a dataset generation methodology, with which, MIT says, you are particularly successful.

So let's have a look. We will use geometric transformations like flips and rotations for the prediction and the selection, significantly enhancing the robustness of the system. This works because these geometric transformations were also part of the idea behind the generation of the ARC test itself. So this is clever: knowing exactly what the test is going to be, we now optimize the system for our ARC test, and for the ARC test only, because then those geometric transformations that
we're going to apply in our test-time data augmentation are especially helpful.

How do we do this? For inference, for our test-time dataset construction, we have to build data that is task-specific per inference run. There's a simple methodology, and I will explain why it is called leave-one-out: each input-output pair from a task is left out in turn, with the remaining pairs forming the synthetic training set. I've shown you the image. This kind of mimics few-shot learning, forcing the LLM to reason from a minimal, a really minimal, two- or three-example dataset.

So let's have a look at this. We have our ARC task, and it consists of exactly this: we have x0 and the prediction is y0, then x1 and the prediction is y1, and x2, where the prediction should be y2. Now we have the real task: we have x3, and having learned the inherent pattern in our one, two, three, if you want, training examples, the LLM should now predict the output according to the learned pattern.

Now, for fine-tuning, we understand that we need quite a huge dataset, but this is not available. This is the beauty of ARC, that it has only very limited data: one, two, three. So what this system does, and this is the idea by MIT, is generate a much more voluminous dataset. We take this one, two, three, and we have an identity. Great. Then, since this is a geometric pattern, we apply a geometric operation: a horizontal flip, a vertical flip, maybe rotations. So you understand what we are doing: we take the data, we augment it, we get much more training data, and suddenly our very specific fine-tuning mechanism becomes effective, because we finally have the minimum amount of training data needed for this fine-tuning to really happen. And we are able to understand this.

But wait, it is not that easy, because I was struck by leave-one-out. Why? There's a real, particular reason. And then this rule-based augmentation, you know, the vertical,
horizontal, identity. Great. So we just generate more training data in the inference run itself. We increase the dataset, we augment the dataset in a coherent way. Simple geometric and compositional transformations preserve the inherent task structure while introducing variation and enriching the test-time training dataset that the system needs for the fine-tuning.

And I thought, is this not ICL, in-context learning in our prompt? No, it is not, and this is important. In ICL the model does not update its parameters; it relies entirely on the examples embedded in the input prompt. Now it is different, because in TTT star the LLM does dynamically update its parameters to adapt to a very specific task. And the parameters are updated not for the complete system but for low-rank adapters. So we have an optimization technique where we build LoRA adapters for those specific tasks, which is not in-context learning. This is why it sometimes takes one or two minutes for the LLM to respond: the system is thinking, it is expanding the solution space, and it is going through multiple solutions.

So, model adaptation with LoRA. We have task-specific low-rank adapters, trained for each single task during test-time fine-tuning, enabling efficient parameter updates of the LLM without modifying the full tensor structure in all of the layers of our LLM. They are lightweight; they allow per-instance training without overwhelming computational resources. But of course the costs go up, because if you wait two minutes for an answer, they go up. Great.

So let me summarize. MIT tells us we have three critical components for TTT star. First, we need a pre-trained LLM that is really on the domain, that is really specific, because we will use it for ARC only. It is pre-trained on a broad corpus and is not only specialized for tasks like ARC, but we know from other research that
fine-tuning, if it is off-domain, has a data shift and is not effective. So you have to choose a pre-trained LLM that already has, in its pre-training dataset, let's say, argumentation and detection patterns similar to those in the ARC tasks.

But now, to optimize our pre-trained LLM, step two is to do the fine-tuning of our LLM specifically for ARC-like tasks. Our LLM undergoes a very specific fine-tuning on tasks, and we design those tasks, because we know the ARC test, so that they closely resemble the ARC abstract reasoning patterns. So we build a dataset only for the fine-tuning that is very similar to the test we are going to encounter. We know the test, and we have months to prepare our system for these test questions, and still AI just reaches 53%. Isn't this beautiful?

So this step, fine-tuning, ensures that the LLM has a basic understanding of the reasoning principles and of the geometric transformation rules it is going to encounter in the ARC test. This fine-tuning involves a static dataset with problems that are really similar to the test, but it does not adapt to the individual ARC tasks yet, because that happens in step number three: the test-time training during inference. And now we are not going with the old TTT; we are going with the TTT star optimization by MIT.

During the inference run, for each specific ARC task (this is a task-specific idea), the LLM now performs additional fine-tuning dynamically, using task-specific data. So for each single task the system says: okay, I'm going to look at the task, then I'm going to build a specific dataset for this task only, then I train my LoRA adapters on this single task only, during the inference run. Instead of retraining the entire LLM, only a small set of adapter parameters, the tensor structure in these parameters, is updated for this single, specific ARC task. So you see, for each single
task in the test we have a specific fine-tuning only on this task, and this is why it takes so long. Great.

Now there's another detail, in case you're not familiar with the ARC dataset. I was not familiar, so I had to read up on this. We already have a training set and an evaluation set in the ARC dataset. The training set features 400 tasks, while the evaluation set features 600 tasks, and these 600 are further split into a public evaluation set and a private evaluation set. So you really have the data.

But you remember what I told you: each task in ARC consists of a small number of demonstration examples. And this is the beauty, because for an original, classical fine-tuning this is not enough: our demonstration examples are about three on average per task, like the one we have seen, x0, x1, x2, three demonstration examples, and then you want the prediction for one more. The idea of MIT for this optimized TTT star is that we focus on these demonstration examples, and from them we do some additional synthetic training data generation and training data augmentation, especially for the LoRA adapter tuning during the inference run. So during inference we create new input-output demonstration examples that are really spot-on for this single ARC task, and this is why it takes so long for the system to run.

A short summary: for each specific ARC task, the system has to construct a synthetic dataset from the input-output examples, a task-specific and augmented dataset, and create additional synthetic data. Because we know the ARC test, we can prepare in advance. We know exactly the reasoning complexity, and we can instruct our system how to augment the dataset and how to create a new synthetic dataset according to certain geometric operations. Then, once we have the dataset, we fine-tune the LoRA
adapters. But I think fine-tune is now the wrong word, because we are in inference. This is done during inference; we are nowhere else, we are in the real world, waiting for the answer. So I would call it, I don't know, a LoRA training, because fine-tuning the LoRA adapter at inference sounds strange. And then we use the updated LLM, with this new LoRA adapter that is now task-specific, to make a prediction and solve the task.

So there we have it: we have now created our future AI agent. Impressive, massive, huge. We have at minimum 9,000% of the physical force of a human in this agent, but for the intelligence we reach at most 53% of human intelligence, even with this latest TTT star methodology by MIT. This is the intelligence in this body. What a machine.

But you know what? If we take a deep dive into this paper, I think there's another secret, because TTT star works for a particular reason. When these tasks were generated in 2019, they were highly abstract and variable, meaning they were designed so that no single pre-trained or fine-tuned model can generalize to them perfectly, because they are so diverse. TTT allows the model to specialize dynamically to the nuances of each task during inference, and this special feature is what matters for solving the ARC test. So during the inference run we generate, from the demonstration ARC examples that I showed you, a larger, similar, coherent dataset so that the LoRA adapter tuning is efficient. And you see, I now use the words LoRA adapter tuning and not fine-tuning, because I think fine-tuning in the inference run is a little bit strange, and that's also why it's called TTT. Great.

This leave-one-out training: I thought, is this cheating? If you know the test before, if you know all the examples before, if you have the complete... Imagine you are a student and you know that in a week you have to write a test, and the test is given to you a week before. Is this not cheating in a certain way? Well, if you are really technical,
not really, because the training is this specific leave-one-out training. They say: if we have a training sequence, we always leave one element of the training sequence out. It is unseen, unknown, and we train on the other examples, we update the LoRA parameters on the synthetic dataset, and we always leave out the one that we have to predict.

So, to make the key principle of this leave-one-out clear: for a given test input-output pair x and y at step t, the model is fine-tuned, or LoRA-tuned, using the remaining pairs from the task, so x1, y1, x2, y2, to predict it. This process is repeated independently for each single test pair, and the independence of this process ensures that there is no overlap between the training data and the specific pair being predicted. This is a very carefully designed training procedure that is highly specific to the ARC test. I think outside of this you would have real problems implementing this particular training methodology.

So you do this for each step. Let's take a simple example: predict y1 for x1. You exclude this pair from the task, so you train your adapter, your model, on the remaining pairs, x2 and x3. You fine-tune the LoRA adapter now using only this subset without x1, and then you use the predictive power of the LoRA adapters that have been trained on it to predict y1 for x1. Then you go to step two and step three. So you go through the task in sub-steps, and you divide the task into independent evaluation and prediction tasks.

Now, I was not convinced that this is not cheating. I said: but you give the complete dataset to the system, and then you say do the test, and then you fail in 47%? Well, if the model reused the parameters across predictions within the same task, it could be considered cheating, because then you would allow the model to leverage information from the test pairs it has already predicted, and this could artificially inflate the performance by creating unintended
dependencies. But you know, this is the trick: the LoRA adapters are reset for each prediction task. And I thought, do you have multiple adapters for one specific task? No. In my understanding, and maybe I'm wrong in how I read it, the LoRA adapters are reset for each prediction. So this means that we are now really deep down in a specific task and its subdivisions.

So, in my understanding, this new TTT star mechanism fundamentally deconstructs the main ARC task into independent subtasks and then solves each subtask in isolation. This means that the very specific design of the ARC test, and this very specific answer by MIT to optimize TTT star for this test, lead to a new design solution that focuses on local reasoning over the subproblems rather than addressing the entire task complexity as a whole, as a unified problem. You break down the task into subtasks and you reduce the complexity during the inference run, of course, because otherwise you would have to wait hours for one answer.

But I ask: does this division, this deconstruction of the task, make sense? Splitting the task into subtasks allows the model to adapt to a specific sample, making it better suited to the few-shot learning nature of ARC, which is great. And deconstructing the task of course reduces the computational burden of solving the entire task at once. Now we have little tasks, and we can solve these little tasks, which are easier, on a faster time scale.

But are there limitations? In my simple view, and this is not the official view of MIT, with this methodology you kind of lose the task complexity. Dividing the task into subtasks, which is possible because of the very specific nature of the ARC examples, ignores what I call the interconnectedness of the relationships between all these input-output pairs, you know, x0, y0, x1, y1, x2, y2. In this broken-down version, because you only have one task and you optimize the
adapters for each single task, you can do this. But imagine you had a question with two or three interconnected tasks. Then I think this new MIT methodology would face limitations, and it might introduce logical gaps between those tasks. In a test it's easy: you have single test questions, each single test question has one result, and maybe those single test questions are not related within the test. But in the real world, if you face a problem, you have related problems that you have to solve, and therefore I think this will fail in that specific use case. But this is really a detail.

So I personally think there's no holistic task understanding with this particular solution, TTT star. The system doesn't develop a unified understanding of the complete task complexity, of the complete test, or of its interdependencies. Instead, it performs really well, 53%, on each and every subtask, without explicitly validating whether these subtasks align into a coherent system as a whole, and this system is the object we want to solve.

Now, in a video less than a week ago, I showed you that Meta is now repositioning its AI systems globally, including especially the use case of Meta, a US company, for US national security and everything that is connected with this. And somehow, I don't know, I have a feeling, if I think about the future, that those Meta LLMs are now the intelligence of our agents. If we bring them to, if you want, the battlefield, I can imagine that we build agents that are, as I told you, 9,000 or 10,000% human-equivalent, but what about the part of the intelligence? And then we have two of those facing each other. This is the future, because we humans are not able to talk to each other and find solutions? I don't know, I have a bad feeling about this.

But let's come to the positive side. The positive side is that in my last video we were already talking about a very similar topic in TTT, and
we were talking about this brand-new idea, I think even just two days ago: this reward-guided tree search framework for TTT. I showed you that if you want to increase the reasoning capabilities of an LLM, we have another methodology that works toward a more coherent, holistic solution to a problem. So now we have it: in my video yesterday I showed you the holistic solution, and in this video today I show you the absolute nitty-gritty, detailed subtask solution.

And yes, you guessed what my idea is. We take the different aspects of our TTT star from MIT: it is limited to local reasoning, with beautiful performance on these few-shot reasoning examples used for the fine-tuning of the LoRA adapters for each single task, so that it solves one input-output relationship at a time, this decomposed, automatic solution. And we combine it with the idea from the last video, this reward-guided tree search framework, where we have holistic task solving: it captures the global context, captures the global dependencies, handles complex tasks that require global logic, and the system learns and reasons across all the relationships it encounters. These are the technical reports I have shown you already. I think then TTT really becomes interesting, and I think this could be the way forward.

And you see, this was the publication from November 11 and this was the publication from November 18, so I think next week could be a really interesting week if we look forward. I hope you enjoyed it, I hope you found some new ideas, and it would be great to see you in my next video.
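
The leave-one-out construction and the geometric augmentation described in the transcript can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the MIT paper: the grid encoding (lists of integers), the helper names (`leave_one_out`, `build_ttt_dataset`, `AUGMENTATIONS`) and the toy task (swap colours 1 and 2) are all hypothetical, and the sketch assumes that the same transformation is applied to both the input and the output grid so that the task's rule is preserved.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]  # (input grid, output grid)


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_horizontal(g: Grid) -> Grid:
    # Mirror each row left to right.
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    # Mirror the grid top to bottom.
    return [row[:] for row in g[::-1]]


def rotate_90(g: Grid) -> Grid:
    # Rotate the grid 90 degrees clockwise.
    return [list(row) for row in zip(*g[::-1])]


# Hypothetical augmentation set; the transcript names identity, flips and rotations.
AUGMENTATIONS: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "rotate_90": rotate_90,
}


def leave_one_out(pairs: List[Pair]) -> List[Tuple[List[Pair], Pair]]:
    # Hold out each pair in turn; the remaining pairs act as demonstrations.
    return [(pairs[:i] + pairs[i + 1:], pairs[i]) for i in range(len(pairs))]


def build_ttt_dataset(pairs: List[Pair]) -> List[dict]:
    # One synthetic training item per (augmentation, held-out pair) combination.
    dataset = []
    for name, transform in AUGMENTATIONS.items():
        augmented = [(transform(x), transform(y)) for x, y in pairs]
        for demos, held_out in leave_one_out(augmented):
            dataset.append({"augmentation": name, "demos": demos, "target": held_out})
    return dataset


# Toy task with three demonstration pairs; assumed rule: swap colours 1 and 2.
pairs: List[Pair] = [
    ([[1, 0], [0, 2]], [[2, 0], [0, 1]]),
    ([[2, 2], [0, 1]], [[1, 1], [0, 2]]),
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
]

dataset = build_ttt_dataset(pairs)
print(len(dataset))  # 4 augmentations x 3 held-out choices = 12
```

Each item holds two demonstrations and one held-out target, so three original pairs become twelve small few-shot training problems. In a real system these items would be serialized into prompts and used to train the task-specific adapter.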

Original Description

Optimized Test-Time Training by @mit : Shaping AI’s Future in Reasoning. This brilliant video introduces a novel approach to improving reasoning capabilities in large language models (LLMs) through Test-Time Training (TTT) with a Leave-One-Out (LOO) strategy, specifically applied to the Abstraction and Reasoning Corpus (ARC). ARC tasks require abstract pattern recognition and rule inference, often with only a few input-output examples. TTT addresses this by dynamically fine-tuning lightweight Low-Rank Adapters (LoRA) at inference time. The method deconstructs the main task into independent subtasks, using LOO to exclude one test input-output pair while fine-tuning on the remaining pairs and augmented data. This fine-tuning adapts the model to the specific logic of each task, enabling the LLM to better generalize abstract transformations while avoiding information leakage from the excluded pair. The augmentation process enriches the limited examples with transformations like flips, rotations, and rule-based variations, ensuring robust task-specific adaptation. This dynamic TTT process contrasts with static pre-training or in-context learning by actively updating model parameters during inference. Unlike in-context learning, which leverages examples directly as input without parameter updates, TTT uses the auxiliary dataset to fine-tune LoRA adapters for each subtask independently. This enables the model to handle ARC’s unique challenges, such as generalizing from minimal data and adapting to task-specific reasoning rules. Achieving a state-of-the-art accuracy of 53% on ARC validation, the approach demonstrates significant performance improvements over baseline methods and offers a scalable framework for abstract reasoning tasks, especially in few-shot scenarios.

All rights w/ authors: The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
https://arxiv.org/pdf/2411.07279v1

00:00 Optimization of Test Time Training
01:08 ARC Intelligence test
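
The description's contrast between in-context learning (no parameter updates) and LoRA-based test-time training (a small adapter is updated, the base weights stay frozen) can be illustrated with a tiny pure-Python sketch. Everything here is an assumption made for illustration: the class name `LoRALinear`, the single linear layer, the squared-error loss, the learning rate and the toy data are hypothetical, and a real implementation would use a deep learning framework rather than nested lists. The initialization (random `A`, zero `B`) follows the usual LoRA convention, so a fresh adapter leaves the base model's output unchanged.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical sketch)."""

    def __init__(self, weight: Matrix, rank: int, seed: int = 0) -> None:
        self.weight = weight  # frozen base weights, never modified below
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        # Usual LoRA initialization: A is small and random, B is zero.
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]

    def trainable_parameters(self) -> int:
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def loss(self, x: Vector, target: Vector) -> float:
        return 0.5 * sum((p - t) ** 2 for p, t in zip(self.forward(x), target))

    def sgd_step(self, x: Vector, target: Vector, lr: float) -> None:
        # One gradient step on the squared error; only A and B change.
        h = matvec(self.A, x)
        err = [p - t for p, t in zip(self.forward(x), target)]
        grad_B = [[e * hj for hj in h] for e in err]
        back = [sum(self.B[i][j] * err[i] for i in range(len(err))) for j in range(len(h))]
        grad_A = [[bj * xk for xk in x] for bj in back]
        self.B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(self.B, grad_B)]
        self.A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(self.A, grad_A)]


# Toy base layer (4x4 identity) and a single toy training example.
W: Matrix = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
W_before = [row[:] for row in W]
x, target = [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]

layer = LoRALinear(W, rank=1)  # a fresh adapter, as would be created per task
losses = [layer.loss(x, target)]
for _ in range(10):
    layer.sgd_step(x, target, lr=0.005)
    losses.append(layer.loss(x, target))

print(layer.trainable_parameters(), "adapter parameters vs", 4 * 4, "in the full weight")
```

Creating a new `LoRALinear` for the next task discards the previous adapter while the shared base weights stay untouched, which is the sense in which the adaptation is task-specific.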

Pretraining vs Fine-tuning vs In-context Learning of LLM (GPT-x) EXPLAINED | Ultimate Guide ($)
Discover AI
60 Reversible Transformer: ReFORMER for GPU Memory Optimization! Reversible Residual Layers?
Reversible Transformer: ReFORMER for GPU Memory Optimization! Reversible Residual Layers?
Discover AI

This video presents an MIT study that optimizes Test-Time Training (TTT) with a Leave-One-Out (LOO) strategy to improve the reasoning capabilities of large language models (LLMs), demonstrating improved performance on the Abstraction and Reasoning Corpus (ARC) benchmark. The methodology combines instance-specific training with test-time augmentation, allowing the model to adapt to the specific structure of each task. By applying this approach, viewers can improve the reasoning capabilities of their own L

Key Takeaways
  1. Fine-tune the base LLM on similar tasks before applying TTT
  2. Construct an augmented dataset at test time
  3. Use geometric transformations such as flips and rotations for test-time augmentation
  4. Train on each instance specifically during inference
  5. Build task-specific data for each inference run
  6. Construct the training set with the leave-one-out method
  7. Fine-tune low-rank adapters for each task at test time
  8. Update the parameters of the LM dynamically for task-specific adaptation
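The data-construction takeaways (geometric augmentation plus leave-one-out) can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the grid representation, the function names (`leave_one_out`, `build_ttt_dataset`), the chosen set of transformations, and the decision to apply the same transformation to both the input and the output grid of every demonstration pair are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]      # an ARC-style grid of color indices (assumed format)
Pair = Tuple[Grid, Grid]    # one (input, output) demonstration


def identity(g: Grid) -> Grid:
    return [list(row) for row in g]


def rotate90(g: Grid) -> Grid:
    # Clockwise rotation: reverse the row order, then transpose.
    return [list(row) for row in zip(*g[::-1])]


def flip_h(g: Grid) -> Grid:
    # Mirror left to right.
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    # Mirror top to bottom.
    return [list(row) for row in g[::-1]]


# Hypothetical augmentation set; which transformations are used is an assumption.
TRANSFORMS: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "rotate90": rotate90,
    "flip_h": flip_h,
    "flip_v": flip_v,
}


def leave_one_out(demos: List[Pair]) -> List[Tuple[List[Pair], Pair]]:
    # Each demonstration is held out once as the training target,
    # with the remaining demonstrations serving as in-context examples.
    return [(demos[:i] + demos[i + 1:], demos[i]) for i in range(len(demos))]


def build_ttt_dataset(demos: List[Pair]) -> List[dict]:
    # For every transformation, transform all pairs consistently and
    # emit one leave-one-out training item per held-out pair.
    dataset = []
    for name, t in TRANSFORMS.items():
        transformed = [(t(x), t(y)) for x, y in demos]
        for context, target in leave_one_out(transformed):
            dataset.append({"transform": name, "context": context, "target": target})
    return dataset
```

With three demonstration pairs and four transformations, `build_ttt_dataset` yields twelve training items, each holding two context pairs and one held-out target, all derived from the single task being solved.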
💡 The TTT Star system (the presenter's name for the optimized TTT method) uses instance-specific training during inference, which lets the model adapt to the specific structure of each task, and achieves improved performance on the ARC benchmark through test-time augmentation and fine-tuning.
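To make the idea of fitting a low-rank adapter per task at test time more concrete, here is a toy sketch in plain Python. It is an assumption-laden illustration rather than the method from the video: a tiny frozen linear map `W` stands in for the pretrained model, the adapter is the product `B @ A` added on top of it, and the helper names (`forward`, `mse`, `fit_adapter`), the learning rate, and the step count are all hypothetical. A real setup would train adapters inside an LLM with a deep-learning framework.

```python
import random


def matvec(M, v):
    # Matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(W, A, B, x):
    # y = W x + B (A x): W stays frozen, only A and B are trained.
    h = matvec(A, x)
    return [b + d for b, d in zip(matvec(W, x), matvec(B, h))]


def mse(W, A, B, examples):
    # Mean over examples of the summed squared error.
    total = 0.0
    for x, y in examples:
        total += sum((p - t) ** 2 for p, t in zip(forward(W, A, B, x), y))
    return total / len(examples)


def fit_adapter(W, examples, rank=1, lr=0.05, steps=200, seed=0):
    # Fit a fresh adapter for ONE task from that task's own examples.
    rng = random.Random(seed)
    d_out, d_in = len(W), len(W[0])
    # Small random A and zero B, so the adapter starts as a no-op.
    A = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    n = len(examples)
    for _ in range(steps):
        grad_A = [[0.0] * d_in for _ in range(rank)]
        grad_B = [[0.0] * rank for _ in range(d_out)]
        for x, y in examples:
            h = matvec(A, x)
            pred = [b + d for b, d in zip(matvec(W, x), matvec(B, h))]
            err = [2.0 * (p - t) / n for p, t in zip(pred, y)]
            for i in range(d_out):
                for k in range(rank):
                    grad_B[i][k] += err[i] * h[k]
            for k in range(rank):
                back = sum(err[i] * B[i][k] for i in range(d_out))
                for j in range(d_in):
                    grad_A[k][j] += back * x[j]
        # Plain gradient-descent update of the adapter weights only.
        for k in range(rank):
            for j in range(d_in):
                A[k][j] -= lr * grad_A[k][j]
        for i in range(d_out):
            for k in range(rank):
                B[i][k] -= lr * grad_B[i][k]
    return A, B
```

In this sketch a new `(A, B)` pair would be fitted for each test task and discarded afterwards, so the frozen base weights are shared across tasks while the task-specific adaptation lives entirely in the small adapter.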
