VISUAL Intelligence - Latest Research
Key Takeaways
The video discusses the latest research on VISUAL Intelligence, introducing a new method called RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought. RSVP couples the reasoning of a multimodal large language model with a segmentation model in a two-stage pipeline that needs no fine-tuning. Tools and models mentioned include PyTorch, Google's Gemma 3 27B, the vision transformer, LISA, RSVP, and SAM (Segment Anything Model).
Full Transcript
Hello community. So great that you are back. After a week in which we talked only about textual reasoning, let's advance the complexity and include visual reasoning, so everything from images to short videos. What could go wrong if we look at the latest AI research?

Let's start with a simple example. I have this image and I say, "Hey, find the tool that I would use to loosen a rusty bolt." Now, you might go with an object detector, an AI system that will find all the different tools. Or you go with a segmentation model that might not understand the context of "rusty bolt" at all. Or you go with a multimodal LLM that can understand that you are probably looking for a wrench, but it struggles to precisely outline that wrench in this particular image. So you see, we have to improve the visual reasoning abilities of the AI system, and luckily we have the latest research publication from just two days ago.

Also, there is now a motto for this video: be the reason someone smiles today. I hope it is you, because I'm going to show you AI today and where we should be in the next days.

This is a simple synthetic image, and I ask Google's Gemma 3 27B: "Explain every item you can identify on this image." And I have to tell you, Gemma 3 is really good, because it tells me: the central processor unit, the core with the silicon die, the heat spreader, the integrated heat sink; layered components, the circuit motherboard, traces and lines, layers, connector pads; particles, the waveform, the color impression; plus futuristic elements, holographic and wireframe appearances, the geometric shapes, the color palette; "a highly stylized and artistic representation of a computer processor and its surrounding technological ecosystem." So not bad for a free system. And you would say: hey, is this still a visual segmentation of different objects in the image, or are we already close to a visual reasoning model?
Two years ago we coded together, in PyTorch, a panoptic image segmentation with a vision transformer, and I showed you the Mask2Former-specific code. We took this image and we could identify, look here, the segmentation for each and every person or thing or cardboard or car, and the degree of certainty with which the AI could detect that this is a person: 99%. So we have known this for two years.

Today we ask something different. I say: "Identify the most crucial part in this image of this CPU for the tensor calculation of a reasoning validation protocol." So I now have a task where the AI must do some reasoning. It must understand the object, the object's environment, the object's attributes, its functions, maybe the time series of how the object will evolve. We are talking about a reasoning query, and it has a higher complexity.

Gemma 3 comes back and says: of course the most crucial part in this image is the central processor, the CPU itself, because tensor operations are hardware driven and the silicon die is the computational engine; we have parallel processing in modern CPUs, and the protocol relies on speed and accuracy. And why not the other elements? Chain of thought: it is not the circuit board, because this just provides connectivity but does not do the calculation. This is just the highway, not the engine, and so on. So chain of thought is really helpful here, and you are familiar with this.

Maybe you remember that we talked about empowering LMMs, large multimodal models, with strong reasoning ability through a two-stage rule-based reinforcement learning. And you know, everything at the beginning of 2025 has to do with reinforcement learning. This is the absolute paradigm. But how do we enhance the reasoning capacity of our large multimodal model? We face challenges here, and they come from the complex interplay between visual perception and logical reasoning.
In this particular paper they found a solution with a two-stage formula: stage one is the foundational reasoning enhancement, and stage two is a multimodal generalization training. So they built one on top of the other, and then they follow up with a classical RL policy optimization. Great. But this is a rather simple concept.

And talking about simple concepts, you know our good old friend LISA. LISA is "reasoning segmentation via large language model"; LISA stands for Language Instructed Segmentation Assistant, and LISA was, to my knowledge, the first model for reasoning segmentation. This is a crucial task in multimodal grounding: it requires models to produce those famous segmentation masks for complex, implicit textual queries. I will give you an example in a minute.

LISA was a model that tried to do everything at once. We put everything into LISA, LISA should do everything at the same time, and the output was just one special segmentation token to guide the segmentation model. And guess what? It was not the perfect solution. Just think of solving a complex differential equation in your head and plotting the solution curve in your head at the same time. Performing multiple tasks in parallel is not the optimal configuration.

Now, we have large language models, and we have multimodal large language models with textual and visual modalities combined, but they remain incapable of generating precise segmentation masks. So we went on to referring segmentation models; I will explain them in a moment. They can identify the object boundaries, but they struggle with high-level inference and reasoning, which prevents them from effectively tackling reasoning segmentation.
So the solution was to bring everything together under one huge umbrella: fine-tuning a large language segmentation model on large-scale datasets. That is extremely costly, extremely impractical, and it completely lacks scalability. So it was not really the perfect solution.

Now the authors of today's paper said: okay, we identified two key limitations of LISA. The first is the implicit reasoning: the model's thought process for what it chooses is completely hidden, so we don't know if it reasoned correctly about particular objects. The second is that everything is entangled and squeezed into one model, so the MLLM is forced to learn high-level reasoning and, at the same time, low-level spatial localization. This is not how systems should operate.

So, welcome to the new paper of today: RSVP, Reasoning Segmentation via Visual Prompting, a reasoning-driven multi-stage framework that unifies multimodal chain-of-thought prompting with visual segmentation. And you might say, hey, why so simple today? Well, we are just starting. Give me a second.

If you are new to AI, just one second on visual prompting. Have a look at this publication: a new prompting paradigm to unleash the detection ability of multimodal LLMs. It modifies the input space using human-perceivable markers such as bounding boxes, numbers or shapes. This helps our multimodal LLMs focus on key image regions without altering any model parameters. And this is great: it reduces visual hallucination and language bias. So this was an interesting paper in the development of those models.

And by the way, if we are talking about literature, this is a nice one: towards the reasoning era, a survey of long chain of thought for reasoning LLMs. You also have multimodal chain of thought, and you see here, for January, February and March 2025, the number of models that deal with multimodal chain of thought.
Of course, we cannot talk about this without SAM. Text-prompted segmentation, which as I told you is also called referring segmentation, involves extracting the object segmentation masks themselves based on natural human language queries. SAM, the Segment Anything Model, was absolutely famous at its time, and we still use it today. However, those models lack a structured knowledge summarization and a reasoning process between the text embedding and the segmentation mask. They are also trained only on short, explicit queries, which limits their ability to handle more abstract and more implicit reasoning.

So now that you know all of this, let's start with the latest AI research. What do we do? We develop a more efficient, modular reasoning segmentation framework that integrates the inherent cognitive reasoning capabilities of the multimodal LLM with structured visual segmentation, and we do not need to fine-tune the whole complex. It is a simple two-step process.

We start with multimodal chain-of-thought visual prompting. I will show you this. It enables our multimodal LLM to reason about specific object attributes and to generate a region proposal. This means it zooms in on a very particular region in the image and says: this is the region in this image where we should be looking in the next step. The next step is the vision-language segmentation module, which refines those region proposals. From this particular small region we get a further segmentation, and this builds our exact segmentation masks.

This is the study we are talking about: RSVP, June 4, 2025, "Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought."
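The two-step process described above can be sketched in Python as a plain composition of two interchangeable components. This is a minimal sketch and not the authors' code: the names `RegionProposal`, `Reasoner`, `Segmenter` and `reasoning_segmentation` are hypothetical, and the two `fake_*` functions are toy stand-ins for the real multimodal LLM and the real segmentation model, so the example runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class RegionProposal:
    """Stage-one output: what the target is, roughly where it is, and why."""
    target: str
    box: Box
    rationale: str


# Stage one: any multimodal LLM wrapped as (image, query) -> RegionProposal.
Reasoner = Callable[[object, str], RegionProposal]
# Stage two: any promptable segmenter wrapped as (image, box, text) -> mask.
Segmenter = Callable[[object, Box, str], List[List[int]]]


def reasoning_segmentation(image, query: str, reason: Reasoner, segment: Segmenter):
    """Run the two stages in sequence; neither component is fine-tuned."""
    proposal = reason(image, query)
    mask = segment(image, proposal.box, proposal.target)
    return proposal, mask


# Toy stand-ins (assumptions for illustration only).
def fake_reasoner(image, query):
    return RegionProposal("shell", (1, 1, 3, 3), "The shell protects the snail's body.")


def fake_segmenter(image, box, text):
    # Marks every pixel inside the proposed box; a real segmenter would refine it.
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    return [[1 if left <= x < right and top <= y < bottom else 0
             for x in range(width)] for y in range(height)]


if __name__ == "__main__":
    image = [[0] * 4 for _ in range(4)]  # a blank 4 x 4 "image"
    proposal, mask = reasoning_segmentation(
        image, "What protects the snail?", fake_reasoner, fake_segmenter)
    print(proposal.target, proposal.rationale)
    for row in mask:
        print(row)
```

Because each stage is only a callable, a stronger reasoning model can be swapped in without touching stage two, which is the decoupling idea the video keeps coming back to.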
So we exploit the intrinsic reasoning and localization capabilities of multimodal large language models through a structured chain-of-thought visual prompting, plus three or four other AI systems. Let's have a look at this.

You have a simple question. You have an image of a snail, beautiful, and you ask for the objects that can protect that snail and prevent it from getting injured. On this image you now get a rough localization, and, you are not going to believe it, the multimodal LLM tells you in its linguistic capabilities, in its chain of thought: yes, this is the shell of the snail. Plus, since it is multimodal, it identifies the location of this object, the shell. So we have a rough localization: the reasoning comes in, and the reasoning helps with the localization on the image. And then, to get the precise mask, we just use SAM, the Segment Anything Model. This gives us, in blue, the perfect mask, and the official answer is: the shell is the protective covering for the snail's body.

It is rather easy if you know all the components, if you know the strengths and the weaknesses of all the components, and how you combine them to get a model that outperforms everything else.

So this reasoning segmentation is the new, challenging task. It is not about "find the red car in the image." That is the referring segmentation that we know, and we know its limitations. The new task is about segmenting an object based on an implicit, complex query that requires common sense, human knowledge and a multi-step reasoning sequence. And the authors tell us: our core insight is to decouple the problem. We realized that reasoning and segmentation are two different skills, and it makes no sense to squeeze them into one model.
So they created a two-stage pipeline where each stage uses the best tool for the job, the best multimodal system, the best SAM model, without needing any fine-tuning. Great.

Stage one, you have seen it: reasoning and rough localization. I call it the thinker model. If you want, this is the most innovative part of this new paper. The goal here is not to come up with the perfect mask already in step one, but to have the multimodal LLM correctly identify the target object and tell the second element, the artist, roughly in which regions to look for the object. Then, at stage two, the artist comes in with a precise segmentation refinement and draws the perfect segmentation mask.

Stage one in detail: multimodal chain-of-thought visual prompting. How do you get an LLM, which thinks in text, in linguistic elements, to talk about image locations? It's easy. You just put a coordinate system over the image, a grid system, an overlay, whatever. Then the multimodal LLM can say, "Hey, the object that you are looking for is in vertical segments 4, 5, 6 and in horizontal segments 5, 6, 7, 8 of the image." Problem solved.

With a multimodal chain of thought, the model does not just infer the object and does not just locate it in the image; it also provides a rationale: the reason why I identified this as the correct object is that this object has this specific feature, and this object is used, I don't know, in a social context in the following form. So you have a rationale that explains the form, the functions and everything about this particular object.

The output is easy: you have a textual description of the target, you have a set of horizontal and vertical region IDs, and you have explicit, human-readable reasoning.

Stage two is easy. You just use the vision-language segmentation module that you know. You crop the image to the region of interest.
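To make the grid idea concrete, here is a small Python sketch of how region IDs could be turned into a crop box for stage two. It assumes a uniform grid with 1-based IDs numbered from the top-left corner, and it assumes that one set of IDs indexes columns and the other indexes rows; the transcript does not spell out these conventions. The function name `grid_ids_to_box` is hypothetical, and the default of 9 follows the 9 × 9 grid that the video later reports as working best.

```python
from typing import Iterable, Tuple


def grid_ids_to_box(width: int, height: int,
                    column_ids: Iterable[int], row_ids: Iterable[int],
                    grid: int = 9) -> Tuple[int, int, int, int]:
    """Convert 1-based grid IDs into a pixel box (left, top, right, bottom).

    The box spans from the smallest to the largest ID on each axis, so gaps
    in the ID lists are simply covered by the enclosing rectangle.
    """
    cols, rows = sorted(set(column_ids)), sorted(set(row_ids))
    if not cols or not rows:
        raise ValueError("need at least one column ID and one row ID")
    if cols[0] < 1 or cols[-1] > grid or rows[0] < 1 or rows[-1] > grid:
        raise ValueError(f"IDs must lie in 1..{grid}")
    left = (cols[0] - 1) * width // grid
    right = cols[-1] * width // grid
    top = (rows[0] - 1) * height // grid
    bottom = rows[-1] * height // grid
    return left, top, right, bottom


if __name__ == "__main__":
    # A 900 x 900 image, columns 5-8 and rows 4-6 on a 9 x 9 grid.
    print(grid_ids_to_box(900, 900, [5, 6, 7, 8], [4, 5, 6]))  # (400, 300, 800, 600)
```

The returned box could be passed straight to an image library's crop call, or used as a box prompt for the segmentation model.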
You use a vision-language encoder, BEiT-3. And then you have a segmentation: you have your classical SAM model, and in the end you have an exact, precise pixel mask of the object. Congratulations, you succeeded.

If you are not really familiar with BEiT: "BERT pre-training of image transformers." BEiT stands for Bidirectional Encoder representation from Image Transformers. Following the BERT development in natural language processing a long time ago, the authors proposed a masked image modeling task to pre-train vision transformers. And as you can see, it is still valid today: BEiT is still in use, even with a valid GitHub repository, although I have to tell you that the last update was March 2023 for that BEiT version. But you get the idea.

Now, this is where the story gets really exciting, because when the authors tested this out, they found something. They said: first, we could validate the hypothesis that we should separate the tasks and give each to its particular tool, its particular AI system. And then we found that if we improve only step one, the thinker, the reasoning stage, and do not improve stage two, the performance still gets much better: the better the reasoning process, the better the result. They tested this extensively; have a look at the paper, where you have hundreds of test and validation data sets to demonstrate all of this.

I will just give you the result. You see all the different methods, and the last four lines are the new model, RSVP. Whether you went with LLaVA 7B, Qwen2-VL 7B, Gemini Flash (the old one) or GPT-4o (also an old one), you got the best results with the latest methodology. And they said: with reasoning, so using the stage-one process, this massively outperforms everything else.
So do not use the simpler method without reasoning, because you lose performance. This also kind of proves the decoupling hypothesis: the quality of stage one, the reasoning, which comes from the reasoning capacity of the multimodal large language model, drives the quality of the segmentation. So you see the importance of reasoning. If we get reasoning right, then every other domino falls into place, no problem at all. If we have one inherent failure in the reasoning process, it will propagate through the complete visual reasoning system, and the performance is gone. Therefore it is so important to get the right reasoning model, just on the linguistic side, on the LLM side.

They also report that for the visual prompt a grid of 9 × 9 was optimal. You find all the data in the paper. You can imagine the horizontal grid and the vertical grid and how you build this.

Then they give you the exact chain-of-thought prompt applied in the experiments, where you ask the system: give me the instance, give me the IDs, the vertical IDs and the horizontal grid IDs where the object is located, and give me the complete reasoning. And the AI provides all this information for you.

Here I show you a final result. You start with an original image and you say, I don't know: "What particular object is the horse jumping over? Identify the object." So at first you have a horizontal visual prompt, and you build a vertical visual prompt for the multimodal large language model. Then you have, if you want, a zoom-in: a visualization of the region where the object in question is located. You see this at the bottom of the image. Then you apply a SAM model, a segmentation model, and you get the exact location of the object that is the content of your query. You see, as easy as can be.

Now, of course, we do have limitations.
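As an illustration of the chain-of-thought prompt described above, the following Python sketch builds a request for the instance, the grid IDs and the reasoning, and then parses a reply. The prompt wording, the JSON reply format and the key names (`instance`, `vertical_ids`, `horizontal_ids`, `reasoning`) are assumptions made for this sketch; the exact prompt used in the experiments is given in the paper and is not reproduced here.

```python
import json
from typing import Dict, List

GRID = 9  # the video reports a 9 x 9 grid as the best setting


def build_prompt(query: str, grid: int = GRID) -> str:
    """Ask for target, grid IDs and reasoning in one JSON object (hypothetical wording)."""
    return (
        f"The image is overlaid with {grid} numbered vertical strips and "
        f"{grid} numbered horizontal strips.\n"
        f"Query: {query}\n"
        "Think step by step, then answer with JSON using the keys "
        '"instance", "vertical_ids", "horizontal_ids", "reasoning".'
    )


def parse_answer(text: str, grid: int = GRID) -> Dict[str, object]:
    """Parse the JSON object between the first '{' and the last '}' and check the IDs."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    data = json.loads(text[start:end + 1])
    for key in ("vertical_ids", "horizontal_ids"):
        ids: List[int] = sorted({int(i) for i in data[key]})
        if not ids or ids[0] < 1 or ids[-1] > grid:
            raise ValueError(f"{key} must be non-empty and within 1..{grid}")
        data[key] = ids  # store the IDs deduplicated and sorted
    return data


if __name__ == "__main__":
    print(build_prompt("Which object is the horse jumping over?"))
    # A made-up reply, only to exercise the parser.
    reply = ('Answer: {"instance": "obstacle", "vertical_ids": [4, 5], '
             '"horizontal_ids": [8, 7], "reasoning": "It is under the horse."}')
    print(parse_answer(reply)["horizontal_ids"])  # [7, 8]
```

Keeping the reasoning as a separate, human-readable field is what makes the stage-one output easy to inspect when a segmentation goes wrong.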
The system depends on the quality of the multimodal large language model, especially on its reasoning performance. It all depends on the reasoning performance: the better it is, the better the complete system. And we have not yet found the optimal visual prompt design. Maybe we should apply some DSPy optimization.

But in general, I think it is a beautiful piece of engineering that the authors showed us here: solving a problem not by brute-forcing a solution, but by cleverly decomposing the complexity into two parts. You say: I have one AI system that can do part A great, and I have another AI system that can do part B great. So decomposition, or decoupling, is really key for the success. You have the visual prompting, as I just showed you, as a bridge function, and you have explicit reasoning as human-readable text, which lets you validate it yourself and immediately debug or identify any problem with the reasoning process.

And I kind of like this example. It is not complicated. It is not that you say, my goodness, it will change the world. But it is a nice prime example of a philosophy where you say: don't brute-force your way into a solution. Build an intelligent pipeline with the right connectivity, with the right AI systems, with the right tools, where specialized components collaborate and solve the given job.

I hope you enjoyed this video. If you want to see more, why not subscribe, and I will see you in the next one.
Original Description
Segment VISUAL Intelligence with a new method called RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought.
Does visual reasoning need any LLM at all? Is all the reasoning intelligence suddenly in the vision model? I will answer your questions in this new video on Visual-Language models that outperform any other AI system in complex reasoning, as of the date of recording. Smile.
All rights w/ authors:
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought
Yi Lu 1,2*, Jiawang Cao 1*, Yongliang Wu 1,3*, Bozheng Li 1,4,
Licheng Tang 1, Yangguang Ji 1, Chong Wu 5, Jay Wu 1, Wenbo Zhu 1
from
1 Opus AI Research
2 University of Toronto
3 Southeast University
4 Brown University
5 City University of Hong Kong
#reasoning
#coding #mcp
#visual
#ainews