Molmo2: Open-Source Vision-Language Models with State-of-the-Art Video Grounding

BazAI · Advanced ·🧠 Large Language Models ·6mo ago

Key Takeaways

The video presents Molmo2, an open-source vision-language model with state-of-the-art video grounding capabilities, and discusses its architecture, training, and evaluation. Molmo2 uses open data and achieves competitive performance in human preference tests, with a focus on retrieval augmented generation, fine-tuning, and multimodal learning.

Full Transcript

If you just, you know, look around you right now, [music] how much of the data that's streaming around you is visual or video. >> It's almost everything. >> It's almost everything. It's not just a part of our world anymore. It is our world. You've got social media feeds, Tik Toks, but also, you know, industrial sensors, autonomous cars, >> even the security camera on your front door. >> Exactly. It's this this absolute tsunami of visual data and trying to get real actionable meaning out of it that takes some serious computational muscle. >> And um historically that muscle has been kept under lock and key. We're talking about the most powerful vision language models, the VLMs that can watch a video and tell you exactly what's happening, right? But the problem, the big problem is that the models that consistently set the benchmarks, the ones from the huge tech corporations, they're completely proprietary, >> meaning you can't see what's inside. >> Nothing. You don't get the weights. You don't see the data they trained on. And uh crucially, you don't get the training recipe. >> And that lack of transparency, it's not just, you know, an inconvenience. It's a huge roadblock for open scientific research. It is. If you can't see the data or the method, you can't verify the results, you can't debug for biases, and you certainly can't innovate on your own. >> Precisely. And that that proprietary dominance is exactly what the Allen Institute for AI or AI2 is challenging with their new model series. It's called Mommo 2. >> Okay. So, let's really unpack this. This is a monumental effort. Our deep dive today is all about the mission behind Momo 2, which is to create a fully open weight, open data, and open code solution for video understanding >> and a state-of-the-art one at that. >> Right. And here's the real twist. They did all of this without relying on distillation. >> Yeah. And that's the key. Distillation is that practice where you train your open model using data that was actually generated by one of those secret closed proprietary models. >> So, it's not really open at all. >> It's tainted. They basically built their house from the ground up with completely open materials. >> It's a really profound commitment to truly open science. >> And they didn't just release one model. They released three different flavors of Momo 2. >> Okay, >> so you have a 4B and an 8B version. And those are built on the Quen 3 language model. But then, and this is important, they also released a 7B model that's built entirely on the OMO LLM. >> And Mo is AO2's own fully open language model. >> Exactly. So that 7B model proves that the architecture is versatile, that it can work beautifully on a foundation that is, you know, open in every single sense of the word. >> And for you listening, the most critical concept to get here isn't just that it watches a video. It's how it processes that information. The core capability that Malitude just nails is something called grounding. >> Grounding. >> So what does that mean? It means it doesn't just give you a narrative answer like a story. Right? When you ask it a question, it explicitly links your language, your query to the precise where, >> the spatial coordinates >> and the precise when, the timestamps. It can actually point to the answer in the video. >> And that ability to point, that's what separates a descriptive AI from a truly useful operational one, >> especially for things like robotics or, you know, advanced analytics. >> Absolutely. So, for the rest of our deep dive today, we're going to dissect exactly how AI2 pulled this off. We'll start by looking at the novel data sets they created. I mean, the sheer labor involved is incredible. >> It really is. >> Then we'll get into the clever architectural tricks they used for efficiency. And finally, we'll look at the results from their benchmarks and their human evaluations, which are well, they're surprising. >> And you'll see just how much harder video understanding is compared to just looking at a static image. >> So, where do we start? The skill set. >> Let's start with the capabilities. Let's shift gears and just dig into the Momo 2 skill set. what can this thing actually do that goes way beyond a basic what's happening in this clip kind of question. >> It seems like it was really built for versatility, right? Integrating that complex temporal reasoning with just standard visual skills. So, what are some of those standout functions? >> The versatility is uh staggering. Really is. It covers the full spectrum. Everything from super long, dense descriptive paragraphs, okay, >> down to very targeted short answer retrieval. So, take captioning for example. It's not just generating a generic oneliner. >> Give us an example of that that depth. >> Absolutely. The model can look at say the interior of a widebody airliner >> and then produce this caption that just meticulously details everything. The red carpeting in the aisles, the exact shade of the seat upholstery. >> Wow. >> The precise configuration of the overhead bins, even distinguishing between the two seat rows on the side and the bigger rows in the center. It's it's long form description that would rival a trained human observer. That level of detail is just im it's going beyond just identifying seats and aisle into material and spatial awareness. I remember seeing another example some about a vehicle. >> Yes, the long form analysis of the Jeep. It describes a black Jeep Cherokee XJ. So, it gets the model right, not just Jeep. It notes the characteristic boxy body style. It specifies the large aggressive tread tires. And it even details the heavyduty black steel bumper which has a visible mounted winch on it. So, it's doing fine grained recognition, identifying specialized aftermarket equipment. >> Exactly. But then, and this is a beautiful part of a VLM, it has to be able to pivot on a dime. It has to go from that almost poetic, verbose description to retrieving a single data point >> like a number. >> Exactly. It needs to handle short answer QA that requires OCR or just precise numerical retrieval from a complex scene like a financial report or a street sign. It can be asked, "How many digits are in the case number on the report page?" And the answer is just 11. Simple. Or, "What numbers on the plaque in front of the house written in gold?" And the answer is two. >> The way it seamlessly shifts between that narration and this pinpoint retrieval shows you just how robust its internal representation of the world is. >> And this is where the video element just complicates everything. Spadotemporal localization. In a video, things are moving, they're appearing, they're disappearing. The model has to track an identity across time. >> And this is where MO2 really starts to shine because of its uh its forced structural output. >> What do you mean by that? >> So take a query like how many different bison are visible in the video. A typical VLM might just spit out a number 16. >> Okay. >> MMO 2's response though it lists the key timestamps where it confirms different bison. It'll say T.5s, TA32.1s, T70.3s, and then it concludes there are 16 bison in the video. So it's showing its work. That structural output is vital because it proves the model isn't just guessing or you know looking up a number from some external data set. It's performing true temporal grounding. >> It's justifying its count, >> right? By connecting it to specific moments in that video stream. >> And that temporal understanding goes even further into behavioral analysis. They call it action sequencing. So, tracking movement, >> complex human movements. For example, describing a soccer goal celebration. Momo 2 gives you the whole sequence of events. The player wheels away from the goal area, flashes a brief tongue-out grin toward the flank, and sticks two fingers up toward the seating area. >> It's breaking the action down to discrete ordered components. >> That's so much more insightful than a basic summary like the player celebrates. >> It is. And that leads us right into its core strength, >> the precision of grounding. If you're aiming for operational usefulness, like monitoring machinery or guiding a robot, you need a model that can do more than just describe. It needs to physically point to things with accuracy. >> This grounding ability, this is their major competitive differentiator. They realize that if you structurally constrain the model to output spatial data, it forces its internal attention mechanisms to be more precise. So, let's dissect the technical side of this because the output format is well, it's pretty ingenious. It's this compressed plain text thing that looks kind of like an HTML tag, points, cords, right? Walk us through what's actually packed into that little string. >> That simple string is incredibly dense with information. >> First, it has the timestamp, which is crucial for video, and it's often down to a tenth of a second. >> Okay. >> Then, it has a unique integer object index. So, that's the sequence ID for that specific object instance. And finally, you get the normalized DUXY coordinates. They normalize everything to a 0 to 1,000 scale, which makes the location data model agnostic and really precise. >> Okay, hold on. I want to pause on that unique integer object index. If every distinct object gets a sequential ID starting at one, that sounds like it immediately helps with robust counting. >> It does, and that was one of their most significant insights from their ablation studies. when they test different components of the model, they found that forcing the model to perform counting by pointing was just vastly superior to asking it to predict a number directly. >> Why? Why is showing its work so much better than just giving the final answer? >> Well, because predicting a raw number lets the model hallucinate or just generalize based on language cues. If you ask it to count the mountain climbers and it's just predicting a number, it might pull a count from its general knowledge, right? But if the model is structurally required to output coordinates for each a climber, coordinate set one, coordinate set two, all the way up to coordinate set five, it's proving it has spatially identified and localized every single one. >> So the highest object index it lists that just is the final count >> that becomes the definitive count. It forces accountability into the model's reasoning process. >> And that structural requirement just fundamentally elevates the performance, which is what we see in the benchmarks, right? The gap between Momo 2 and its competitors on these pointing tasks is it's huge. >> It's absolutely enormous. If you look at the dedicated image pointing task, Momo 2 leads the point bench leaderboard. It outperforms models that were specifically designed for pointing like POF. >> Wow. >> But the video pointing benchmarks, that's where the rubber really meets the road. The Momo 24B model got an F1 score of 39.9. >> And just to remind everyone, F1 is a combined measure of precision and recall. >> It is. Now to put that 39.9 score in perspective, a major competitor, the Quen 3 VL8B model, which uses an even larger language model as its base, scored a tiny 1.5 F1. >> 1.5. That is that's not just a small difference. That suggests the other models either can't generate the pointing output correctly at all or their grounding is just wildly inaccurate. >> It's a different class of performance. That 39.9 shows that their training philosophy centered on this structured output and their highquality data, it paid off. >> Yeah. >> Radically better. >> It's the defining feature. >> It is. >> And these unique object IDs, they're not just for counting. They're the lynch pin for the next critical step in video analysis. Temporal tracking and reidentification. >> Correct. For a moving scene, you know, whether it's tracking individual penguins underwater or a horse crossing a stream, the model uses a similar structure, but it changes the tag to track cords. And the key is maintaining that unique object ID across multiple timestamps. >> That's the crucial part, maintaining that consistent unique ID potentially over minutes of video. If object number three is a specific penguin, object 3 has to remain that same penguin even if it swims behind a rock for a few seconds and comes out the other side. >> That sounds technically really difficult, especially if there's occlusion or if two very similar looking objects cross paths. How do you even measure the quality of that kind of tracking? >> So, they turned to a highly specialized metric. It's called the HTA score. That's high order tracking accuracy. >> Hota. >> And HA is great because it doesn't just measure whether you track something. It measures two separate but joint concepts. >> Okay. Lay those out for us. >> So, first you have detection accuracy. DA. Pretty simple. Did the model correctly spot the object in the right frame at the right time? That's the basic visual test, >> right? Second, and this is much harder, is association accuracy. Did it maintain the correct object ID consistently over time? It heavily penalizes any identity swaps between objects. >> And HOTA combines both of those. >> It jointly optimizes for both. They adapted the score for their point-based tracking by defining a match as when a predicted point falls precisely within an object's ground truth segmentation mask. This methodology is how they can prove Momo 2 is performing reliable reidentification over time. Something that models that just output a static count could never claim to do. >> Okay, that gives us a really powerful view of what Momo 2 is engineered to do. Now, let's pivot and look under the hood. If that's the capability, the next logical question is how do they even build it? Right? We need to dive into section two, the Momo 2 training recipe, the data and the efficiency innovations they use to achieve this without any of those proprietary shortcuts. >> And the foundational philosophy here is probably the most important part of the entire project. The researchers are explicit about this. Many of the current openweight models rely heavily on data generated through distillation. >> Using closed models like GPT4, LMA3 to create their instruction data. >> Exactly. And that practice means those open models are just inheriting the inaccessible biases and the capabilities of those closed systems. >> So if you're truly committed to open science, you just can't use that stuff. You have to start from scratch, which sounds like a herculean effort. >> It was. They had to create nine brand new data sets. Five were rigorously human annotated, which is incredibly expensive and slow, and four were synthetically generated. But and this is the key they used their own open video captioner Malmo 2Cap for the synthetic data generation. >> Yes, that ensured a fully open pipeline from start to finish. No external proprietary influence. >> So let's start with the human centric data because that's where the quality control really begins. They focused on highquality Q&A with a data set called Momo 2 ask model anything and they made it a point to avoid simple questions. >> Absolutely. The whole collection process was designed to elicit non-trivial reasoning based questions. They actively filtered out anything that was vague, too easy, or just simple counting. They didn't want the model wasting cycles learning things a toddler could answer. >> So, how did they structure that process to guarantee that complexity? It sounds like you'd need a pretty sophisticated feedback loop. >> They did. So first a human annotator would watch a video segment and write a complex non-trivial question something requiring temporal or fine-rained understanding. >> Okay. >> Then an LLM would provide a draft answer. But crucially human workers were then brought back in to iteratively refine both the final question and the final answer in a dialogue with the LLM. >> Ah so it's a conversation. It's an iterative refinement loop and it was essential for generating these really high quality, high relevance examples that push the boundaries of visual reasoning. It's way beyond simple object ID. >> That's fascinating. So it's human in the loop validation but applied to the question itself, not just the answer. Now shifting to the grounding task, they needed pinpoint accurate location data for training. That brings us to Momo Video Point. >> And Video Point was just pure painstaking labor and precision. Workers were required to watch videos, capture the relevant screenshots at the exact moment an action or object was salient, and then manually annotate points on every single relevant object instance. They recorded the precise video timestamp and the normalized coordinates for each one. >> I have to stop you there. Isn't manually pinpointing objects and videos just incredibly tedious and expensive and slow? How did the researchers justify that trade-off? committing to that intense human quality versus the massive scale that proprietary models get from using automated distilled data. >> That is the core economic and scientific challenge they faced. They recognize that while human annotation is slow, it provides a level of semantic fidelity and accuracy that distilled data just often lacks. >> It's more reliable >> much more. They trained on examples with up to 60 points annotated per frame. And they even included multi-turn conversations about the same video to make sure the model could handle iterative queries like find the pepper in the colander and then okay now which one of those peppers is the reddest. This dedication to fidelity even if it limited the scale was just non-negotiable for them to prove their open science thesis. >> Okay, so highquality human data sets that bust line but you still need massive scale to train a competitive VLM. Since proprietary distillation was off the table, how did they create these huge synthetic data sets without bringing in that external bias? >> This is where their clever synthetic data pipeline really kicked in. They generated four major synthetic data sets to scale up the video Q&A task. >> Let's start with Momo 2 cap QA and subtitle QA. How did they generate a million QA pairs without just getting generic lowquality output? >> The trick was in the segmentation. They use their own Momo 2 cap model, their open video caption or caption segmented video scenes rather than trying to describe a whole long sprawling clip at once. >> Ah, >> by focusing on smaller logically coherent scenes, they encouraged the model to generate these highly detailed descriptive captions at the scene level and then those detailed captions were fed to an LLM to generate the 1 million QA pairs. >> So focusing on segments keeps the model from getting lost in the temporal weeds. What about the 300,000 pairs from Subtitle QA? That sounds like it's combining two different modalities. >> It is. For subtitle QA, they used audio transcripts from Whisper 1, which gives you really highquality linguistic context. Then they prompted their LLM with both the visual content and the transcript. >> So, the questions require combining the two. >> Exactly. It generated QA pairs that required the model to synthesize information. reasoning over the visual context like what color is the boat in the linguistic context. What did the speaker say about the condition of the hull? It forces multimodal reasoning. >> And they didn't just stop at video. They also specifically tackled complex reasoning over textrich images. >> That's the MMO multi-image Hua data set. They created 188,000 synthetic examples focused on these really complex datads. Think financial documents, flowcharts, tables, diagrams. >> And the queries were complex, too. >> They required multi-step reasoning. You could have a query that asks the VLM to combine data from two different images. Something like based on the service pricing guide in image A and the customer service protocol manual in image B, what's the final approved refund amount? And does it need a manager's approval? >> That's a realworld highv value skill. It's simulating the kind of task an employee might actually face juggling several documents at once. >> Exactly. It moves way beyond simple OCR into cross-document inferential reasoning. >> All right, we've established the massive commitment to data, but training these huge models on, you know, 16,000 token sequences is incredibly resource inensive. That brings us to the engineering brilliance required for training throughput. Let's talk about packing. Why was that so necessary? >> The need for packing comes from the fact that the sequence lengths just vary drastically. A simple text response might be a few hundred tokens. A fulllength video with all its frames and subtitles and annotations can push well past 16,000 tokens. >> Right? >> If you use standard training, you have to pad all those short examples up to the maximum length. So, we're essentially paying for blank space and just burning GPU cycles for nothing. >> It's like running a full-size semi-truck just to deliver a single envelope. It's it's possible, but it's ludicrously inefficient. >> That's a perfect analogy. So, Momo 2 solved this with their dynamic packing algorithm. And this isn't some static pre-calculated thing. It's integrated right into the data loader and it works on the fly. >> How does an algorithm pull off that kind of logistical challenge in real time? >> It maintains a dynamic pool of examples in memory. When a batch is requested, it uses a dynamic programming solver, which basically a highly optimized search function to find the best combination, the optimal subset of short examples that can be merged together into one single sequence. >> And it's trying to get as close to that 16,384 token limit as possible. >> Exactly. It maximizes the total number of text tokens plus a weighted count of image crops to fit perfectly under the limit. >> That sounds incredibly complex to engineer. But what was the payoff? Why was this a gamecher for open research? >> The payoff was staggering. A 15x training efficiency boost during supervised finetuning. >> Wow. >> For an open source team where computing resources are always always the bottleneck. A 15x efficiency increase is just transformative. It meant they could average about 3.8 full examples into one 16k token sequence just maximizing the use of that GPU memory. >> So that efficiency gain essentially bought them 15 times more training runs for the same budget >> pretty much. They noted that extracting the frames from the video itself was still the most expensive part, but packing ensured that every computational step after that was fully utilized. >> Okay. So related to this packing challenge is how they structure the data internally when you have multiple annotations for a single visual input. They call these message trees. >> Yes. The message tree is a conceptual framework for managing this complex multitask data. When a video has multiple annotations, say three different QA pairs and one pointing instruction, the visual input itself, all the frames and image tokens, that's designated as the root message. >> Okay, >> each individual annotation, the QA pair, the pointing instruction becomes a separate branch connected to that root. >> And the clever part is managing attention within that structure. You don't want the different branches talking to each other, but they all have to be able to see the root. >> Precisely. They use these custom carefully engineered attention masks and these masks enforce that rule. They prevent the different annotation branches from seeing or attending to each other so the tasks stay independent. The masks explicitly allow the image and frame tokens the root to cross attend to all the branches. This improves information utilization because the model is processing multiple distinct learning signals for the same visual content at the same time without corrupting the individual tasks. >> That is meticulous engineering. Okay. Finally, let's talk about token waiting. This deals with the problem of balancing the loss function when your outputs vary so wildly in length. >> This is a crucial protection I mean you have to have it when you're combining descriptive and retrieval tasks. Imagine in the same batch you have a long video caption that generates say 4,000 tokens of output and then you have a simple multiple choice question that generates just one token. >> Right? >> If you calculate the loss normally, those long dense examples, even if they're sampled rarely, they contribute the vast majority of the loss calculation. >> So the long captions are basically shouting over the short answers during training. They're disproportionately influencing the model's weights. >> That's exactly right. The model learns to optimize for being verbose because that's what dominates the loss. And as a consequence, its performance on essential short answer tasks just degrades drastically. >> So how do they damp that shouting? What did they do to balance the influence? >> They applied two key adjustments to the loss weights. First, they used a fixed very low weight, just 0.1 for tasks that are inherently verbose like the video captions and the detailed pointing outputs. They just accepted a lower loss contribution from those tasks to protect the others. >> Okay. Second, for all the other tasks, especially short answer QA, they implemented a heristic based on the square root of the number of answer tokens, the formula was $4. >> Why the square root function specifically? >> The square root is just mathematically really well suited to managing magnitude. If your answer has a 100 tokens, the raw loss could be massive. But if you take the square root of 100, which is 10, you scale that contribution way down. The four in the $4 square coefficient helps normalize it further. So it makes sure that a longish answer doesn't overwhelm the loss, but also that a single token answer isn't just totally ignored. >> Exactly. It's a sophisticated way to find equilibrium in a loss landscape that's just dominated by token cap variance. >> That truly is a masterclass in efficiency and data management. It proves that an open approach when it's executed this brilliantly can absolutely compete. So now let's look at the results. Let's move on to section three, model performance and long context innovation. Did all this open- source effort actually pay off on the cold hard benchmarks? >> The headline is unequivocally positive. Momo 2 isn't just competitive, it's demonstrably state-of-the-art among openweight models. It robustly outperforms previous open data models, including their own earlier Momo 1 series. >> So, where did it shine the brightest on the numerical scores? >> Its general QA capabilities were a massive strength. It got state-of-the-art performance among openweight models on benchmarks like VQA, V2.0 and real world QA >> which confirms that their complex human refinement strategy for the QA data really worked. >> It did. And we also know that counting is a formidable strength which aligns perfectly with its grounding focus. It performed exceptionally well on the really challenging Pixmo Count test set. >> And the beauty of the MLMA 2 approach is that the underlying language model foundation that stays strong. >> That's right. The MMO 2, 4B, and 8B versions perform comparably to their powerful base LLM Quinn 3 on pure language tasks like math and general knowledge. In fact, they even perform slightly better on the ARC challenge, which tests abstract reasoning. >> So, the vision training didn't compromise their linguistic capabilities, >> right, which is a common fear. >> But when you're pushing the frontier, you're going to find gaps. If it's so efficient and so focused on precision, why is the model still lagging in some of these specialized tasks? Where did that performance gap show up? >> The gaps were noticeable in areas that require really complex specialized multimodal reasoning and heavy OCR. The model is a bit behind the best openweight models on benchmarks like Math Vista and MMU. >> And what are those testing? >> They demand complex mathematical problem solving that's integrated with visual data. And they also noted struggles on OCR heavy tasks, specifically DOCVQA and InfoQA. >> So what are these benchmarks testing that Momo 2 hasn't fully mastered yet? >> They're testing true inferial reasoning under specialized constraints. Mathista, for example, it requires you to read data from a graph or a chart and then perform multi-step arithmetic operations based on the data you extracted. >> I see. >> And DOCVQA is about reading complex document structures like tables with nested headers. It suggests that while their synthetic data efforts were impressive, these highly specialized domains where the text is often small or blurry or formatted nonlinearly, they still need more targeted training data >> and possibly dedicated pre-processing pipelines. >> Possibly the current efficiency gains are not yet translating perfectly to that kind of highly domain specific acuity. >> Okay, let's move beyond the raw technical scores and talk about the ultimate metric, human preference. How did Malmo 2 actually fare in direct comparison tests against the proprietary giants? >> They conducted an incredibly rigorous human evaluation. [snorts] They collected 450 unique open-ended questions based on videos and gathered over 105,000 pairwise ratings from human evaluators. >> Wow. 105,000 ratings. That's a huge volume. How do you even process that many pair wise ratings into a definitive ranking? They use the Bradley Terry model to calculate an ELO ranking >> like in chess. >> Exactly like the ranking system in competitive chess. If model A is preferred over model B, 70% of the time model A gets a higher ELO score. It's designed to give you a statistically robust measure of qualitative human preference. >> So what did that ELO ranking tell them? Where did Momo 2 land? >> Momo 28B ranked fifth overall, which places it right there in the mix with models from the major corporations. But critically, it ranked first among all the models that rely purely on open data. >> That's a huge win for them. >> It is. And for the open-ended QA cast specifically, LMA28B showed a clear advantage over the Quinn 3VL models it was competing against, even though they use similar base LLMs. >> And their pair-wise win rates, they confirm this competitiveness. >> Right. Absolutely. The Momo 28B model achieved a 53% win rate head-to-head against the Quen 3 VL8B model. And even the smaller Momo 24B model won 51% of its comparisons against its counterpart. >> So that shows that their training methodology, the structural changes and the open data philosophy, it translated into answers that humans actually preferred >> a genuine competitive edge. >> Yes. What were the qualitative insights? What did the human raers actually say? Why did they prefer Momo 2's answers? >> The feedback was really illuminating. Humans preferred Momo 2's answers in QA because the model was just smart about its verbosity. It gave detailed explanations when the question required complex reasoning, but it stayed concise when a simple direct answer was enough. >> So, it matched its verbosity to the users need better than its competitors. >> It did. However, the raiders also pinpointed a weakness. The model sometimes struggled with long captioning tasks. It would occasionally produce repetitive or even nonsensical content near the end of a really long output. >> And what's the suspected cause of that? Is that a Mala 2 problem or a foundation model problem? >> It's likely a bit of both. The foundation LLM itself, Quen 3, has a known tendency for repetition or, you know, divergence during extremely long generation sequences. >> Okay. >> And compounding that, the Momo 2 team noted that the amount of highquality captioning data they had was relatively limited compared to their QA data. So, the model just didn't have enough fine-tuning to really stabilize its long- form generation ability. Let's shift now to what might be the single biggest operational challenge in video understanding. Processing extremely long video contexts efficiently. I mean analyzing a 5-second clip is one thing. Analyzing a 5-minute recording of a construction site is exponentially harder. >> The challenge is fundamental. It's physical. It's computational constraints. To analyze a long video thoroughly, you need to sample a lot of frames. But every single frame you sample converts into a massive number of visual tokens that get fed into the language model. >> Right. And even with their dynamic packing, the system has a hard cap. You're limited to around 600 visual tokens. Exactly. >> So, how do they maximize understanding for those long videos without just causing an overload? >> They employed the sophisticated test time strategy they call slow fast encoding. The concept leverages resource allocation by varying the pooling strategy for different frames. >> Okay, break that down. Think of it like a camera operator deciding where to use a highdeinition close-up lens versus where to just use a grainy wide shot. Every frame is represented, but it travels down one of two pathways. >> The slow or the fast pathway, >> right? The slow pathway uses high spatial detail 3x3 pooling, so you get a sharp close-up. The fast pathway uses low spatial detail 9x9 pooling, which gives you a blurry grainy background check. And this choice is governed by a periodicity parameter in toque dollars. So it effectively spreads the token budget. But if you just select the high detail frames arbitrarily, you might miss the important action. The real innovation, the crucial improvement was making that selection intelligent. >> That is the absolute key. Query based selection. Instead of just periodically selecting, say every tenth frame for high detail, which is the standard approach, they use the text query itself to inform which frame should get that expensive highdetail treatment. >> That is incredibly smart. How does the query actually translate into a selection mechanism? So the model first embeds the text query and it embeds every single frame in the video. Then it calculates the cosine similarity score between the query embedding and each of the frame embeddings. >> And cosign similarity tells them how conceptually close the query is to the content of a specific frame. >> Exactly. The highest scoring most relevant frames are then dynamically selected for that highdetail slow pathway. So if the question is about the worker wearing a red helmet, the model makes sure that the three frames where that worker is clearest get the high resolution treatment and it effectively ignores the irrelevant frames in high detail. >> Precisely. It's optimizing its limited GPU memory based on semantic relevance. And the results show that the standard periodic sampling actually hurt performance on long videos. But using this querybased selection gave a significant boost to long video understanding with only minor acceptable regression on short videos. And it reduced the token count. >> It did. It efficiently closed that performance gap while reducing the total number of visual tokens needed by about 43%. It's efficiency optimized for intelligence. >> That is a perfect demonstration of how targeted architectural innovations can overcome these massive computational bottlenecks. Okay, let's synthesize everything we've covered and conclude this deep dive into Malmo 2. >> I mean, Momo 2 stands as a major scientific achievement. It's not just a competitive model. It is a proof of concept for open science. It demonstrates definitively that state-of-the-art video LMS can be built without relying on those proprietary, opaque, and potentially biased data distillation pipelines. >> And the core technical achievement is that synthesis of their whole training strategy. We saw those revolutionary efficiency optimizations like dynamic packing boosting throughput by 15x. >> A huge win for open teams. And that infrastructure supports a superior grounding capability, enabling that precise pointing and counting that other open models just can't seem to match. That massive F1 score difference is the proof. >> And yet, the AI2 report is impressively candid about the limitations that still exist. They noted that video grounding is still just fundamentally challenging. It remains less consistent than image grounding. >> Right? We noted that even Momo 2, despite its superiority, didn't get accuracy rates above 40% on key video grounding metrics. You see 70 90% accuracy on similar image tasks. >> So why is that gap between video and image still so dramatic even for a model this good? >> It really boils down to two main factors. Both are tied to just inherent complexity. First, video analysis demands processing vastly more content. It's just computationally harder. Then second, >> second, and most critically, video requires robust reidentification. >> The capacity to consistently recognize whether an object in frame A is the exact same object in frame Z minutes later. This identity tracking is a major technical hurdle. They also noted issues with having to process lowresolution frames for very long videos. And interestingly, a weakness in their tracking data itself. The point marker for a single tracked object can sometimes flicker, meaning the location of the annotated point changes slightly frame to frame even if the object hasn't really moved. >> So if the ground truth data itself flickers, the model is going to struggle to learn absolute consistency. Exactly. >> So we have worldclass language models and we have excellent visual encoders, but that consistency over time, that reidentification and robust tracking across extended contexts, that is the definitive remaining bottleneck right now. That is the challenge the next generation of researchers has to solve. >> Which leaves us with our final provocative thought for you, the learner. The Momo 2 report makes it clear that the weakest link in high performance video analysis isn't basic vision or language anymore. It's the complex, consistent, long-term reidentification and tracking of a single object across a long video bra. A weakness made worse by things like flickering point markers in the training data itself. So, if the next generation of open models can solve this deep technical challenge, if they can achieve absolute point marker consistency and object identity across frames, how quickly will that single breakthrough unlock reliable generalpurpose autonomous applications? Things like large-scale traffic monitoring, surgical robotics, or complex sports analytics tasks that Momo 2 is explicitly designed for but can't yet perform perfectly. Something to mull over as you watch these systems continue to mature.

Original Description

We present Molmo2, a new family of fully open state-of-the-art vision-language models (VLMs) developed by the Allen Institute for AI and the University of Washington. While most strong models today are proprietary, Molmo2 provides the community with open weights, open data, and open code. Key Highlights of Molmo2: • Superior Grounding: Molmo2 excels at point-driven grounding and object tracking in single images, multi-image sets, and videos. • Fully Open Data: Unlike other models, Molmo2's training data (including 9 novel datasets) was constructed from scratch without distilling from proprietary models like Gemini or GPT. • Performance: The 8B model significantly outperforms existing open-weight models and even surpasses proprietary systems like Gemini 3 Pro on specific tasks like video pointing and tracking. • Model Variants: The family includes Molmo2-4B, Molmo2-8B (based on Qwen3), and Molmo2-O-7B (built on the fully-open OLMo architecture). • New Capabilities: It handles dense video captioning, complex counting, and spatio-temporal localization with high precision. Learn more about the training recipe involving bi-directional attention, efficient sequence packing, and message-tree encoding that allows these models to reach top-tier performance while remaining transparent. Try the Demo: playground.allenai.org Github Code: github.com/allenai/molmo2
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Playlist UUOthur5d9OxdqEh08Swtirw · BazAI · 16 of 49

1 How LLM Agents Actually Do Deep Research (Planning, Tools & Citations Explained
How LLM Agents Actually Do Deep Research (Planning, Tools & Citations Explained
BazAI
2 Kafka vs RabbitMQ Explained: Which One Should You Use?
Kafka vs RabbitMQ Explained: Which One Should You Use?
BazAI
3 #NOVER Explained: How AI Learns to Judge Its Own Reasoning (No Reward Model Needed)
#NOVER Explained: How AI Learns to Judge Its Own Reasoning (No Reward Model Needed)
BazAI
4 The State of Enterprise AI 2025: How Workers Save 60 Minutes Daily & Adoption Explodes 9X
The State of Enterprise AI 2025: How Workers Save 60 Minutes Daily & Adoption Explodes 9X
BazAI
5 NVIDIA Nemotron 3: 1M Context, Hybrid MoE Architecture, and Open Source AI Agents
NVIDIA Nemotron 3: 1M Context, Hybrid MoE Architecture, and Open Source AI Agents
BazAI
6 How Service Mesh Works: Data Plane, Control Plane & Observability
How Service Mesh Works: Data Plane, Control Plane & Observability
BazAI
7 How to Design Safe Retries in Microservices (No Duplicates, No Overload)
How to Design Safe Retries in Microservices (No Duplicates, No Overload)
BazAI
8 Step-GUI: The Self-Evolving AI Agent for Android & PC (SOTA Performance!)
Step-GUI: The Self-Evolving AI Agent for Android & PC (SOTA Performance!)
BazAI
9 NVIDIA's NitroGen: The First Generalist AI Trained to Play 1,000+ Games by Watching
NVIDIA's NitroGen: The First Generalist AI Trained to Play 1,000+ Games by Watching
BazAI
10 How AI Agents Remember: The Evolution of Agentic Memory (2025 Guide)
How AI Agents Remember: The Evolution of Agentic Memory (2025 Guide)
BazAI
11 Automate Your AI Data Pipelines: Introducing DataFlow & DataFlow-Agent
Automate Your AI Data Pipelines: Introducing DataFlow & DataFlow-Agent
BazAI
12 Nemotron 3 Explained: Hybrid Mamba + MoE for 1M Token Agents
Nemotron 3 Explained: Hybrid Mamba + MoE for 1M Token Agents
BazAI
13 Build Your Own AI Voice Agent (LangChain + OpenAI + AssemblyAI + Cartesia)
Build Your Own AI Voice Agent (LangChain + OpenAI + AssemblyAI + Cartesia)
BazAI
14 Langflow 1.7 Explained: CUGA, ALTK, MCP & the Death of Prompt Engineering
Langflow 1.7 Explained: CUGA, ALTK, MCP & the Death of Prompt Engineering
BazAI
15 HuatuoGPT-o1: The First Medical AI That "Thinks" Before It Answers
HuatuoGPT-o1: The First Medical AI That "Thinks" Before It Answers
BazAI
Molmo2: Open-Source Vision-Language Models with State-of-the-Art Video Grounding
Molmo2: Open-Source Vision-Language Models with State-of-the-Art Video Grounding
BazAI
17 MAI-UI: Alibaba’s New Foundation GUI Agents Outperforming Gemini & GPT-4o
MAI-UI: Alibaba’s New Foundation GUI Agents Outperforming Gemini & GPT-4o
BazAI
18 Seamless AI Object Insertion: Bridging 4D Geometry and Diffusion Models
Seamless AI Object Insertion: Bridging 4D Geometry and Diffusion Models
BazAI
19 5 AI Agentic Workflow Patterns-Reflection, Tools, ReAct, Planning, Multi‑Agent
5 AI Agentic Workflow Patterns-Reflection, Tools, ReAct, Planning, Multi‑Agent
BazAI
20 #NVIDIA's New #SurgWorld: How AI is Learning Autonomous Surgery
#NVIDIA's New #SurgWorld: How AI is Learning Autonomous Surgery
BazAI
21 CQRS Explained in 3 Minutes: How Modern Systems Scale Reads vs Writes
CQRS Explained in 3 Minutes: How Modern Systems Scale Reads vs Writes
BazAI
22 Docker Explained in 3 Minutes: How Containers Actually Work
Docker Explained in 3 Minutes: How Containers Actually Work
BazAI
23 6 Practical AWS Lambda Patterns in 3 Minutes (Real‑World Serverless Guide)
6 Practical AWS Lambda Patterns in 3 Minutes (Real‑World Serverless Guide)
BazAI
24 Containerization Explained in 3 Minutes: From Dockerfile to Running Containers
Containerization Explained in 3 Minutes: From Dockerfile to Running Containers
BazAI
25 Science Context Protocol (SCP)- Global Web of Autonomous Scientific Agents
Science Context Protocol (SCP)- Global Web of Autonomous Scientific Agents
BazAI
26 Youtu-Agent: Scaling LLM Agent Productivity via Automated Generation and Hybrid RL
Youtu-Agent: Scaling LLM Agent Productivity via Automated Generation and Hybrid RL
BazAI
27 #DeepSeek’s #mHC Breakthrough: Stabilizing Hyper-Connections for Large-Scale LLM Training
#DeepSeek’s #mHC Breakthrough: Stabilizing Hyper-Connections for Large-Scale LLM Training
BazAI
28 Message Brokers 101 in 3 Minutes: Queues, Pub‑Sub & Competing Consumers Explained
Message Brokers 101 in 3 Minutes: Queues, Pub‑Sub & Competing Consumers Explained
BazAI
29 Must‑Know Message Broker Patterns: Outbox, CQRS, Saga & More
Must‑Know Message Broker Patterns: Outbox, CQRS, Saga & More
BazAI
30 Confucius Code Agent-Scalable Scaffolding for Large-Scale Repositories
Confucius Code Agent-Scalable Scaffolding for Large-Scale Repositories
BazAI
31 #nvidia  Just Fixed #GRPO! Meet #GDPO: The New Standard for Multi-Reward RL
#nvidia Just Fixed #GRPO! Meet #GDPO: The New Standard for Multi-Reward RL
BazAI
32 NVIDIA Alpamayo-R1: Real-Time Reasoning for Level 4 Autonomy
NVIDIA Alpamayo-R1: Real-Time Reasoning for Level 4 Autonomy
BazAI
33 The Future of AI Memory: Meet #AtomMem’s Learnable CRUD System
The Future of AI Memory: Meet #AtomMem’s Learnable CRUD System
BazAI
34 Database Sharding Explained | Range vs Hash vs Directory Sharding
Database Sharding Explained | Range vs Hash vs Directory Sharding
BazAI
35 12 Architecture Concepts Every Developer Must Know | System Design Explained
12 Architecture Concepts Every Developer Must Know | System Design Explained
BazAI
36 5 Rate Limiting Strategies Explained | Protect Your System at Scale
5 Rate Limiting Strategies Explained | Protect Your System at Scale
BazAI
37 How Live Streaming Works | System Design Explained
How Live Streaming Works | System Design Explained
BazAI
38 5 Leader Election Algorithms Explained | Distributed Systems & Databases
5 Leader Election Algorithms Explained | Distributed Systems & Databases
BazAI
39 6 Prompting Techniques to Get Better Results from ChatGPT
6 Prompting Techniques to Get Better Results from ChatGPT
BazAI
40 Complete Guide to Storage Systems: RAM, SSD, SAN, Cloud & Databases
Complete Guide to Storage Systems: RAM, SSD, SAN, Cloud & Databases
BazAI
41 Top 4 Authentication Mechanisms Explained | SSH, OAuth, SSL & Passwords
Top 4 Authentication Mechanisms Explained | SSH, OAuth, SSL & Passwords
BazAI
42 Common Network Protocols Explained | TCP, UDP, HTTP, DNS & More
Common Network Protocols Explained | TCP, UDP, HTTP, DNS & More
BazAI
43 Microservices Best Practices | 9 Rules Every Architect Must Know
Microservices Best Practices | 9 Rules Every Architect Must Know
BazAI
44 8 Network Protocols Every Engineer Must Know | HTTP, TCP, UDP & More
8 Network Protocols Every Engineer Must Know | HTTP, TCP, UDP & More
BazAI
45 Distributed Systems in 3 Minutes: CDNs, APIs, TCP & Idempotency Explained
Distributed Systems in 3 Minutes: CDNs, APIs, TCP & Idempotency Explained
BazAI
46 Must‑Know Message Broker Patterns in 3 Minutes (Outbox, CQRS, Saga & More)
Must‑Know Message Broker Patterns in 3 Minutes (Outbox, CQRS, Saga & More)
BazAI
47 Is OpenClaw Safe? The "Security Nightmare" Behind the Viral AI Agent
Is OpenClaw Safe? The "Security Nightmare" Behind the Viral AI Agent
BazAI
48 JWT vs Sessions vs PASETO — Which Authentication Should You Use?
JWT vs Sessions vs PASETO — Which Authentication Should You Use?
BazAI
49 Recursive LLMs vs Big Context Windows: Why RLM Wins
Recursive LLMs vs Big Context Windows: Why RLM Wins
BazAI

The video introduces Molmo2, an open-source vision-language model with state-of-the-art video grounding capabilities, and discusses its architecture, training, and evaluation. Viewers can learn how to build and fine-tune vision-language models, and apply them to real-world tasks.

Key Takeaways
  1. Build and train a Molmo2 model
  2. Fine-tune the model for specific tasks
  3. Evaluate the model on benchmarks
  4. Integrate the model with other AI systems
  5. Apply the model to real-world tasks
  6. Use retrieval augmented generation and multimodal learning to improve model performance
  7. Optimize model performance with dynamic packing and query-based selection
💡 Molmo2 achieves state-of-the-art video grounding with precise pointing and counting, and its open-source nature and competitive performance in human preference tests make it a valuable resource for the AI community.

Related AI Lessons

Sub-10ms AI Workflows: Accelerating sim.ai with On-Device Semantic Search using Moss
Learn how to accelerate AI workflows with on-device semantic search using Moss, achieving sub-10ms response times and improving user experience
Medium · Machine Learning
Anthropic Built a $100M Club for Its Smartest AI. You’re Probably Not In It.
Learn about Anthropic's Project Glasswing, a $100M club for its smartest AI, and understand the strategy behind it
Medium · LLM
Stop Guessing: Guaranteed Structured Output from LLMs in Node.js
Learn to guarantee structured output from LLMs in Node.js and stop parsing JSON manually
Dev.to · Hardik Mehta
Spring AI Tutorial — Your First REST Endpoint with OpenAI (2026)
Build a REST endpoint with Spring Boot 3 and OpenAI to create an LLM-powered API, leveraging the power of AI in your applications
Dev.to AI
Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)
Watch →