VL-KnG: Persistent Spatiotemporal Knowledge Graphs from Egocentric Video for Embodied Scene Understanding

📰 ArXiv cs.AI

VL-KnG is a training-free framework that constructs spatiotemporal knowledge graphs from egocentric video to support embodied scene understanding.

Published 25 Mar 2026
Action Steps
  1. Construct spatiotemporal knowledge graphs from monocular video
  2. Bridge fine-grained scene graphs and global topological graphs without 3D reconstruction
  3. Process video sequences to extract persistent memory and explicit spatial representations
  4. Apply VL-KnG for embodied scene understanding in various applications
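The steps above can be sketched as a persistent graph that accumulates object nodes and timestamped spatial relations across frames. This is a minimal illustrative sketch only, not the paper's implementation: the class and method names (`SpatioTemporalKG`, `observe`, `ObjectNode`) and the relation format are hypothetical.

```python
# Hypothetical sketch of a spatiotemporal knowledge graph built from
# per-frame detections; names and structure are illustrative assumptions,
# not VL-KnG's actual API.
from dataclasses import dataclass


@dataclass
class ObjectNode:
    label: str
    first_seen: int  # frame index of first observation
    last_seen: int   # frame index of most recent observation


class SpatioTemporalKG:
    """Toy persistent memory: object nodes plus accumulated relation edges."""

    def __init__(self) -> None:
        self.nodes: dict[str, ObjectNode] = {}
        # Each edge is a (subject, relation, object) triple.
        self.edges: set[tuple[str, str, str]] = set()

    def observe(self, frame: int, objects: list[str],
                relations: list[tuple[str, str, str]]) -> None:
        # Persist each detected object, updating its temporal extent.
        for label in objects:
            node = self.nodes.get(label)
            if node is None:
                self.nodes[label] = ObjectNode(label, frame, frame)
            else:
                node.last_seen = frame
        # Relations accumulate rather than being overwritten, so the
        # graph retains a memory of earlier scene layouts.
        self.edges.update(relations)


kg = SpatioTemporalKG()
kg.observe(0, ["cup", "table"], [("cup", "on", "table")])
kg.observe(5, ["cup", "shelf"], [("cup", "near", "shelf")])
print(kg.nodes["cup"].last_seen)  # → 5
print(len(kg.edges))              # → 2
```

Note how the `cup` node persists across both observations while its `last_seen` index advances, which is the kind of persistent, explicit spatial memory the framework is described as providing.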
Who Needs to Know This

Computer vision engineers and researchers can use VL-KnG to improve scene understanding from video sequences, while product managers can draw on it when planning more accurate and efficient vision-language applications.

Key Insight

💡 VL-KnG gives vision-language models persistent memory and explicit spatial representations, enabling more accurate and efficient scene understanding.
