Pixelis: Reasoning in Pixels, from Seeing to Acting

📰 ArXiv cs.AI

Pixelis is a pixel-space agent that reasons by acting directly on images and videos rather than only describing them, with the goal of more generalizable and physically grounded visual intelligence.

Level: advanced. Published 27 Mar 2026

Action Steps

Define executable operations for image and video processing, such as zoom/crop, segment, track, OCR, and temporal localization
Implement a compact set of operations that can be composed to achieve complex tasks
Train the Pixelis agent to operate directly on images and videos, learning through action rather than static description
Evaluate the performance of Pixelis on various tasks, including visual question answering and action planning
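
The first two steps can be illustrated with a minimal sketch of composable pixel-space operations. This is not the Pixelis implementation or its API: the names `crop`, `zoom`, and `compose` are hypothetical, the image is a plain list of pixel rows, and only two toy operations are shown. The idea it illustrates is that each operation maps an image to an image, so a small set of them can be chained into a larger behavior.

```python
from typing import Callable, List

Image = List[List[int]]          # grayscale image as rows of pixel intensities
Operation = Callable[[Image], Image]


def crop(top: int, left: int, height: int, width: int) -> Operation:
    """Hypothetical op: keep a height x width window starting at (top, left)."""
    def op(img: Image) -> Image:
        return [row[left:left + width] for row in img[top:top + height]]
    return op


def zoom(factor: int) -> Operation:
    """Hypothetical op: nearest-neighbor upsampling by an integer factor."""
    def op(img: Image) -> Image:
        out: Image = []
        for row in img:
            # Repeat each pixel horizontally, then repeat the widened row vertically.
            wide = [pixel for pixel in row for _ in range(factor)]
            out.extend(list(wide) for _ in range(factor))
        return out
    return op


def compose(*ops: Operation) -> Operation:
    """Chain operations left to right into a single operation."""
    def pipeline(img: Image) -> Image:
        for op in ops:
            img = op(img)
        return img
    return pipeline


# A 4x4 test image whose pixel value encodes its position (0..15).
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# "Look closer at the bottom-right corner": crop a 2x2 window, then zoom 2x.
zoom_into_corner = compose(crop(2, 2, 2, 2), zoom(2))
result = zoom_into_corner(image)

for row in result:
    print(row)
```

Because every operation shares the same image-to-image signature, an agent could in principle choose a sequence of them step by step; a real system would add operations such as segmentation, tracking, or OCR behind the same interface.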

Who Needs to Know This

Computer vision engineers and AI researchers can benefit from Pixelis: it makes visual intelligence more dynamic and interactive, and is meant to allow safer improvement under distribution shift.

Key Insight

💡 Learning through action is essential for generalizable and physically grounded visual intelligence beyond curated data