Finding Distributed Object-Centric Properties in Self-Supervised Transformers

📰 ArXiv cs.AI

Researchers investigate how self-supervised Vision Transformers can discover object-centric properties without relying on image-level objectives

advanced Published 30 Mar 2026

Action Steps

Analyze the limitations of final-layer [CLS] token attention maps for object localization
Investigate alternative ways to recover object-centric information
Evaluate how well self-supervised Vision Transformers discover distributed object-centric properties

Who Needs to Know This

Computer vision engineers and researchers working on self-supervised learning and Vision Transformers, especially those who rely on attention maps for object detection and localization.

Key Insight

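💡 In self-supervised Vision Transformers such as DINO, the final-layer [CLS] attention maps often localize objects poorly, because the [CLS] token is trained on an image-level objective and summarizes the whole image. The paper looks for distributed object-centric properties instead.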

Full Article

Title: Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Abstract:
arXiv:2603.26127v1 (announce type: cross)

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations, resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information…
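
To make the term "[CLS] token attention map" concrete, here is a minimal sketch of how such a map is read from a single attention head. It is not the paper's method or DINO's code. The function name `cls_attention_map`, the toy vectors, and the 2x2 patch grid are assumptions chosen for illustration, and it uses standard scaled dot-product attention with the [CLS] token at index 0.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention_map(cls_query, keys, grid_h, grid_w):
    """Return the [CLS]-to-patch attention of one head as a grid.

    cls_query: query vector of the [CLS] token (hypothetical toy input).
    keys: key vectors for all tokens; keys[0] belongs to [CLS], the rest
          are patch tokens in row-major order.
    """
    if len(keys) != 1 + grid_h * grid_w:
        raise ValueError("expected one [CLS] key plus grid_h * grid_w patch keys")
    scale = 1.0 / math.sqrt(len(cls_query))
    # Scaled dot product between the [CLS] query and every token's key.
    scores = [scale * sum(q * k for q, k in zip(cls_query, key)) for key in keys]
    # Softmax runs over all tokens, so [CLS] also attends to itself.
    attention = softmax(scores)
    # Drop the [CLS]-to-[CLS] entry and fold the patch weights into a grid.
    patch_attention = attention[1:]
    return [patch_attention[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: one [CLS] token plus a 2x2 grid of patches, 2-dimensional vectors.
toy_query = [1.0, 0.0]
toy_keys = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
grid = cls_attention_map(toy_query, toy_keys, grid_h=2, grid_w=2)
for row in grid:
    print([round(weight, 3) for weight in row])
```

Because the softmax spreads a fixed budget of attention over every token, a [CLS] query that has to summarize the whole image assigns weight to background patches as well as object patches, which is the kind of spurious activation the abstract describes.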
