CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

📰 ArXiv cs.AI

arXiv:2604.11539v1 Announce Type: cross. Abstract: Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a […]
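The abstract is cut off before the method is described, but the core idea it gestures at is computing image similarity conditioned on a user-specified criterion inside a joint vision-language embedding space. As a purely illustrative sketch (not CLAY's actual method, which is not given here), one simple way to modulate similarity by a text condition is to up-weight the component of each image embedding that lies along the condition's embedding direction before comparing them; the function names, the modulation scheme, and the `alpha` parameter below are all assumptions for illustration.

```python
import numpy as np

def normalize(v):
    # L2-normalize along the last axis
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def conditional_similarity(img_a, img_b, cond, alpha=0.5):
    """Cosine similarity between two image embeddings, modulated by a
    text-condition embedding: the component of each image embedding that
    lies along the condition direction is amplified by `alpha` before
    the embeddings are compared. Hypothetical scheme, not CLAY's.
    """
    c = normalize(cond)

    def modulate(x):
        proj = (x @ c) * c          # component along the condition direction
        return normalize(x + alpha * proj)

    a = modulate(normalize(img_a))
    b = modulate(normalize(img_b))
    return float(a @ b)

# Toy random vectors standing in for VLM (e.g. CLIP-style) embeddings.
rng = np.random.default_rng(0)
d = 64
img_a, img_b, cond = rng.normal(size=(3, d))
print(conditional_similarity(img_a, img_b, cond))
```

With `alpha=0` the function reduces to plain cosine similarity; larger `alpha` makes the score increasingly dominated by how the two images relate along the chosen condition, which is one way a single embedding space could support multiple, user-selected notions of similarity.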

Published 14 Apr 2026