Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

📰 ArXiv cs.AI

HSC-MAE is a dual-path teacher-student framework for unsupervised audio-visual representation learning

Advanced · Published 7 Apr 2026
Action Steps
  1. Propose a hierarchical semantic correlation-aware masked autoencoder framework
  2. Implement a dual-path teacher-student architecture to enforce semantic consistency
  3. Apply the framework to weakly paired, label-free audio-visual corpora
  4. Evaluate the framework on multimodal embedding-alignment tasks
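The dual-path teacher-student idea in steps 1–2 can be sketched as follows. This is a minimal, hypothetical illustration: the encoder, masking scheme, the three consistency levels (token, clip, distribution), and the EMA momentum are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    # one linear layer + tanh as a stand-in for a transformer encoder
    return np.tanh(x @ w)

def mask_tokens(x, ratio, rng):
    # zero out a random subset of token rows (MAE-style masking)
    keep = rng.random(x.shape[0]) > ratio
    return x * keep[:, None]

def consistency_loss(s, t):
    # mean-squared distance between student and teacher features
    return float(np.mean((s - t) ** 2))

# hypothetical dimensions for one audio (or visual) token sequence
tokens, dim, hidden = 8, 16, 4
x = rng.standard_normal((tokens, dim))

w_student = rng.standard_normal((dim, hidden)) * 0.1
w_teacher = w_student.copy()  # teacher starts as a copy of the student

# dual path: student sees the masked view, teacher sees the full view
z_student = encoder(mask_tokens(x, 0.5, rng), w_student)
z_teacher = encoder(x, w_teacher)

# three complementary levels of semantic consistency (assumed levels)
loss = (consistency_loss(z_student, z_teacher)                    # token level
        + consistency_loss(z_student.mean(0), z_teacher.mean(0))  # clip level
        + consistency_loss(z_student.std(0), z_teacher.std(0)))   # distribution level

# EMA update keeps the teacher a slow-moving average of the student
momentum = 0.99
w_teacher = momentum * w_teacher + (1 - momentum) * w_student
```

In a real implementation the student would be trained by backpropagating `loss` while the teacher receives only EMA updates, so no labels are needed on weakly paired audio-visual data.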
Who Needs to Know This

AI engineers and researchers working on multimodal representation learning can use this framework to improve the alignment of audio-visual embeddings.

Key Insight

💡 HSC-MAE enforces semantic consistency across three complementary levels of representation to improve multimodal embedding alignment

Share This
💡 HSC-MAE: a new framework for unsupervised audio-visual representation learning #AI #MultimodalLearning