SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

📰 arXiv cs.AI

SAVe is a self-supervised audio-visual deepfake detection framework that exploits visual artifacts and audio-visual misalignment to detect deepfakes.

advanced Published 27 Mar 2026

Action Steps

Learn from authentic videos without relying on curated synthetic forgeries
Exploit visual artifacts and audio-visual misalignment for deepfake detection
Train a self-supervised model to detect inconsistencies between audio and visual modalities
Evaluate the model on unseen manipulations to test its scalability and robustness
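
The steps above can be illustrated with a toy sketch. It is not SAVe's actual method, which the truncated abstract does not specify. It assumes per-frame audio and visual feature vectors are already available, builds pseudo-misaligned pairs by circularly shifting the audio of an authentic clip, scores each pair by one minus the mean cosine similarity, and summarizes detection quality with a hand-rolled ROC AUC. All function names (cosine, sync_score, shift_audio, make_pairs, roc_auc) and the synthetic features are hypothetical, and no model is trained here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sync_score(visual, audio):
    """Mean per-frame cosine similarity between two equal-length feature sequences."""
    return sum(cosine(v, a) for v, a in zip(visual, audio)) / len(visual)


def shift_audio(audio, offset):
    """Circularly shift audio frames by `offset` to fake a temporal misalignment."""
    offset %= len(audio)
    return audio[offset:] + audio[:offset]


def make_pairs(visual, audio, offsets):
    """Build (visual, audio, label) triples from one authentic clip.

    Label 0 is the original aligned pair; label 1 is a pseudo-misaligned pair.
    """
    pairs = [(visual, audio, 0)]
    pairs += [(visual, shift_audio(audio, k), 1) for k in offsets]
    return pairs


def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in for an authentic clip: audio features track the visual
# features up to small noise (an assumption made purely for illustration).
rng = random.Random(0)
visual = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
audio = [[x + 0.1 * rng.gauss(0, 1) for x in frame] for frame in visual]

pairs = make_pairs(visual, audio, offsets=[3, 7])
scores = [1.0 - sync_score(v, a) for v, a, _ in pairs]  # higher = more suspicious
labels = [y for _, _, y in pairs]
auc = roc_auc(scores, labels)
```

In a real pipeline the features would come from learned audio and visual encoders, and the evaluation set would contain manipulations the detector never saw, rather than the same shifted clips used to build the pairs.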

Who Needs to Know This

AI engineers and researchers working on deepfake detection and multimodal analysis. SAVe targets subtle visual artifacts and cross-modal inconsistencies while learning only from authentic videos, which the authors motivate as a way to avoid the dataset and generator bias of detectors trained on curated synthetic forgeries.

Key Insight

💡 Self-supervised learning on authentic videos can reduce a deepfake detector's dependence on curated synthetic forgeries, which the abstract identifies as a source of dataset and generator bias that limits scalability and robustness to unseen manipulations.

Full Article

Title: SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Abstract:
arXiv:2603.25140v1 (Announce Type: cross). Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos…
