Vision Transformer (ViT)
The Vision Transformer (ViT) is a pivotal model in computer vision: it brought the Transformer architecture to the vision domain and has become a fundamental building block of many current vision models.
In this video, we take a detailed look at how ViT works.
Reference: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", available at https://arxiv.org/pdf/2010.11929.pdf
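The "16x16 words" in the paper's title refers to splitting an image into fixed-size patches that play the role of tokens. As a rough illustration only, here is a minimal pure-Python sketch of that patch-splitting step. The function name `patchify`, the nested-list image layout, and the 224 x 224 input size are assumptions made for this example, not code from the paper or the video, and the learned linear projection and position embeddings that follow this step in ViT are left out.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's channel values
            patches.append(patch)
    return patches


# Hypothetical 224 x 224 RGB image filled with zeros (illustrative input only).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

With these assumed sizes, the image becomes (224 / 16)^2 = 196 patches, each flattened to 16 * 16 * 3 = 768 values, which is the sequence a Transformer encoder would then consume after embedding.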
DeepCamp AI