Self-attention from first principles

📰 Reddit r/deeplearning

Hey everyone, I am revisiting the transformer architecture (mostly vision transformers and their variants) from first principles, and I've started writing about them. The first post (link above) covers what self-attention is and how one can construct it. There is a good amount of math and no hand-wavy explanations, and it is certainly not "learn self-attention in 60 seconds" material. In fact, I do not mention the word transformers till the very en

Published 18 Jun 2026