Self-Attention with Relative Position Representations – Paper explained

AI Coffee Break with Letitia · Beginner · 📄 Research Papers Explained · 4y ago

Key Takeaways

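This video explains relative position representations for self-attention, as introduced in the paper "Self-Attention with Relative Position Representations" by Shaw, Uszkoreit, and Vaswani (2018). Instead of one absolute positional embedding per token, each pair of tokens gets a learned embedding that depends only on their (clipped) distance, and these embeddings are added to the keys and values inside self-attention.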

Full Transcript

Hello there! It's finally here: the long-promised video explaining relative positional embeddings as they were introduced in prehistoric times by our ancestors in machine learning. So if you're not yet sick of positional embeddings in Transformers and want more of them, you are spending your coffee break with the right video. Because, well, we did a video about positional embeddings as introduced by the "Attention Is All You Need" paper, then we had a video about concatenating versus adding positional encodings, and there we also discussed learned positional embeddings. Ms. Coffee Bean is now literally 50% caffeine and 50% positional information. Let's begin.

The "Attention Is All You Need" paper introduces positional embeddings that encode the absolute position. They encode the exact order of tokens, such that the Transformer is informed about the sequential nature of data. Otherwise the Transformer, which is processing everything in parallel, would be invariant to order, meaning that without positional encodings its output would not change after complete sequence reordering.

But what if we are dealing with other kinds of data, where it is not about absolute position but about relative positions? Think about a graph, for example. It would be quite arbitrary to say that a certain node is the first one. What would make this node the first node and this one the second? That would be unmotivated for many problems, and in some cases even misleading. Then let's forget about absolute order and move to relative positional encodings, which are about the distances between elements in either a graph or a sequence, which is in fact a degenerate graph: a chain.

A subset of the authors of the "Attention Is All You Need" paper also have a follow-up paper doing exactly this. They introduce relative positional encodings, where not the order but the relation, or better said the relative position, the distance between tokens, is important.

So how does this work? If you do not know anything about positional embeddings, first go and watch our two videos about them and then come back. Don't worry, they're short.

The idea of relative embeddings is moving away from the classical positional embedding, where each token has its own positional embedding. With relative representations, each word or token does not have only one positional embedding, but as many positional embeddings as there are tokens in the sequence, in order to describe the relationship between them. This is because relative positional representations do not encode absolute order anymore, but the positional relationship in which each token stands to the other tokens.

So let's take an example sequence of five tokens. In classical positional embeddings, each token, let's say x4 here, has one encoding informing the Transformer about the position of x4. Now, to best visualize relative representations, let's copy the sequence again, like this. In the relative variant, each token has five positional embeddings: one embedding for describing its relative position to itself, then four others for the rest of the sequence.

To keep the same notation as the original paper, the positional embedding describing the relationship of one token with itself is a vector w0, as 0 is the distance between the token and itself. Then w1 and w2 as we move to the right, and w-1, w-2 and so on as we move to the left. And the vector wi takes the same value independently of which tokens we are currently looking at, because w0, wherever we are, says that we are zero hops away in the graph. w1 says that we are one hop away to the right in a sequence or, in a directed graph in general, that we go one hop following outgoing edges. So you see, this approach is not made only for sequences, but for graphs too.

Okay, great. For a sequence of length five, positional embeddings should range from w-4 to w4, so we have in total nine positional embeddings to either handcraft or learn for this sequence. These w vectors can be written one under the other in a table, like this. Notice that these vectors come from the relationship or distance between one token and the other, so we can use a pairwise notation for this. The authors also experiment with clipping at k, meaning that beyond a certain distance, positional embeddings get the same value. So for k = 2, w3 and w4 both take the value of w2, and likewise w-3 and w-4 take the value of w-2.

Okay, but now for each token we are stuck with as many positional embeddings as the sequence has tokens. What to do with all of these? Add them all up? That wouldn't be a great idea. Addition is complicating things even with sinusoidal embeddings, where we have just one positional encoding for each token, because we could mix up the semantic and positional information. With five positional embeddings per token, the mixing-up problem would be only five times larger.

Hmm, if only we had a mechanism that, given a token, computes new representations for that token in relationship to all other tokens. Well, we do have this: it's the self-attention mechanism. A clever idea would be to modify the self-attention formulas to capture the relative positional representations too.

We remember that through self-attention, each token gets a new representation z, which is passed further in the Transformer module. Each token vector is first transformed by a linear transformation. Then the new representation for each token is a weighted sum over all tokens, where the weights are sort of an importance score, so fellow tokens that matter more are weighted more.

So here, where the new representations are computed, we can add the positional information. Now the token representation after the linear transformation is further shifted in the high-dimensional space. This means that each token xj gets pushed into a subspace saying: "Look, I carry this semantic information, but I'm also your second neighbor to your right, because my position in some dimension is similar to all other second neighbors."

Okay, great. Now the new representation z is informed about the relative position, but the self-attention weight coefficients are not. Therefore the weight coefficients also receive their own positional push, such that the importance scores make position-informed decisions too.

So what do we have in the end? Now every token has as many relative embeddings as there are tokens in the sequence, in two variants: one relative embedding for the values and one for the keys, to inform the attention weights. And the exact values of the vectors are learned, which kind of makes sense: let the model figure out for itself what the best balance is. It would be quite difficult to handcraft all of these.

And what did this whole thing help with? Well, this paper has only experimented with text, which is a sequence, obviously, and they gain some performance in machine translation. But word chains are only a flat graph, and beyond this, the method can be applied to graph representations in general, or anywhere where you have pairwise relationships between your elements. Because remember, these relative representations depend only on how far tokens are from one another, independently of whether we are in a sequence or in a more complex graph.

"But Ms. Coffee Bean, why didn't this paper try this out immediately?" Well, it's called marking your territory. But there are other papers that implemented this successfully for graphs too.

Anyway, another upside of relative positional representations is that, especially with clipping at k, this learned positional information generalizes to any sequence length, which is also a feature of the handcrafted sinusoidal embeddings, but not necessarily of any learned positional embedding.

Clipping at k seems a little odd, because with small k it takes away the information of long-term dependencies. Now we can only store that something is close by or far away, but we don't know how far away that is. But this is similar to graph neural networks, where the best results are delivered with a small number of iterations. Many iterations allow for information from further away to diffuse to your point of interest, but multiple hops are not always beneficial, as they can aggregate noisy signals from far away. So clipping at k in relative positional representations acts a little like a relative-position event horizon, which does not hurt tasks that mostly rely on local dependencies.

Well, we hope that these relative embeddings and the associated paper are now more digestible to you. Thanks for staying until the end of this explanation, and if you're still here, do not forget to leave a like and subscribe, you know, to help with the YouTube algorithm. Okay, bye! [Music]
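
The mechanism described in the transcript can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head, no batching, and random numbers standing in for the learned vectors. All function and variable names (`clip`, `relative_index_table`, `relative_self_attention`, `rel_k`, `rel_v`) are made up for this example. It builds the table of clipped distances for a five-token sequence with k = 2, then adds one set of relative vectors to the keys (so the attention weights see position) and another set to the values (so the new representation z sees position).

```python
import math
import random

def clip(distance, k):
    """Clip a signed distance to the range [-k, k]."""
    return max(-k, min(k, distance))

def relative_index_table(n, k):
    """Entry [i][j] is the clipped signed distance j - i from token i to token j."""
    return [[clip(j - i, k) for j in range(n)] for i in range(n)]

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[a] * W[a][b] for a in range(len(x))) for b in range(len(W[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relative_self_attention(X, Wq, Wk, Wv, rel_k, rel_v, k):
    """Single-head self-attention with relative position vectors.

    rel_k and rel_v map each clipped distance in [-k, k] to a vector of
    size d (hypothetical stand-ins for the learned vectors).
    Returns the new representations Z and the attention weights A.
    """
    n, d = len(X), len(Wq[0])
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    idx = relative_index_table(n, k)
    Z, A = [], []
    for i in range(n):
        # Keys receive a positional push before the scores are computed.
        scores = [dot(Q[i], add(K[j], rel_k[idx[i][j]])) / math.sqrt(d)
                  for j in range(n)]
        alpha = softmax(scores)
        # Values receive their own positional push before the weighted sum.
        z = [0.0] * d
        for j in range(n):
            shifted_value = add(V[j], rel_v[idx[i][j]])
            z = [zc + alpha[j] * vc for zc, vc in zip(z, shifted_value)]
        Z.append(z)
        A.append(alpha)
    return Z, A

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d_model, d, k = 5, 4, 3, 2  # illustrative sizes, not from the paper

table = relative_index_table(n, k)
for row in table:
    print(row)
# [0, 1, 2, 2, 2]
# [-1, 0, 1, 2, 2]
# [-2, -1, 0, 1, 2]
# [-2, -2, -1, 0, 1]
# [-2, -2, -2, -1, 0]

X = rand_matrix(n, d_model)
Wq, Wk, Wv = (rand_matrix(d_model, d) for _ in range(3))
# Only 2k + 1 vectors per variant, regardless of sequence length.
rel_k = {r: [random.gauss(0.0, 1.0) for _ in range(d)] for r in range(-k, k + 1)}
rel_v = {r: [random.gauss(0.0, 1.0) for _ in range(d)] for r in range(-k, k + 1)}

Z, A = relative_self_attention(X, Wq, Wk, Wv, rel_k, rel_v, k)
```

Because `rel_k` and `rel_v` are indexed only by clipped distance, the same 2k + 1 vectors can be reused for a sequence of any length, which is the generalization property mentioned in the transcript. In a real model these vectors would be trained together with the projection matrices.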

Original Description

We help you wrap your head around relative positional embeddings as they were first introduced in the “Self-Attention with Relative Position Representations” paper.

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

Related videos:
📺 Positional embeddings explained: https://youtu.be/1biZfFLPRSY
📺 Concatenated, learned positional encodings: https://youtu.be/M2ToEXF6Olw
📺 Transformer explained: https://youtu.be/FWFA4DGuzSc

Papers:
📄 Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. "Self-Attention with Relative Position Representations." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464-468. 2018. https://arxiv.org/pdf/1803.02155.pdf
📄 Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in neural information processing systems, pp. 5998-6008. 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

💻 Implementation for Relative Position Embeddings: https://github.com/AliHaiderAhmad001/Self-Attention-with-Relative-Position-Representations

Outline:
00:00 Relative positional representations
02:15 How do they work?
07:59 Benefits of relative vs. absolute positional encodings

Music 🎵 : Holi Day Riddim - Konrad OldMoney

✍️ Arabic Subtitles by Ali Haidar Ahmad https://www.linkedin.com/in/ali-ahmad-0706a51bb/

🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak

🔗 Links:
AICoffeeBreakQuiz: https://www.youtube.com/c/AICoffeeBreak/community
Twitter: https://twitter.com/AICoffeeBreak
Reddit: https://www.reddit.com/r/AICoffeeBreak/
YouTube: https://www.youtube.com/AICoffeeBreak

#AICoffeeBreak #MsCoffee
