Do Transformers process sequences of FIXED or of VARIABLE length? | #AICoffeeBreakQuiz

AI Coffee Break with Letitia · Advanced ·🧠 Large Language Models ·4y ago

Skills: LLM Foundations 90% · Prompt Craft 60%

Key Takeaways

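The video asks whether Transformers process sequences of fixed or of variable length. Efficient implementations often pad or truncate inputs to a maximum length, but in principle Transformers can process sequences of variable length: the feed-forward layers apply the same weights at every position, and attention only computes scores between pairs of token vectors. Positional embeddings are what tell the model the order of the tokens.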

Full Transcript

Hi there! We had this question on our AI Coffee Break quiz recently, and we must discuss its answer here, because we never had such a contested question before on our quiz. If you do not know what I'm talking about, follow Ms. Coffee Bean to the community tab of this channel, because we have quite the set of machine-learning-related quiz questions, and a new question is coming each day.

But let's get to our question about the sequence length of the Transformer. It's not a simple question, not gonna lie. The answer depends a bit on how one reads it. Christian Garcia noted very well that Transformers do not process just sequences but something more general: a sequence without order, which is a set. The sequential nature of the data becomes clear to the Transformer only because we add positional embeddings to encode the order in which the data comes in, if that is the case. For more about this, check out some of our previous videos.

But supposing we are processing a sequence with a given model, for example a sequence of words: does the sequence always have to have the same length, or can this length vary? It's a valid question, because there are architectures, like convolutional architectures with a fully connected layer at the end, where the input must always have the same size. If the image is not the right size, it gets resized during pre-processing.

So we want to answer this question for the Transformer in two ways: first by looking into the code, and then theoretically. One does not need to play around a lot with a Transformer from Hugging Face to notice that inputs of different lengths are possible. So yes, the things we want to process with a Transformer can have variable lengths. The tokenizer just gives us more token IDs as output, and an attention mask full of ones, with more ones as the sequence grows longer. But because attention scales quadratically in compute, we also notice parameters in the Transformers library that limit the sequence length, which has the effect of truncating everything that exceeds that length, to keep memory and computation in bounds.

Looking at the BERT tokenizer documentation, we also see the pad token and its explanation. By padding the sequence to a maximum length, we can produce inputs of fixed length for the Transformer. Even when the sequence is shorter than the maximum length, our input always has the same size, because the sequence is padded with the pad token and the attention mask is padded with zeros. All this trouble is for efficiency reasons in the implementation, when we want to train or do inference on more than one sample at once, because in that case the tensors in a batch should have the same size so that the GPU can handle them efficiently.

Therefore, the answer to this question of whether Transformers process sequences of a predefined length or of a variable length is both yes and no. Yes, the sequence length can vary, as in: yes, the Transformer can process any length. But also no, because efficient implementations fix the length to a maximum length anyway.

But in principle, and I think this is the most important thing about Transformers, if everything with positional encodings and training works well for every sequence length, there is no reason why Transformers should be bound to a specific set size or sequence length. The feed-forward layers we see here are the same no matter the position of the token we are looking at. So even if we add one more token at the end, the feed-forward layer does not care: we take the same feed-forward module and its weights and apply them to the newly added token. The attention also does not care about the sequence length, as it just computes a pairwise score given a vector and another vector, so it looks at one pair of vectors at a time.

Therefore, Ms. Coffee Bean would rather say that Transformers process sequences of variable length in principle. Okay, bye!
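The padding and truncation described in the code part of the transcript can be sketched in plain Python. This is only an illustration of the idea, not the Hugging Face implementation: the function name `pad_and_truncate`, the constant `PAD_ID = 0`, and the token IDs in the example batch are all assumptions made up for this sketch. Real tokenizers define their own pad token ID and usually keep special end-of-sequence tokens when truncating, which this sketch does not.

```python
from typing import Dict, List

PAD_ID = 0  # assumed ID of the pad token; real tokenizers define their own


def pad_and_truncate(batch: List[List[int]], max_length: int) -> Dict[str, List[List[int]]]:
    """Bring every sequence in the batch to exactly max_length token IDs."""
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in batch:
        ids = ids[:max_length]            # truncate everything beyond the limit
        n_pad = max_length - len(ids)     # number of pad positions still needed
        input_ids.append(ids + [PAD_ID] * n_pad)
        # ones mark real tokens, zeros mark padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (the token IDs are made up)
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 2003, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])       # every row now has 5 entries
print(encoded["attention_mask"])  # zeros where padding was added
```

The short sequence is padded and the long one is cut, so both rows have the same size and can be stacked into one batch tensor for the GPU.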

Original Description

Explaining the answer to the following AI Coffee Break Quiz question: “Do Transformers process sequences of fixed or of variable length?”

❓ Find the Quiz Question here: https://www.youtube.com/post/UgkxuvOZ9Un4JaQXwwEzY0Vu6v753m1JoPkx

Read more:
🔗 https://huggingface.co/transformers/model_doc/bert.html
🔗 https://stats.stackexchange.com/questions/488844/self-attention-for-variable-length-sequence-classification

Outline:
00:00 The quiz question
01:26 The code
03:11 The theory

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏 donor, Dres. Trost GbR, Yannik Schneider

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak

🔗 Links:
AICoffeeBreakQuiz: https://www.youtube.com/c/AICoffeeBreak/community
Twitter: https://twitter.com/AICoffeeBreak
Reddit: https://www.reddit.com/r/AICoffeeBreak/
YouTube: https://www.youtube.com/AICoffeeBreak

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Music 🎵 : Seratonin from the Sun – RAGE

This video explains whether Transformers process sequences of fixed or of variable length, discussing the role of positional embeddings, feed-forward layers, and attention mechanisms. It highlights that while efficient implementations may fix the length to a maximum, in principle Transformers can handle variable lengths.

Key Takeaways

Understand the basics of Transformer architecture
Learn about positional embeddings and their role in sequence processing
Explore how feed-forward layers and attention mechanisms work in Transformers
Consider the implications of variable sequence length on Transformer efficiency and implementation

💡 Transformers can process sequences of variable length in principle, but efficient implementations often fix the length to a maximum for computational efficiency.
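The "in principle" argument can be made concrete with a minimal pure-Python sketch of one simplified Transformer block. Everything here is an assumption for illustration: the attention uses identity projections (queries, keys, and values are the token vectors themselves), the feed-forward weights `W1` and `W2` are arbitrary made-up numbers, and biases, residual connections, layer normalization, and positional embeddings are left out. The point is only that no weight shape depends on the number of tokens.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head attention with identity projections (a simplifying assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # one score per (query, key) pair: n * n scores for n tokens
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


# Feed-forward weights (made up): their shapes depend only on the vector size
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 x 2
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # 2 x 3


def feed_forward(x: Vector) -> Vector:
    hidden = [max(0.0, dot(row, x)) for row in W1]  # linear layer followed by ReLU
    return [dot(row, hidden) for row in W2]


def block(tokens: List[Vector]) -> List[Vector]:
    # the same feed-forward weights are applied to every position
    return [feed_forward(t) for t in self_attention(tokens)]


short = [[1.0, 0.0], [0.0, 1.0]]
longer = short + [[1.0, 1.0]]  # one more token at the end
print(len(block(short)), len(block(longer)))  # one output vector per input token
```

The same `W1` and `W2` handle two tokens or three without any change. Only the number of pairwise attention scores grows, quadratically with the number of tokens, which is why practical implementations cap the length.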

Chapters (3)

0:00 The quiz question

1:26 The code

3:11 The theory
