PostLN, PreLN and ResiDual Transformers

Machine Learning Studio · Advanced · 🧠 Large Language Models · 2y ago

About this lesson

PostLN Transformers suffer from unbalanced gradients across layers, which can cause gradients to vanish or explode and make training unstable. A learning-rate warmup stage is the usual practical fix, but it introduces additional hyperparameters to tune, which makes Transformer training harder. This video looks at alternatives to the PostLN Transformer, including the PreLN Transformer and ResiDual, a Transformer with dual residual connections.

References:
1. Xiong et al. (2020), "On Layer Normalization in the Transformer Architecture"
2. Liu et al. (2020), "Understanding the Difficulty of Training Transformers"
3. Xie et al. (2023), "ResiDual: Transformer with Dual Residual Connections"
4. Wang et al. (2019), "Learning Deep Transformer Models for Machine Translation"
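The three residual layouts named above differ only in where layer normalization sits relative to the residual addition. The sketch below is a toy illustration in plain Python, not the code from the video or the papers: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward block, the layer norm has no learnable scale or bias, and initializing the second ResiDual stream to the input is an assumption.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    # Element-wise sum, used for the residual connection.
    return [u + v for u, v in zip(a, b)]

def toy_sublayer(x):
    # Hypothetical stand-in for attention / feed-forward (assumption).
    return [math.tanh(v) for v in x]

def post_ln(x, sublayers):
    # PostLN: normalize AFTER the residual addition.
    for f in sublayers:
        x = layer_norm(add(x, f(x)))
    return x

def pre_ln(x, sublayers):
    # PreLN: normalize the sublayer INPUT; the residual path stays un-normalized.
    for f in sublayers:
        x = add(x, f(layer_norm(x)))
    # A final layer norm on the output is a common convention (assumed here).
    return layer_norm(x)

def resi_dual(x, sublayers):
    # ResiDual-style: a PostLN stream plus an un-normalized accumulating stream.
    x_ln = x  # PostLN-style stream
    x_d = x   # dual stream; starting it at the input is an assumption
    for f in sublayers:
        out = f(x_ln)
        x_ln = layer_norm(add(x_ln, out))  # PostLN-style update
        x_d = add(x_d, out)                # PreLN-style accumulation, no norm
    # Combine the two streams once, at the end of the stack.
    return add(x_ln, layer_norm(x_d))

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    layers = [toy_sublayer] * 3
    for fn in (post_ln, pre_ln, resi_dual):
        print(fn.__name__, [round(v, 3) for v in fn(x, layers)])
```

In the PostLN loop every residual sum passes through a normalization, while the PreLN and dual streams carry an un-normalized running sum through the stack; this is the structural difference the gradient analyses in the references focus on.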

