Reformer by Han Lee

Weights & Biases · Beginner ·🏭 MLOps & LLMOps ·6y ago


Key Takeaways

The Reformer, short for reversible transformer, is a memory-efficient variant of the Transformer. It uses angular locality-sensitive hashing to reduce the time and space complexity of attention from O(n^2) to O(n log n), and it uses reversible residual layers so that activations no longer need to be cached for the backward pass. The implementation discussed in the talk is written in Trax.

Full Transcript

All right everyone, my name is Han, and today I'm going to talk about the Reformer, which is short for reversible transformer. It is an optimized version of the famous Transformer from the "Attention Is All You Need" paper from 2017.

Before we dive deep into the Reformer, here is the model architecture for the Transformer. This is the base architecture used by many state-of-the-art models such as BERT, GPT-2, XLNet, etc. The Transformer follows the similar encoder/decoder architecture used for neural language models. On the left-hand side here we have the encoder. It takes positionally encoded input embeddings, passes them through a multi-head attention layer, and then goes through a feed-forward network. The output is used by the decoder part, together with the expected output from your training dataset shifted one to the right, or however many units to the right. That goes through the decoder part, and you learn the output probabilities of which words to expect. Usually the default Transformer is stacked six times, so there are six repeating modules of this, and the multi-head attention has eight heads, so it can process a lot more information in parallel.

But there are several problems with the Transformer model. The model is notoriously memory intensive and time-consuming to train. For example, when I implemented a Transformer for a deep learning course assignment, I could only use a batch size of 48 with a very small vocab size of only 10,000 words before TensorFlow threw out-of-memory errors. There are two primary reasons for the memory intensiveness: the first is the multi-head attention part, and the second is the feed-forward network part. So let's see how the Reformer authors apply computer science skills, or computer science discipline, to deep learning.

The Transformer uses scaled dot-product attention to find the best value for a query. Here, I don't think this slide is quite right, because it doesn't look like dot-product attention, but, using my non-academic interpretation: usually when people talk about attention you have two vectors, and the bigger the value of the dot product, the closer the two vectors are together. So basically you're trying to find the best value for the query that you put in, which means that dot-product attention is trying to look up the embedding that has the highest value given the input. In computer science terms, think of a huge lookup table: you are given a query, you go through everything, and you find the best value in the big lookup table. That's how it works.

Now, as with all the LeetCode or computer science interviews, after you have an algorithm you have to do a complexity analysis. So what's the complexity of the Transformer model? In the paper they outline the equation for scaled dot-product attention. This term you can sort of ignore, because that's just the dimension of K. Here you have Q of length L, and you multiply it by K, which is also of length L, so you get a big fat matrix. So your space complexity is length squared, and your time complexity is also length squared. In the paper they provide an example: a sequence length of 64K is going to produce a 64,000 by 64,000 matrix, and at 32 bits that's going to be 16 gigs of memory. Obviously that is why it is very memory intensive to train. But since this is essentially a lookup table, the Reformer authors approach this using hashing, specifically a locality-sensitive hashing mechanism, to solve this problem, or to reduce the complexity.
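The memory arithmetic in the talk can be checked with a short sketch. This is only an illustration, not code from the paper or from Trax: the function names `attention_scores` and `attention_matrix_bytes` are made up here, and "64K" is assumed to mean 65,536 tokens with 4-byte (32-bit) values.

```python
import math


def attention_scores(queries, keys):
    """Scaled dot-product scores: one row per query, one column per key."""
    d_k = len(keys[0])  # key dimension, the term the speaker says can be ignored
    return [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k) for k in keys]
        for q in queries
    ]


def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Bytes needed to hold one seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_value


# A toy sequence of length 3 already produces a 3 x 3 score matrix.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(toy, toy)
print(len(scores), "x", len(scores[0]))

# The matrix grows with the square of the sequence length.
for seq_len in (512, 4_096, 65_536):
    gib = attention_matrix_bytes(seq_len) / 2**30
    print(f"L = {seq_len:>6}: {gib:.4f} GiB per score matrix")
```

Under those assumptions the last line of the loop comes out to 16 GiB for a single score matrix, which matches the figure quoted in the talk.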
In the paper there is this fancy diagram about how they use locality-sensitive hashing with chunking and sorting to reduce the complexity of the dot-product attention. It's all in the paper, but since this is, I guess, my presentation, I will provide my own interpretation. For example, here we have four vectors, V0, V1, V2, and V3, and we have four buckets. Let's take the first two bits as their locality, or how close they are. So 11 goes here, this one goes here, 01, 10 goes here, and 00 goes here. So here you have V1 and V3, then here you have V0, and in the 00 bucket you have V2. That's how you sort, or how you hash, things into different buckets using locality-sensitive hashing.

With this you might notice right away that there's a problem: the bucket sizes are imbalanced, because you have two here, one, zero, and one here. How do you solve that? You use chunking. Basically, you chunk to make sure every bucket is the same size, so sometimes you have overflow from a previous bucket into the next one, and that helps solve the problem. Chunking may sound pretty fancy, but in reality, if you look at the code, and this is directly from the Reformer code, it is basically just a reshaping function to make sure they are the same size.

Since this is a language model, how do we compare how close the vectors are together? Previously I mentioned that when you want to compare how close two vectors are, you take a dot product; you want to find the smallest angle, the cosine similarity, between the two vectors. They have this fancy diagram, but here I'm going to draw out how I understand it. You can think of all the word vectors you have here, a lot of vectors, and basically what they do is just draw two lines: this is bucket one, this is bucket two, bucket three, and this is bucket four. This is the reason why they use angular locality-sensitive hashing: usually in deep learning we use cosine distance to measure how close vectors are together, via this angle, because you want this angle to be small so you know the vectors are closer together. And they do this multiple times so they can achieve the best separation between the vectors. That's where the random rotation goes: they do a random rotation, sample that; random rotation, sample that; and random rotation, sample that, to make sure they get the best separation between different vectors, get the best bucket size, etc. The code is actually not in the Reformer repo but in trax/layers/research. They have two versions of the locality-sensitive hashing, and here on the right there is some simple rotation code. Since everyone is staying at home, feel free to read the code and knock yourself out.

So how much does it help? This is the equation they have in the paper, but in summary it reduces the time and space complexity from O(n^2), because you have two big things multiplied together, to O(n log n), because now you are searching within much smaller, very specific buckets. That is going to shrink down the size of your search by quite a lot.

Now let's look at the second pain point of the Transformer model, which is the caching in the feed-forward network. Recall that in plain vanilla affine layers we use dynamic programming and caching techniques to store the activations for the backward pass, to make it easier to calculate. The authors used an idea that came from a paper on the reversible residual network, which was published in 2017. In that paper, the authors design a network in which activation caching is no longer required. Basically, the reversible net uses two inputs and two outputs.
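The two-input, two-output idea can be sketched in a few lines. This is a minimal illustration of the reversible residual coupling from the 2017 paper, not the Trax implementation: `f` and `g` are hypothetical stand-ins for the attention and feed-forward sublayers, and plain Python lists stand in for tensors.

```python
import math


def f(x):
    """Hypothetical stand-in for the attention sublayer."""
    return [math.tanh(v) for v in x]


def g(x):
    """Hypothetical stand-in for the feed-forward sublayer."""
    return [0.5 * v * v for v in x]


def rev_forward(x1, x2):
    """Two inputs in, two outputs out: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, so they need not be cached."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [1.0, -2.0], [0.5, 3.0]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(r1, r2)  # matches x1, x2 up to floating-point rounding
```

Because each block's inputs can be recomputed from its outputs, the backward pass can rebuild activations layer by layer rather than storing them, at the cost of running `f` and `g` again.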
Instead of just one single input and one single output, they zip the two together, and so the backward pass is going to be a lot faster and a lot simpler to calculate. Here is a short snippet from the Reformer code. You can see they duplicate the input, they swap it, and they go to the output. This part is the actual highway layers: they have a block called pre-attention, a block called attention, and a block called post-attention. Oh, I'm running out of space. You go through here, here, here, then you go here, and then you add them together at the end. This is how they rotate, and they just flip it around for the backward pass during the training phase.

Overall, their complexity analysis, before and after, is laid out in the paper like this. It looks quite complex, but the most important thing to note is that here you have a squared term, and at this point it gets reduced to the number of chunks in the locality-sensitive hashing trick that they use. So overall it goes from O(n^2) to O(n log n).

In summary, the key takeaway for this presentation is: Transformer, O(n^2); Reformer, O(n log n). It reduces the time complexity and space complexity, so you can fit a much bigger network when your memory is constrained. It's leaner, it's faster, and it supports either longer input sequences or a larger vocab size. For the Transformer, the max input token size is usually around 500, and in the paper they said something like 64,000. I'm not a hundred percent sure that is a valid comparison, so don't quote me on that, but it gives you a lot more workspace in memory. They achieve it using, first, angular locality-sensitive hashing and, second, a reversible net on the feed-forward network part, on the highway, how things get passed around.

Best of all, everything is written in Trax. It is yet another research-oriented deep learning library that doesn't read the same as TensorFlow or PyTorch. I'm not sure why they want to torture normal people with yet another research-oriented library, but that's that. Here are the references: the first is the Reformer paper, the second is "Attention Is All You Need," and the third is the reversible residual network paper. And of course there's always Jay Alammar's blog, which is awesome; it's graphical, it's interactive. Any questions?

All right, our first question asks: why would cosine distance be used instead of Euclidean distance?

Cosine distance is easy to calculate. It's really fast and easy to calculate the dot product of two matrices, and you want to get how close they are together; you want the angle between the two vectors, not the Manhattan or the Euclidean distance between them. That's usually the case for language models. One good example is when people showcase the word2vec model: king and queen is similar to man and woman, something like that, and that is measured by the cosine distance between the two word vectors. There's a lot of research built on top of that, so that's how it propagated through.

Great. Another question from the same attendee: you mentioned compression, can you elaborate on it?

Compression? I don't recall. Can you clarify that a little bit more?

While we're waiting for that, there's another question: when putting the vectors in the hash buckets, why do we only consider the first two bits?

Oh, right. This slide is not particular to how Reformers work. This slide is just to help you, or help me, understand how locality-sensitive hashing works, because I wanted to showcase how you use a certain measure to put things into different buckets. But in reality, for Reformers, every word, every token, is an embedding, a vector, and you want to compare them. That's why in Reformers they draw two lines: you have all the vectors in the space, and you have two hyperplanes that, just like Fruit Ninja, cut through and separate the space into different quadrants, and you put each quadrant into a bucket, over and over, until it's a lot more balanced across the different buckets. So it's not particularly the first two bits; that was chosen to demonstrate the idea of the method.

Okay, I'm not seeing any additional questions, so I'll just say, on the point about measuring angles: there's actually some really interesting work about how vectors in high-dimensional spaces, measured with cosine similarity, can essentially be used to do representation of semantic concepts, and in fact even to do Turing-complete computation. I'll send a quick little link to the chat that gives a bit of the background on that idea. It's older than the idea of these transformer networks and key-query networks, but it's really done quite well with these new architectures.

Yes, Charles is the expert in all this stuff. I am just someone trying to understand what people do in different conferences, trying to read the papers and explain them in my own voice.

Thank you, Han, that was really, really good. I just want to add a note: we're going to drop all of the slides that the speakers are using, because I saw your slides had a lot of really interesting links. We will drop them in the W&B Slack community. I just posted the link for that again in the chat.
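As a supplement to the hash-bucket discussion in the talk and the Q&A, here is a minimal sketch of angular bucketing with two random hyperplanes, followed by sorting and chunking. It uses the classical sign-of-projection form of locality-sensitive hashing, which matches the "two lines, four buckets" picture from the talk; it is not the random-rotation code in Trax, and the names `make_hyperplanes`, `bucket_id`, and `sort_and_chunk` are invented for this example.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; each plane cuts the space in two."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def bucket_id(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bucket = 0
    for normal in hyperplanes:
        side = sum(v * n for v, n in zip(vector, normal)) >= 0.0
        bucket = (bucket << 1) | int(side)
    return bucket


def sort_and_chunk(vectors, hyperplanes, chunk_size):
    """Sort positions by bucket, then cut into equal-size chunks (a reshape)."""
    order = sorted(range(len(vectors)), key=lambda i: bucket_id(vectors[i], hyperplanes))
    return [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]


planes = make_hyperplanes(num_planes=2, dim=4, seed=1)  # 2 planes -> 4 buckets
rng = random.Random(2)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]

print([bucket_id(v, planes) for v in vectors])  # bucket sizes are usually uneven
print(sort_and_chunk(vectors, planes, chunk_size=2))  # every chunk has 2 positions
```

Only the direction of a vector decides its bucket, so rescaling a vector by a positive factor leaves it in the same bucket. That is the sense in which the hashing is angular. Chunking after the sort gives fixed-size groups even when the buckets themselves are imbalanced, so a chunk may hold overflow from a neighbouring bucket.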

Original Description

Reformer is a REversible transFORMER that can't reverse the trend of acronyms in papers. Han Lee is a Machine Learning engineer who founded ncov19, an initiative to provide better information around COVID-19 to the public. He has previously worked at Ericsson and AMD. 👩🏼‍🚀 Weights and Biases: We’re always free for academics and open source projects. Email carey@wandb.com with any questions or feature suggestions. - Blog: https://www.wandb.com/articles - Gallery: See what you can create with W&B - https://app.wandb.ai/gallery - Continue the conversation on our slack community - http://wandb.me/fs


The Reformer is a reversible transformer that targets the two main sources of memory use in the Transformer: the attention matrix and the activations cached for the backward pass. Angular locality-sensitive hashing restricts attention to vectors in the same bucket, and reversible residual layers remove the need to cache activations. The talk walks through both ideas with reference to the Trax implementation.

Key Takeaways

Understand why standard attention costs O(n^2) in time and memory
Hash vectors into buckets with angular locality-sensitive hashing
Use sorting and chunking to keep bucket sizes balanced
Reduce attention complexity from O(n^2) to O(n log n)
Avoid caching activations by using reversible residual layers

💡 Locality-sensitive hashing lets the Reformer compare each query only against the vectors in its own bucket, which is what brings the cost of attention down and makes much longer input sequences fit in memory.
