Reformer by Han Lee
Key Takeaways
The Reformer model, a reversible transformer, is optimized for memory efficiency and uses locality-sensitive hashing to reduce the complexity of the attention mechanism from O(n squared) to O(n log n). It is implemented in Trax rather than TensorFlow or PyTorch.
Full Transcript
All right everyone, my name is Han, and today I'm going to talk about Reformer, which is short for reversible transformer. It is an optimized version of the famous transformer from the "Attention Is All You Need" paper from 2017.
Before we dive deep into Reformer, here is the model architecture for the transformer. This is the base architecture used by many state-of-the-art models such as BERT, GPT-2, XLNet, etc. The transformer follows the similar encoder/decoder architecture used for neural language models. On the left-hand side here we have the encoder. It takes positionally encoded input embeddings, passes them through a multi-head attention layer, and then goes through a feed-forward network. The output is used by the decoder part, together with the expected output, your training data shifted one position to the right, and going through the decoder you learn the output probabilities of the words you would expect. Usually the default transformer is stacked six times, so there are six repeating modules of this, and the multi-head attention has eight heads, so it can process a lot more information in parallel.
But there are several problems with the transformer model. The model is notoriously memory intensive and time-consuming to train. For example, when I implemented a transformer for a deep learning class assignment, I could only use a batch size of 48 with a very small vocab size of only 10,000 words before TensorFlow would throw out-of-memory errors. There are two primary reasons for the memory intensiveness: the first is the multi-head attention part, and the second is the feed-forward network part. So let's see how the Reformer authors apply computer science discipline to deep
learning. The transformer uses scaled dot-product attention to find the best value for a query. Here I don't think this is quite right, because it doesn't look like dot-product attention to me, but using my non-academic interpretation: usually when people talk about attention you have two vectors, and the bigger the value of the dot product, the closer the two vectors are together. So basically you're trying to find the best value for the query that you put in, which means dot-product attention is trying to look up the embedding that has the highest value given the input. In computer science terms, think of a huge lookup table: you give it a query, you go through everything one by one, and you find the best value in the big lookup table. That's how it works.
Now, as with all the LeetCode or computer science interview questions, after you have an algorithm you have to do a complexity analysis. So what's the complexity of the transformer model? In the paper they outline the equation for scaled dot-product attention. This part you can sort of ignore, because it's just the dimension of K. Here you have Q of length L, and you multiply it by K, also of length L, so you get a big fat matrix. Your space complexity is n squared, and your time complexity is also n squared. In the paper they give an example: a sequence length of 64K is going to produce a 64,000 by 64,000 matrix, and in 32-bit floats that's going to be 16 gigs of memory. So obviously that is why it is very memory intensive to train. But since this is essentially a lookup table, the Reformer authors approach it using hashing, specifically a locality-sensitive hashing mechanism, to reduce the complexity.
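The 16 GB figure quoted from the paper can be checked with a little arithmetic. The snippet below is only a sketch: the helper name `attention_matrix_bytes` is made up for illustration, and it assumes one dense score matrix per sequence, stored as 32-bit (4-byte) floats, with "64K" read as 65,536 tokens.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_value


seq_len = 64 * 1024  # assumption: "64K" read as 65,536 tokens
size_gib = attention_matrix_bytes(seq_len) / 2**30
print(f"{seq_len} x {seq_len} float32 scores: {size_gib:.0f} GiB")

# Doubling the sequence length quadruples the memory, which is the
# n-squared growth described in the talk.
ratio = attention_matrix_bytes(2 * seq_len) / attention_matrix_bytes(seq_len)
print(f"memory ratio when the length doubles: {ratio:.0f}x")
```

Reading "64K" as exactly 64,000 tokens instead gives about 16.4 GB, so the figure holds either way.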
In the paper there's this fancy diagram talking about how they use locality-sensitive hashing with chunking and sorting to reduce the complexity of dot-product attention. It's all in the paper, but since this is my presentation, I will provide my own interpretation. For example, here we have four vectors, V0, V1, V2, V3, and we have four buckets. Let's take the first two bits as their locality, or how close they are. So 1 1 goes here, 0 1 goes here, 1 0 goes here, and 0 0 goes here. So here you have V1 and V3, here you have V0, and in the 0 0 bucket V2. That's how you hash things into different buckets using locality-sensitive hashing.
With this you might notice right away that there's a problem: the bucket sizes are imbalanced, because you have two here, one, zero, and one here. How do you solve that? You use chunking. Basically you chunk to make sure every bucket is the same size, so sometimes you have overflow from a previous bucket into the next one, and that helps to solve the problem. Chunking may sound pretty fancy, but in reality, if you look at the code, and this is directly from the Reformer code, it is basically just a reshaping function to make sure they are the same size.
Since this is a language model, how do we compare how close the vectors are together? Previously I mentioned that when you want to compare how close two vectors are, you take a dot product: you want to find the smallest angle, the cosine similarity, between the two vectors. They have this fancy diagram, but here I'm going to draw it out as I understand it. You can think of all the word vectors you have here, a lot of vectors, and basically what they do is just draw two lines, and this is bucket one, this is
bucket two, bucket three, and this is bucket four. This is the reason why they use angular locality-sensitive hashing: usually with embeddings we use cosine distance to measure how close vectors are, and you want this angle to be small so you know the vectors are close together. They do this multiple times so they can achieve the best separation between the vectors. That's where the random rotation goes: they do a random rotation, sample that, random rotation, sample that, and random rotation, sample that, to make sure they get the best separation across different vectors, the best bucket sizes, etc. The code is actually not in the Reformer repo but in Trax, under layers/research, and they have two versions of the locality-sensitive hashing. Here on the right is some simple rotation code. Everyone is staying at home, so feel free to read the code and knock yourself out.
So how much does it help? This is the equation they have in the paper, but in summary it reduces the time and space complexity from O(n squared), because you have two big things multiplied together, to O(n log n), because now you are searching within much smaller buckets. That's going to shrink the size of your search by quite a lot.
Now let's look at the second pain point of the transformer model, which is the caching in the feed-forward network. Recall that in plain vanilla affine layers we use dynamic programming and caching techniques to store the activations for the backward pass to make it easier to calculate. The authors used an idea from a paper called reversible residual network, published in 2017. In that paper the authors design a network in which activation caching is no longer required. Basically the reversible net uses two inputs and two outputs
instead of just one single input and one single output, and they zip them together, so the backward pass is going to be a lot faster and a lot simpler to calculate. Here is a short snippet from the Reformer code. You can see they duplicate the input, they swap it, and they go to the output. This part is the actual highway layers: they have a block called pre-attention, a block called attention, and a block called post-attention. Oh, I'm running out of space. You go through here, here, here, and then you add them together at the end. This is how they rotate, and they just flip it on the backward pass during the training phase.
Overall, their complexity analysis, before and after, is laid out in the paper like this. It looks quite complex, but the most important thing to note is that here you have a squared term, and over here it gets reduced to the number of chunks in the locality-sensitive hashing, which is the trick they use. So overall it goes from O(n squared) to O(n log n).
In summary, the key takeaway for this presentation is: transformer, O(n squared); Reformer, O(n log n). It reduces the time complexity and space complexity, so you can fit a much bigger network in a set amount of memory when your memory is constrained. It's leaner, it's faster, and it supports either longer input sequences or a larger vocab size. For the transformer the usual max input size is 512 tokens, and in the paper they said something like 64,000. I'm not a hundred percent sure that is a valid comparison, so don't quote me on that, but it gives you a lot more workspace in memory. And they achieve it using, first, angular locality-sensitive hashing, and second, reversible nets on the feed-forward network part
on the highway, how we pass things around. And best of all, everything is written in Trax. It is yet another research-oriented deep learning library that doesn't read the same as TensorFlow or PyTorch, so I'm not sure why they want to torture normal people with yet another research-oriented library, but that's that.
Here are the references: the first is the Reformer paper, the second is "Attention Is All You Need", and the third is the reversible residual network paper. And of course there's always Jay Alammar's blog, which is awesome: it's graphical, it's interactive. Any questions?
All right, so our first question is from Money. Money asks: why would cosine distance be used instead of Euclidean distance?
The cosine distance is easy to calculate. It's really fast and easy to calculate the dot product of two matrices, and you want to get the angle, how close the two vectors are together, not the Manhattan or Euclidean distance between them. That's usually the case for language models. One good example is when people showcase the simple word2vec model: king and queen are similar to man and woman, something like that, and that is measured by the cosine distance between the word vectors. There's a lot of research built on top of that, so that's how it propagated through.
Great. Another question from Money: you mentioned compression, can you elaborate on it?
Compression? I don't recall. Can you clarify that a little bit more?
While we're waiting for that, there's another question: when putting the vectors in the hash buckets, why do we only consider the first two bits?
Oh right. This slide is not particular to how Reformer does it. This slide is just to help you, or help me, understand how locality-sensitive hashing works, because I wanted to showcase how
you use a certain measure to put things into different buckets. But in reality, for Reformer, every word is an embedding, every token is a vector, and you want to compare them. That's why in Reformer they draw two lines: you have all the vectors in the space, and two hyperplanes just Fruit Ninja it and separate it into different quadrants, and you put each quadrant into a bucket, over and over, until it's a lot more balanced across the different buckets. So it's not particularly the first two bits; that was chosen to demonstrate the idea.
Okay, I'm not seeing any additional questions, so I'll just say, on the point about measuring angles, there's actually some really interesting work about how vectors in high-dimensional spaces, measured with cosine similarity, can essentially be used to represent semantic concepts and in fact even to do Turing-complete computation. I'll send a quick link to the chat that gives a little bit of background on that idea. It's older than the idea of these transformer networks and key-query networks, but it's really done quite well with these new architectures.
Yes, Charles is the expert in all this. I am just trying to understand what people do at different conferences, trying to read the papers and explain them in my own voice.
Thank you, Han, that was really, really good. I just want to add a note: we're going to drop all of the slides that the speakers are using, because I saw your slides had a lot of really interesting links. We will drop them in the W&B Slack community. I just posted the link for that again in the chat.
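The hyperplane picture from the talk and the Q&A (draw random lines through the space, use the region a vector lands in as its bucket, then sort and chunk) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Trax implementation: it uses the sign of random projections rather than the random-rotation scheme in the Reformer code, and the names `random_hyperplanes`, `bucket_id`, and `sort_and_chunk`, as well as the toy vectors, are hypothetical.

```python
import random


def random_hyperplanes(dim, n_planes, seed=0):
    """Draw n_planes random normal vectors; each one defines a hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def bucket_id(vec, planes):
    """Bucket = bit pattern recording which side of each hyperplane vec is on."""
    bucket = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


def sort_and_chunk(pairs, chunk_size):
    """Sort (name, bucket) pairs by bucket, then cut into equal-sized chunks."""
    ordered = sorted(pairs, key=lambda pair: pair[1])
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


# Two hyperplanes in 2-D give up to four buckets, like the four regions on the slide.
planes = random_hyperplanes(dim=2, n_planes=2)
vectors = {
    "v0": [1.0, 0.2],
    "v1": [2.0, 0.4],    # same direction as v0, different length
    "v2": [-1.0, -0.2],  # opposite direction to v0
    "v3": [0.1, 1.0],
}
assigned = [(name, bucket_id(vec, planes)) for name, vec in vectors.items()]
chunks = sort_and_chunk(assigned, chunk_size=2)
print(assigned)
print(chunks)
```

Because only the sign of each dot product matters, the bucket depends on a vector's direction and not its length, which is why this kind of hashing pairs naturally with cosine similarity. Attention would then be computed within each chunk rather than across the whole sequence.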
Original Description
Reformer is a REversible transFORMER that can't reverse the trend of acronyms in papers. Han Lee is a Machine Learning engineer who founded ncov19, an initiative to provide better information around COVID-19 to the public. He has previously worked at Ericsson and AMD.
👩🏼🚀Weights and Biases:
We’re always free for academics and open source projects. Email carey@wandb.com with any questions or feature suggestions.
- Blog: https://www.wandb.com/articles
- Gallery: See what you can create with W&B - https://app.wandb.ai/gallery
- Continue the conversation on our slack community - http://wandb.me/fs