MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

Aleksa Gordić - The AI Epiphany · Beginner ·📄 Research Papers Explained ·5y ago

Key Takeaways

The video explains the MLP-Mixer paper, which introduces an all-MLP architecture for vision tasks. It achieves results comparable to CNNs and transformers without setting a new state of the art; in the reported results its throughput is somewhat better than the Vision Transformer's, while its compute cost is in some cases higher. The architecture alternates token-mixing and channel-mixing MLPs, and is pre-trained on large datasets such as ImageNet and JFT-300M.

Full Transcript

What's up! In this video I'm covering "MLP-Mixer: An all-MLP Architecture for Vision", a newly published paper by the Google Brain team. I'd give it an alternative title, "MLP is all you need", because, as you'll see, we went full circle. We used to use MLPs, the multi-layer perceptrons, to solve vision tasks. Then in 2012 we had the ImageNet moment with AlexNet, which showed that it's much wiser to just use big convolutional neural networks with a lot of data and compute. Then we started using transformers: we started with the Vaswani transformer in 2017, and last year we had the Vision Transformer, which showed that with lots of data (it used JFT-300M, Google's proprietary dataset with 300 million data points) it can learn really nice representations and even achieve better results than CNNs.

So finally, back to 2021: this paper came out and showed that by using these simple multi-layer perceptrons in, arguably, a clever way, you can achieve similar results, and compute-wise you can be even a bit better, and throughput-wise as well. So let's see.

The main thing is that this paper doesn't try to achieve a new SOTA. It just shows that we may need to investigate some other research paths which may lead us even further, and this paper is a good step in that direction. They showed that with simplicity they can still achieve great results, so who knows what this may evolve into a couple of papers down the line.

Having said that, let's dig into the paper. First things first, they say here: Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar non-linearities, such as GELUs, as we'll soon see. That part reminds me of "The Bitter Lesson", a blog post by Richard Sutton that you should check out,
which (I'll link it in the description) basically says that the models that stand the test of time are those which are simple and can leverage the computation available at the time. This paper is super simple: they even included a code snippet implementing it on the last page, which kind of tells you all about it. It's basically a 20-liner, and that's awesome. It's simple and it achieves nice results, and that's the part I like about this paper.

Architecture-wise, they just have these channel-mixing MLPs and token-mixing MLPs. The token-mixing MLPs basically attend across the spatial extent of the image, that is, across the tokens (the patches), while the channel-mixing ones focus on a single patch and process its channels. Hence the name Mixer: the model alternates between processing channels and processing the spatial extent, i.e. the tokens or patches.

Here is the architecture. It's fairly simple and resembles the Vision Transformer a lot; if you haven't seen it, I covered the Vision Transformer in a previous video, which I'll link somewhere here. You take the input image and split it into patches, as you can see here. You flatten them out, unrolling each of these matrices into a vector, and you use a single fully connected layer, shared across all of the patches, to project them into new latent representation vectors. Then they stack N of these Mixer layers, which we'll look at soon, and they end with global average pooling and a linear classifier on top. So it's a super simple architecture. The Mixer layer contains the part that mixes tokens, this part here, and this is the channel-mixing part, this
part here. Before digging into those, let's see what the MLP is: it just consists of two fully connected layers with a GELU non-linearity in between. Super simple.

If we ignore the transpositions, what they semantically do is take a column here and apply the first MLP, MLP1, to it, projecting that column into some other subspace. Then they take the same MLP and apply it to the second column, projecting that one as well, and so on, and you arrive at a new representation that has the same dimensions: it's again S and C here, as it was before. So they preserve the dimensions, and that's a legacy of the transformer architecture. If you look at CNNs, on the other hand, you usually have a pyramid-like structure where the spatial extent goes down while the number of channels increases, usually twofold. That was at least a heuristic people used to use; it doesn't mean it's the best one. Here we're preserving dimensions.

The second thing they do, again ignoring layer normalization and skip connections, is apply this MLP2 network row-wise to this matrix to get a new representation. They repeat this N times, and that's it. It's as simple as that.

Skip connections were introduced by the ResNet paper, and the transformer used them as well, along with layer norm: the original Vaswani transformer used layer norm. So I guess a bunch of those are just legacy artifacts. What's new here is that they are not using CNNs, they're not using convolutional kernels, and they're not using attention; they're just using MLPs. They took some of the legacy from the prior art, and they showed, as you will soon see, that they achieve comparable results. And that's it. Here: despite
its simplicity, Mixer attains competitive results; when pre-trained on large datasets (so, 300 million images, actually) it reaches near state-of-the-art performance, previously claimed by CNNs and transformers, in terms of the accuracy-cost trade-off.

OK, maybe a small rant here. It's unclear from the paper why they're using GELUs. It seems we keep using these exotic activation functions without having a good reason why. In their defense, on one of the last pages they do list the things they tried that didn't work, which is really great. But I'd like to know why they use GELUs and why not ReLU; I think those kinds of insights would be really useful. What probably happened is that they tried a couple of these and just took the best one, a simple search over activation functions. But more probably they just took it because some other people used it, and that's it.

The second thing is that we keep on using these normalizations, again without any particular reason, at least as far as I can see. Why are we using layer norm and not batch norm, instance normalization, or group norm? There is too much tradition and legacy, taking something that worked, and too little theoretical explanation of why we're actually using what we're using. That's why we get papers like this, where after years of developing CNNs and transformers we show that, hey, we can just use MLPs and it's going to work without overcomplicating things. So yeah, a small rant. Rant over.

Let's continue with the paper. I'll skip some of the details for now; I just want to give you the bigger picture of the model. First things first: similar to the Vision Transformer, they create a family of models. They have the small Mixers, where 32 basically means the patches are 32 by 32 pixels, so you'll have
fewer of those patches because each patch is bigger, and then they have 16 here. Then there are the base models, the large models, and finally the huge Mixer, with a bunch of different parameters. Basically, as you go towards the huge model you increase everything, and that's it. Again, I'm skipping the details and focusing on the results.

As you can see, and this is the main takeaway, when it's pre-trained on smaller datasets it's not going to perform as well as the more common CNNs and transformers, such as BiT (the Big Transfer model) and ViT (the Vision Transformer). You can see it's lagging a little bit behind BiT and ViT here. I highlighted this part, though, because it's actually a bit better, a lot better, on VTAB, which I think is actually more complex than the other benchmarks. That's kind of surprising; I'd like to know what this number is all about. You can also see that the throughput is a bit better than the Vision Transformer's, while the compute is a bit higher, so it's hard to compare them. It's definitely not a new state of the art; it's just comparable. Performance-wise it's there and throughput is a bit better, but compute-wise it takes even more compute than the Vision Transformer, which is interesting.

Going to JFT-300M, you can see that now we have even better performance than the Big Transfer model, and it's comparable to the Vision Transformer. NFNets are decently better: Mixer lags behind NFNets in terms of both performance and throughput, although NFNet takes a bit more compute. I'd like to see a comparison with EfficientNetV2, which was recently published and which showed that it's much better than NFNets. I wonder why they omitted that; probably because they were already late into the paper writing. But yeah, I'd like to see the comparison with EfficientNetV2. So
those were the tabular results. Now let's see the charts; these are really interesting. We have the chart that shows the compute-accuracy trade-off, and you can see that Mixer is directly on the Pareto frontier. "Pareto frontier" is just a fancy term which basically means that if you're at this point, you cannot improve one metric of interest without worsening the other. So you can't do this, oops, because that would mean keeping the accuracy constant while decreasing the compute; likewise, you can't keep the compute fixed and just increase the accuracy. You have to move along these lines: if you want to increase the accuracy you have to increase the compute, and if you want to decrease the compute you have to decrease the accuracy. That's just a trade-off, and a really reassuring fact is that Mixer lies directly on that frontier, alongside NFNets and the Vision Transformer. So those are some nice results.

The second chart shows what happens as we scale up the pre-training data, going from 10 million data points towards the full dataset, JFT-300M. The full lines are the Mixer models; the other ones are Big Transfer and the Vision Transformer. What's interesting is that the smaller Mixer models plateau really quickly and achieve results maybe a bit inferior to those baselines, but when we go to bigger Mixer models and large-data regimes, Mixer finally achieves even better performance compared to, I think this is, yeah, the Vision Transformer. The encouraging thing is that the derivative here is positive, which means that if we extrapolate, hopefully it's going to go towards AGI. That's the promise of this model, obviously. Jokes aside, it does seem to have a steeper slope here compared to the Vision Transformer,
which is encouraging, because it means we can push this even further. So two papers down the line we'll be seeing even bigger datasets and better performance.

I haven't mentioned this part, and it's just an implementation detail: the reason they use this linear 5-shot ImageNet top-1 metric instead of plain ImageNet top-1 is that it was really compute-intensive, even for Google Brain, to fine-tune all of these models. So, as a proxy, they just froze the pre-trained weights, trained a small linear classifier on top of the resulting image representations, and used that as a proxy for fine-tuning. But that's just an implementation detail. As I said, the main takeaway here is that MLPs seem to have even greater potential than Vision Transformers, and that's exciting.

Two more curves here, again compute versus accuracy. You can see Mixer is pretty decent on this part of the spectrum: when we have a bunch of compute, it's pretty decent and comparable to the Vision Transformer. On the lower part of the compute spectrum it seems to lag a little bit behind the Vision Transformer, but it kind of converges here. We saw similar behavior with the Vision Transformer compared to CNNs: only in the big-data regimes do we get really high performance out of it. Similarly here, for throughput versus accuracy, it's on the frontier. They also have some additional tabular data, but the main takeaways were pretty much all in the charts I already described.

Regarding compute, they say here that we may scale the model in two independent ways: increasing the model size (the number of layers, blah blah blah) during pre-training, and, as the second dimension, increasing the input image resolution when fine-tuning. Those are the two things they had in their toolkit. And as I said, it appears that Mixer models benefit from a growing pre-training dataset size even more than the Vision Transformer. Nice.

Finally, some nice
visualizations here, and that's basically the end of the paper. First of all, let me explain what these patches represent. You have the input image, let's draw it like this, and it's split into patches. I'll draw four by four, but it's usually many more, 14 by 14 or something like that. So imagine we have 14 by 14 of these patches, and we flatten the image out like this, so we have S here and C here; that was the usual representation we used at the beginning of the paper. So S will be 14 times 14, and the number of channels doesn't matter for now. This thing here is the set of weights that attend to all of the elements in this column, and if you remember, those are the tokens. So a single pixel in this visualization corresponds to the weight on a single patch of the image.

So if you have something like this, let me zoom in a bit, that means that this particular fully connected unit is going to attend a lot to this part. Let's assume red is a positive number: it's going to attend to this part of the image with positive weights, and to this part of the image with negative weights, hence the blue. Similarly here, the blue part means we attend to this part with negative weights and to this part with positive weights.

Another interesting detail is that there is a lot of symmetry here, and they've intentionally arranged all of these like that. Also, if you look along the y-axis, the frequency kind of goes up: these are the low-frequency components, this is like the DC component, and then we have the high-frequency components here. By the way, this is the first token-mixing layer, this is the second token-mixing layer, and this is the third. As we go deeper into the network, we can see that the
patterns become much more complex, higher-frequency ones than here. The reason they did this is that when we analyze CNNs we can usually notice certain patterns. They mention Gabor filters, which you get by combining Gaussians and sinusoids, and we see similar structure in the lower layers. But when we go deeper into the network, we get something that we still can't discern. It'd be nice to analyze this and do some mathematical approximations to understand how we can model these, and maybe take some intuition from that and devise better models once we know what this thing is modeling. But I guess that's future work for these authors.

Now let me go ahead and explain a couple of details which may be important to you. I mentioned that they are sharing weights: all patches are linearly projected with the same projection matrix, which helps reduce the memory footprint, so that's important. They are also sharing the MLPs across the columns and across the rows; that's the token-mixing part and the channel-mixing part.

Let me decrypt these equations for you. We take the input matrix, which has dimensions S times C, if you remember, and apply layer normalization. This weight matrix represents the first fully connected layer of the MLP, so that's this one here in the network. Then they have a non-linearity, which is GELU, so that's this one. Then they have the second weight matrix, which represents the second fully connected layer, so that's this one here. And finally we add the input, which is the skip connection. Then they just repeat this and apply the second equation. So this one goes across channels, which means it's a token mixer, and this part goes across, well, let me
draw the matrix here: this is S, this is C. This one is applied to each channel, that is, to each column, so it attends to a bunch of different tokens; it's a token mixer. And here we iterate over S, that is, over each row, so we're doing channel-wise mixing. Those are the two equations I wanted to explain.

They mention that tying the parameters of the channel-mixing MLP, so keeping the same network here and here and here, makes sense because it enforces positional invariance: whatever you learn here, you want a pattern that generalizes to all of the other spatial locations. But you usually don't take the MLP and constrain it to be the same across the columns. That's what they mention, and they actually tried both things. They figured out that tying doesn't hurt performance but it saves memory, obviously, because you don't need C MLPs; you can have one single MLP and just apply it C times. But again, I think there is nothing theoretically guaranteeing that this is a better choice; it's just an experiment.

An important detail is that Mixer does not use position embeddings, because the token-mixing MLPs are sensitive to the order of the input tokens and therefore may learn to represent location. Let's make a direct comparison with the transformer. In transformers, as you may recall, these are the patches embedded into the latent space. Take this token, for example: it's going to attend to all of the other tokens. We create the query, key, and value vectors, we attend to all of the other tokens, we form those alpha coefficients (the attention coefficients), and we sum everything up. Because we're summing, we are losing the
positional information, and because of that the original Vaswani paper and its successors had to add positional information. That somehow encodes that this position is different, so maybe like this, and then the second position is maybe encoded like this, whatever: you need some unique pattern that identifies each of the positions. But here, because we have MLPs, we don't need to do that. The reason is this: if you take a look at the matrix, we have S here and C here. If you apply the token-mixing MLP across this column, then, because it's an MLP, it has a fully connected layer, right? So it's going to do something like this: it attends to all of these positions, this weight here is some weight w11, and that gives the first element of the output vector. The second element of the output vector attends like this, again fully connected, and this is becoming a mess really quickly, but you get the idea. Let me take another color: this one here is going to be some w12, and you have a collection of these weights which are particular to this token. So the network directly learns weights for that token, which corresponds to a certain position in the image. This is the input image, and maybe this token is this patch here, so these weights learn how to encode the information from this particular patch. That's why they don't need any position encodings. Hopefully that was clear enough; if not, let me know in the comments and I'll try to explain it further.

OK, so that was the detail I wanted to explain. The final thing I want to mention is this here: some dark magic, the cosine learning rate schedule. I really wonder why we are still using these without any theoretical justification. A small footnote on why they use that particular schedule and not something else, like a
simple linear schedule or even a constant learning rate, would be really appreciated.

And again, "following common practice", and I highlight "common practice", "we also fine-tune at higher resolutions with respect to those used during pre-training". What happens here is that people have shown that when you train for vision benchmarks, you pre-train at a certain resolution, maybe 224 by 224, and when you fine-tune you increase the resolution, and that helps boost your performance; so you fine-tune at 384 by 384, for example. I highlight "common practice" because I'm not sure we have an understanding of why this makes things work better. We are just taking legacy ideas and legacy heuristics and keeping them in all of the present research. I guess we sometimes have to do that, because otherwise the research would fall apart; you have to cling to something. But it'd be nice to have more papers explaining why these things work, and not just combining them like black boxes.

There is one more detail I want to mention, and then we are basically done. Because they are doing this upscaling during fine-tuning, they somehow have to adjust the weights of the MLPs. The reason is, if you remember, that the input image can be represented as S times C, and the token mixer attends across the number of patches. When we increase the resolution but keep the patch size the same, we get more patches, so S goes to S prime, which is a bigger number. That means the input won't fit into the previous MLP, which had a certain width; it could maybe attend like this, it was smaller. So now we have to take the weight matrix of that
old MLP and somehow initialize the new, bigger weight matrix. What they did is take these and stack them block-wise here, and that was the initialization method they used to fine-tune the models. It's just an engineering detail, but I thought it was interesting and important to mention that a lot of these details go into making this thing work. The reason this works, if you think about it, is that if you now multiply the input vector by this new matrix, this block is going to attend to the first part and this block is going to attend to the second part, so it's like applying them in parallel. That seemed like a sound way to initialize the new weight matrix, and that's the main reason they did it like this.

Hopefully you found this video useful. If you did, leave a like, share the video, and see you next time.
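
The block-wise initialization described at the end of the transcript can be illustrated with a small sketch. This is not the paper's code. It assumes a toy token-mixing MLP without biases, a hidden width that grows by the same factor K as the number of patches, and hypothetical helper names (`token_mlp`, `block_diag`, `rand_matrix`). It checks that the enlarged MLP, applied to a longer column of tokens, gives the same output as applying the original MLP to each chunk of the column separately, which is the "applying them in parallel" intuition. How the real 2D patches would be grouped into chunks is not modeled here.

```python
import math
import random


def gelu(v):
    """Exact GELU for a single scalar."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def token_mlp(W1, W2, x):
    """Toy token-mixing MLP on one channel column: W2 @ gelu(W1 @ x), no biases."""
    return matvec(W2, [gelu(h) for h in matvec(W1, x)])


def block_diag(W, k):
    """Place k copies of W on the diagonal of an otherwise zero matrix."""
    rows, cols = len(W), len(W[0])
    out = [[0.0] * (cols * k) for _ in range(rows * k)]
    for b in range(k):
        for i in range(rows):
            for j in range(cols):
                out[b * rows + i][b * cols + j] = W[i][j]
    return out


def rand_matrix(rng, rows, cols):
    """Random matrix with entries in [-1, 1], standing in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
S, D_S, K = 4, 3, 2  # toy sizes (assumed): S patches, hidden width D_S, K times more patches

W1 = rand_matrix(rng, D_S, S)  # first fully connected layer: S -> D_S
W2 = rand_matrix(rng, S, D_S)  # second fully connected layer: D_S -> S

# At the higher fine-tuning resolution the token column has K * S entries,
# so both weight matrices are enlarged with copies of the old weights.
W1_big = block_diag(W1, K)  # shape (K * D_S) x (K * S)
W2_big = block_diag(W2, K)  # shape (K * S) x (K * D_S)

x_big = [rng.uniform(-1.0, 1.0) for _ in range(K * S)]
chunks = [x_big[b * S:(b + 1) * S] for b in range(K)]

y_big = token_mlp(W1_big, W2_big, x_big)
y_chunks = [v for c in chunks for v in token_mlp(W1, W2, c)]

max_diff = max(abs(a - b) for a, b in zip(y_big, y_chunks))
print(len(W1_big), len(W1_big[0]), max_diff < 1e-12)  # prints: 6 8 True
```

Plain Python lists keep the sketch dependency-free. With an array library, the same enlarged matrix could be built as the Kronecker product of a K-by-K identity matrix with the original weight matrix.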

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany

In this video I explain the MLP-Mixer: An all-MLP Architecture for Vision paper, aka MLP is all you need.

✅ paper: https://arxiv.org/pdf/2105.01601.pdf
✅ Sutton's blog: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

⌚️ Timetable:
00:00 We've gone the full circle
01:50 Bitter lessons by Sutton
02:50 Architecture overview
06:45 Rant, rant, rant
08:20 Results
11:00 Pareto frontier
15:10 Visualization of learned weights
18:30 Decrypting equations
21:20 No positional encodings
24:10 Dark magic, initializing weight matrices during fine-tuning

💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️

Huge thank you to these AI Epiphany patreons:
Petar Veličković
Zvonimir Sabljic

💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".

👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/

👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
Discord ► https://discord.gg/peBrCpheKE

📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► https://aiepiphany.substack.com/

💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► https://github.com/gordicaleksa

📚 FOLLOW ME ON MEDIUM:
Medium ► https://gordicaleksa.medium.com/

#mixer #mlp #allyouneed


DeepMind's Android RL Environment - AndroidEnv

Aleksa Gordić - The AI Epiphany

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

Aleksa Gordić - The AI Epiphany

Non-Parametric Transformers | Paper explained

Non-Parametric Transformers | Paper explained

Aleksa Gordić - The AI Epiphany

Chip Placement with Deep Reinforcement Learning | Paper Explained

Chip Placement with Deep Reinforcement Learning | Paper Explained

Aleksa Gordić - The AI Epiphany

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Aleksa Gordić - The AI Epiphany

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Aleksa Gordić - The AI Epiphany

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

Aleksa Gordić - The AI Epiphany

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

Aleksa Gordić - The AI Epiphany

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

Aleksa Gordić - The AI Epiphany

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Aleksa Gordić - The AI Epiphany

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

Aleksa Gordić - The AI Epiphany

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

Aleksa Gordić - The AI Epiphany

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

Aleksa Gordić - The AI Epiphany

DETR: End-to-End Object Detection with Transformers | Paper Explained

DETR: End-to-End Object Detection with Transformers | Paper Explained

Aleksa Gordić - The AI Epiphany

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

Aleksa Gordić - The AI Epiphany

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

Aleksa Gordić - The AI Epiphany

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Aleksa Gordić - The AI Epiphany

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Aleksa Gordić - The AI Epiphany

The MLP-Mixer paper introduces an all-MLP architecture for vision tasks that reaches results comparable to transformers and CNNs, at similar or slightly lower compute cost and with competitive throughput. The architecture alternates token-mixing and channel-mixing MLPs, and is pre-trained on large datasets such as ImageNet and JFT-300M (Google's proprietary dataset of about 300 million images). This video explains the paper and its key contributions.

Key Takeaways

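Pre-train the MLP-Mixer model on a large dataset such as ImageNet or JFT-300M
Alternate token-mixing MLPs (which mix information across patches) and channel-mixing MLPs (which mix information across the features of each patch) to reach results comparable to transformers and CNNs
Fine-tune the pre-trained model on the target task for better performance
When fine-tuning at a higher resolution, resize the token-mixing MLP weights, because the number of patches changes
Initialize the new, larger weight matrix by stacking copies of the old weights block-wise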
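
The last two takeaways can be made concrete with a small sketch. It assumes that "stacking block-wise" means a block-diagonal layout, with copies of the old token-mixing weights on the diagonal and zeros elsewhere. The function name `block_diagonal_init` and the toy sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def block_diagonal_init(w_old: np.ndarray, copies: int) -> np.ndarray:
    """Return a larger matrix with `copies` copies of w_old on its diagonal.

    Assumption: "stacking block-wise" is read here as a block-diagonal
    layout; every entry outside the diagonal blocks starts at zero.
    """
    rows, cols = w_old.shape
    w_new = np.zeros((copies * rows, copies * cols), dtype=w_old.dtype)
    for i in range(copies):
        # Copy the old weights into the i-th diagonal block.
        w_new[i * rows:(i + 1) * rows, i * cols:(i + 1) * cols] = w_old
    return w_new


# Toy example (hypothetical sizes): 4 patches, token-mixing hidden width 3.
w_old = np.arange(12, dtype=np.float32).reshape(4, 3)

# Doubling the image side length gives 2 * 2 = 4 times as many patches,
# so four copies of the old matrix are placed on the diagonal.
w_new = block_diagonal_init(w_old, copies=4)
print(w_new.shape)  # (16, 12)
```

With this layout each group of new patches initially reuses the pre-trained mixing pattern, and fine-tuning is free to learn the connections between groups. `np.kron(np.eye(copies), w_old)` builds the same matrix in one line.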

💡 The MLP-Mixer architecture can achieve results comparable to transformers and CNNs at similar or slightly lower compute cost, making it a promising direction for vision tasks.
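
To show how token mixing and channel mixing fit together, here is a minimal NumPy sketch of a single Mixer block. It is a simplification under stated assumptions, not the authors' implementation: layer norm has no learnable scale or shift, GELU uses the tanh approximation, the weights are random, and all names (`mixer_block`, `tok`, `ch`) and sizes are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each row over its last axis (no learnable parameters here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # Tanh approximation of GELU, a scalar non-linearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, b1, w2, b2):
    # Two dense layers with a GELU in between, applied over the last axis.
    return gelu(x @ w1 + b1) @ w2 + b2


def mixer_block(x, tok, ch):
    """x has shape (patches, channels); tok and ch are (w1, b1, w2, b2) tuples."""
    # Token mixing: transpose so the MLP acts across patches, then undo it.
    x = x + mlp(layer_norm(x).T, *tok).T
    # Channel mixing: the MLP acts across the channels of each patch.
    x = x + mlp(layer_norm(x), *ch)
    return x


rng = np.random.default_rng(0)
S, C, DS, DC = 4, 8, 6, 16  # patches, channels, and the two hidden widths

x = rng.standard_normal((S, C))
tok = (rng.standard_normal((S, DS)), np.zeros(DS),
       rng.standard_normal((DS, S)), np.zeros(S))
ch = (rng.standard_normal((C, DC)), np.zeros(DC),
      rng.standard_normal((DC, C)), np.zeros(C))

out = mixer_block(x, tok, ch)
print(out.shape)  # (4, 8)
```

The block uses only matrix multiplications, transposes and a scalar non-linearity, which matches the ingredients listed in the transcript. Because the token-mixing weights have shapes tied to the number of patches `S`, they are the ones that need resizing when the input resolution changes.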

Chapters (10)

0:00 We've gone the full circle

1:50 Bitter lessons by Sutton

2:50 Architecture overview

6:45 Rant, rant, rant

8:20 Results

11:00 Pareto frontier

15:10 Visualization of learned weights

18:30 Decrypting equations

21:20 No positional encodings

24:10 Dark magic, initializing weight matrices during fine-tuning
