DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained
Key Takeaways
The video explains the DALL-E paper on zero-shot text-to-image generation, covering its architecture, training, and results. It discusses the use of VQ-VAE and transformers for image generation and the evaluation metrics used to quantify image quality.
Full Transcript
what's up in this video i'm covering zero shot text to image generation from the uh openai team or dali for short and uh this this work came out a couple of months ago but i didn't see anybody covering the paper so i thought walking you through and explaining in detail how this thing uh works and let me just briefly explain what dali is all about and as you can see here uh dali can synthesize these awesome images using just textual prompts so on the left side here we have uh so this is the the the prompt you're using and inputting into delete and so a taper made of accordion a taper with the texture of an accordion and you can see that dali is able to remarkably mix up these concepts into something that looks like uh like a like a plausible drawing that a human would how a human could combine these concepts as well so here we can see some kind of like a merge between the accordion and taper here we can see that the accordion has so this handle is kind of in the taper shaped uh head and yeah it's pretty awesome some other examples involved this one so an illustration of a baby hedgehog in a christmas sweater walking a duck and just like focusing maybe in this one it looks super amazing uh like you can imagine this being used to generate uh ad hoc logos or or or gifs whatever so huge potential like use cases for for this model obviously does make some mistakes at moments so here we can see just a small hedgehog instead of a dog so here it's kind of the face is kind of blurry so it's not perfect but like it's a it's a huge leap forward and i mean i'm loving it um here we have a neon sign that reads backprop a neon sign that reads backprop backprop neon sign so obviously you can see there is a lot of like prompt engineering this is something that somebody called it on twitter like uh programming 3.0 3.0 so basically you have programming 1.0 that's your classical software engineering then you had programming 2.0 that's machine learning where your data is your program and here you're you're finally starting to just kind of hack the the prompts in order to get the desired result from your model and it looks like an emerging paradigm and we'll be seeing more and more of this with gpt models as well etc so uh you can see results are pretty cool you can see backprop here rendered and i mean pretty awesome and finally there are some like uh the like the this emerging property of image translation as well so uh basically uh they say here so the exact same cat on the top as a sketch on the bottom and in this example here you can see it's i mean it got perfectly it got it perfectly right so it kind of just extracted the edges from this picture of a cat here and looks pretty awesome on the other examples it's not that perfect but yeah i mean the very fact that it can actually do this without being explicitly trained to do these kind of stuff so this is just an emerging property same as we had in in gpt models where we had uh we basically trained the model to predict the next token we got like something like machine translation as a side effect etc so yeah same thing happens here with delete basically and um having mentioned gpt you can assume uh that transformers and this is open ai so you can just assume there is a transformer behind this and we'll get to that in a moment let me just uh kind of touch on the on the one of the components of this of this of this model so basically we have so dali consists out of two parts so one is uh the qva model which i by the way i've covered in one of my previous videos as well as vq game paper which is super similar to delete paper except for the gang component so i do recommend you check those out i'll just link them somewhere here but having said that i'll briefly explain what we qva is is in this video so uh here i just want to point out one thing so the qva is basically an autoencoder like uh so it's a vae with uh like a quantized uh discrete latent space and it has this like encoder like structure right so we have a bottleneck there and then we reconstruct back the image so this is your input image goes here the reconstructed image let's call it r comes out here so here you can see that this is the input image and after it goes through the whole vq va pipeline out comes this kind of blurry image so the the global like composition is still here everything is still here but you can see that it's blurry and that's something you don't see with gans gans have crispy clear output and here as well so we have this text here and it's kind of blurred out here in the in the reconstruction and similarly for these squiggly lines some of them even get lost so you can see here if we focus on this one you can see that like it kind of totally disappeared okay so it's not perfect but like uh that the reason they do this is so they can model because it's really hard to model the image directly in the in the in the pixel space that's the whole idea and i'll get to that in a moment okay so i mentioned transformers let's let's see so this is opening i don't forget so when compute model size and data are scaled carefully all regressive transformers so this is the original transformer paper which i've covered in one of my videos you can check it out as well uh you have achieved impressive results in several domains such as text so this is your these are the gpt family of models so gpt one two and three uh images so this is your image apt from opening eye and audio so this is jukebox uh paper from open ai as well so you can see a trend here uh what they do is they they take a different modality so for example text they take a transformer they scale the data they scale the transformer and they get awesome results and then we have images similar things happen with images so basically they also used a transformer they modeled images directly in the pixel space and the consequence was that they could only generate up to i think two like 64 times 64 images don't get don't um i may be wrong here but i think it's something like that so in any case it's a super small resolution compared to two nowadays we have generating models such as stylegam v2 where we have images that are like a million pixels etc uh so and finally we have audio uh again they used a similar approach to this paper so again we i think they use wikivae and a transformer they scaled it up and they generate they managed to generate some nice uh like music etc okay so question is they ask is uh could data set size and model size be the limiting factor of current approaches so here we're modeling both the text and the image images jointly and so they say they train a 12 billion parameter or regressive transformer and 250 million image text pairs so that's it so data plus big transformers and you get awesome results that's that's the the the main uh like kind of light motif here uh okay so briefly uh explaining the details of of this uh like pipeline that they have two stages in the first stage they train the vq vae model in the second stage they train this all regressive transformer directly in the latent discrete latent space of the vqbe so let's kind of just briefly touch on it as i said i have a whole video on this you can check it out uh but let me just kind of briefly explain it so how you train this thing is you have an image you you pass it through this this will be some kind of a cnn so some kind of encoder uh outcome uh these uh these vectors you'll have for example this will be like one vector here and just this will be the other dimensions along this side and what you do is you kind of snap that that so this will be just like a collection of feature maps right so you just take this vector and you'll find the closest one in this in this code book of vectors so this is just your embedding table you find the closest one according to l2 distance and you snap it to that one so you can see for example if this one was maybe the closest code book vector was one here you can see we snap it to one so we have just a symbol one here and then later you can just kind of index using these indices into the codebook vector and you can actually find the actual vector so you'll put e1 here so that's this vector so this vector will replace this one and then you pass these these vectors these quantized vectors into decoder which is also just some kind of transposed convolutions and you get uh the cnl you get the the reconstructed image back so uh the loss is uh fairly simple they just have reconstruction loss so they used msc so that's mean squared error plus they have some loss that encourages encourages the codebook vectors to be close to these vectors and these vectors to be close to codebook vectors so uh the one difficulty is it's hard to it's not hard it's impossible to actually uh backprop through this thing because of this uh quantization step and so what they do is they whatever the gradients for these vectors here are uh they just copy paste those as you can see this by this red line you just copy paste those into the encoder and then because this part is differentiable you can just spec prop through the cnn and figure out all of the other gradients but that's a difficulty so this part because of the quantization that's one of the main things they had to solve and yeah that's that's in a nutshell how how it works so in my vqbae video i even went through some code so if you're still confused how this exactly work you can check it out but that's the whole like uh like a rough idea of how the qvae works so once you train uh this model so what you have is you have you you train these embedding vectors you train the encoder you train the decoder and so now the second step is this uh transformer thing so we concatenate up to 256 pp encoded so that's by pairing coding uh text tokens with the 32 by 32 uh that's 1024 image tokens and train an autoregressive transformer to model the joint distribution over the text and image tokens so this 32 by 32 let me just show you show you where that comes up so basically this means that this thing here will have 32 vectors here and 32 here so we'll have 1024 tokens uh like in total and these are image tokens so uh those kind of capture the information about the image and then using decoder you can just decode them back into the pixel space okay um there's this part about where they mentioned elbow i think it's it will be more confusing than than good to explain this but if you ignore why for a second so if we ignore the labels uh the captions sorry if we ignore the captions this is basically your your vae objective so just elbow basically this is the ln so of p of x uh given that that's your reconstruction loss and then depending how you decide to model your your data so if you take gaussian here so basically what will happen is uh because gaussian has some some some structure something like this so we have c we have e raised to the power of blah blah blah we have uh basically x minus mu squared over something sigma doesn't matter but basically when you do ln of this so this is a constant so we don't care this is e these two are so ln and e are just inverse of each other so you end up with x minus mu squared which is your basically msc loss so that's how you end up with msc loss even though you have this abstraction here and i know this part confused me so hopefully this will like give you some epiphany moment um and similarly here uh this is just a kl divergence between the approximate posterior and your prior so that's basically your your your uh like vae objective and you're trying to to maximize this part because in turn because it's a lower bound that means you're you're going to maximize the like the likelihood of your data so you just kind of want to maximize the reconstruction probability and you want to minimize the kl divergence between those two that just in a nutshell i don't want to get into maths uh as i said the vqva video has a better explanation of this of this component as well as some links to some cool blocks which you can check out at your own pace and understand the mathematics behind this but the the confusing part is you actually won't see the kl part especially in vq vae model the reason being is that the kl divergence is constant in this model so it's kind of confusing to it's a nice thing to think about it like this but like for practice from a practical practical standpoint it's just confusing um okay so a couple of details uh where vqgan and uh like vq vae papers differ from delhi so um i mentioned the straight through estimator so that's this part let me go back here so this is called the straight through estimator this red line this is basically your you copy pasting the gradients from the decoder over to the encoder so that you can back prop through the encoder and train your model okay so that's a straight through gradient uh like a gradient estimator and here so they say we instead use the gumball soft max relaxation replacing the expectation or blah blah blah blah blah all in all um this is a modification they made as well as the likelihood for the p sub data is evaluated using the log laplace distribution so i mentioned previously i mentioned this gaussian here so gaussian leads to you uh having mse as an objective in the loss function this thing because they are using plus they'll have some different objective and let me briefly go to the ape index and show you what i mean by this um this is kind of different compared to other papers and i think it's pretty neat idea so the l1 and l2 reconstruction objectives are commonly used when training vaes so l2 was what we saw so msc is basically l2 squared right and so these objectives correspond to using laplace and gaussian distributions for the this term okay and that's what i mentioned so explain the thing with gaussian similarly it goes with laplace uh you'll have the same logic behind it so you'll just end up with uh instead of square you just won't have the square component and there is a strange mismatch in this in this modeling choice pixel values lie within a bounding interval so your pixels will be in the like zero one range right that's your that's your like normal rgb image and they say here but both of these distributions are supported by the entire real line hence some amount of likelihood will be placed outside the admissible range of pixel values and so i just kind of draw uh like pasted the distributions here so your output pixel values will be pretty much here from zero to one no matter how you parameterize this laplace so if you put mu to be equal to 0.5 so somewhere here you're still going to have some non-zero probability being given to values which are outside of the zero one range so this is the range we care about right the zero one range and the these distributions these assumptions that we have that the output like uh random variables are modeled as gaussians or laplace simply do not make that much sense if you think about it and that's why they they replace this so they say here so we represent a variant of laplace distribution which is also supported by a bounded interval uh blah blah so this pdf is defined on zero one and is given by this this pdf okay so what it did is they just applied uh they say somewhere here a sigmoid let me just a sec so yeah so we consider the pdf for random variable obtained by applying the sigmoid function to a laplace distributed random variable and so instead of using mse what you'll now use to in order to do to maximize the likelihood the likelihood of your data so to increase to decrease the reconstruction loss you just you'll just do log of this so just ln of this term and whatever pops out that's what your objective will be like and so that's in contrast with msc loss uh okay and just this is the sigmoid you have domain goes from minus infinity to plus infinity and we have the codomain the the image being from zero to one and that's the reason why the the random variable will now be constrained to be between zero and one uh like contrast that to laplace as i mentioned it it will have values which are outside of the bounded region we are interested in so that was just a small small uh glimpse into these uh details which they they kind of changed compared to the original paper wiki paper let me now get back to where we we ended it i'll skip the explanation of the gumball softmax relaxation i think it's just a detail and there is a whole paper behind this idea so i'll have to skip it but like uh in a nutshell they replace the straight through gradient estimator with this novel gumball soft max relaxation method and now let's jump into stage two okay so i mentioned uh that after we trained the vq vae we have a discrete space and now we want to learn how to uh train a transformer on top of that on top of those image tokens and they say here um given a text image pair we bpe encode the lowercase caption using at most 256 tokens we woke up size of 16 000 and encode the image using 32 by 32 tokens with woke up size 8k okay finally the text and image tokens are concatenated and modeled or regressively as a single stream of data let me try and break this down for you if it's already not clear but like what they do is uh so imagine you have you have a caption so you have some like i'll just draw something like this so we have a caption that's your sum that's some text and you'll have associated image with a caption so something like this will have a squiggly man here random stuff whatever so that's your image and so what happens now is the following so they first lowercase this so let me just denote that as l then they do bpe so that means uh they'll basically tokenize the words into into some smaller units and what they have is they have this vocab size of 16k so that just basically means you have a huge table here and uh that table contains contains 60 16k vectors which are learned um okay so now what this thing how this thing will work like is the following so let's say you have a sentence like i don't know like monkey so monkey monkey king and then we'll have whatever like just three dots okay so this is your input sentence and so bpe will just kind of break down your words maybe into some sub words like i don't know you'll have maybe one you'll have key you have i know maybe key i and ki and so what happens is you'll just take the sub word you'll find the appropriate vector and you'll replace your sub word with this vector and etc etc and so what they say is they have max of 256 tokens so that means if we have more than that we'll just kind of truncate if we have less they have some special like uh tokens which they'll train for that particular purpose where where when we have like those when we have uh like a less number than 256 but that's a minor detail and so that's what happens with the text so we have we end up with uh 256 tokens here okay now the following thing happens with the image so this one just gets fed into the encoder of the vq vae which was previously trained in stage 1 and outcome 32 by 32 tokens so basically this is 32 by 32 and you're kind of gonna unroll these and so this will be your your target so basically you'll just put start of sentence token here and then you'll basically all start off image some special token and then you'll just unroll this thing here okay so just unroll it here and that's your input into your transformer and what the output will be is the following so this very same thing just shifted by one to the left okay so you'll have something like this and then you'll have end of sentence token or end of image in this case okay so this that's it and uh so now how the whole thing works is you have a huge transformer i think they used i think like gpt2 or something uh with some sparse attention maps and so you have a transformer here and then you have your like usual usual usual thing you just pass this in you transform it through the multiple layers of transformer and you'll want to predict these so you'll want to predict this thing um from so from the start of sentence token so in this case these are image tokens but just imagine for a second this was uh like a textual token in that case uh if this was text like monkey king would have m here and we'd have m here and then would the second letter would be o so would have o here so as you can see you want to predict from the from the like the special token you want to predict m from m you want to predict o because o is the next one so you're just trying to predict the next token uh simple stuff pretty much and you can do this in parallel and that's how they train this whole thing so this thing here will be 256 and this thing here will be 1024 and you just train this transformer to kind of learn how to to do this next token prediction and then how you actually use this later in inference is you just prepend this uh caption and you prepend this special token like this one here and then you just start sampling from the model until you get 32 by 32 tokens you just pass that through the decoder and out comes a novel image uh so that caption can be obviously completely novel so the model has never seen the caption during training like the the examples i showed you showed you with taper and hedgehogs so yeah that's how it works in a nutshell um now let me just show you some some results they have uh here so uh they compare they compare obviously the the the delhi paper with different gan uh baselines so like the afghan dmgan attention gan etc and here we have so this is the ground truth caption a very cute cat lying lying by a big bike and here we have the actual validation images so the images that go with these captions so that's the ground truth and if you like take a look at the results you can see two things so first thing is delhi has better outputs like it's just if we focus on this thing here a living room with a tv on top of a stand with the guitars sitting next to next to you okay and basically you can see this looks really cool much better than the ends uh the the main difference is uh like vq vae is as as we saw in the beginning of the video has blurry output that's why we have blurry kind of blurry results here even though they are like composition is much better compared to gans where gans are kind of gans are kind of the chat programmers of the generative modeling world because they're super confident they just output some value super crisp even though it's complete bs and so on the other hand we have vq vaes which have better composite compositionality but they have blurrier results that's that's something you can you can you can take off from from from from this image okay let's let's now continue um one thing i want to mention is and they actually explicitly mention it here is getting to model the model to train in a 16-bit precision past 1 billion parameters and if you remember they have 12 billion parameters so that they have 12 billion parameter transformer without diverging was the most challenging part of this project this just makes you appreciate engineering so much because if you think about it the ideas in this paper are nothing brand novel like vic ugand has super similar ideas uh jukebox all of these papers are kind of variations of each other there is nothing super novel here uh from the research standpoint in my like honest opinion but like from the engineering standpoint there is a lot of things you have to solve in order to get this to work so one of the things was that they they had to train this in 16-bit precision and they had problems with overflows so they had to devise multiple techniques like this one here and they also had to do some engineering hacks in order to kind of avoid the memory bottlenecks they had on their gpus so they couldn't fit the whole model so they had to do this sharding between multiple nodes etc so just keep that in mind um i wouldn't say there is anything super novel research-wise in this paper but just when you scale it up when you collect the data when you invest just a bunch of money a bunch of engineering like just a bunch of dev like ours you get you get some awesome results and i think that's a valuable lesson we don't always have to just invent novel ideas sometimes we just have to scale up and kind of push to the limits the current work we have um what i have here is um if you're familiar with clip if you're not i also create a video on clip you can check it out but basically what it do here is they have this automatic way of filtering the output from the dolly so for example you generate you generate 512 images from delete so we have a bunch of images this is one this is second one this is third one etc so you generate a bunch of them and then you have this model called clip which will let me just draw it like like a circle so clip will just basically you input an image and you input a caption so for example this one a group of urinals is near the trees and clip will tell you so this is clip clip will tell you how likely is this image to be under that caption let's call it that way so that means you can do explicit sorting of these images according to the scores that clip model gave them and you just take you cherry pick the the the ones with high scores and you can see the results here are pretty astounding so if you just have if you just generate a single image you can see it's pretty random it you don't see even the urinals then when you have when you generate eight images and you pick the best one the best one according to clip you get better results and finally if you have 512 images this one looks much better so this one looks way better compared to the ones below okay so yeah this just works um it's just a heuristic to kind of uh get some better results from these generating models uh same goes from some different captions and that's the that's the rough idea again check out the clip video if you're interested in learning about details how that exactly works okay um some additional results they compare this to uh this df can baseline um yep they just get better realism and and accuracy results compared to the baseline nothing fancy there here they they show some results some zero shot samples on this cob data set this cup data set is basically consists of birds and the thing is the the the original that huge data set i mentioned that the the dali model uses has like bunch of images but the distribution of those images and captions is completely different compared to this data set and they show some not that great performance on this on this on this cup data set you can see some birds here it's not perfect and especially when you focus on these quantitative results you can see that uh delhi is much worse compared to other baselines on that data set because as i said it's out of distribution in a way so it doesn't perform that well uh what they do here on these diagrams is they they plot the the inception score and the fid so that's for shay's inception distance those are just some metrics which people use in order to kind of try and quantify how good are the generated images because the problem itself is ill-posed same as neural style transfer same as deep dream those kind of images you kind of you don't have a clear-cut way to determine which image is better so these are some heuristics like is and fid and what they do is on next axis they have a blur kernel radius so what they do is they kind of whatever is generated out of out of dali and other base lines like gans they just kind of blur them with increasingly bigger filters so you have initially you'll have some so you have an image here and hopefully you're familiar with this this is just some basic image processing technique so you can create a kernel like a gaussian and you just kind of like slide over the image and apply the the like the you just do the kernel operation the same as in cnn's you just do like element wise multiplication you add them up you kind of map that to an output pixel and if you do that with a gaussian kernel it'll just kind of blur the image and what they then do is they kind of start increasing these kernels uh and so they they have a bigger one and so uh the the bigger the kernel the the more blurrier the image will be and they show that when they kind of uh blur the image uh they get better results so uh this is the lead and you can see that uh higher is better and so the bigger the blur kernel is the better results we get from the leak compared to other base lines here fid should be smaller so you can see again that um even here after just uh increasing the blur radius to one we have better results compared to baselines why do this the reason is as i mentioned in the beginning of the video vq vaes have plot problems because the outputs are not crispy and so they suspect that these metrics may be penalizing them because of that so in order to get on an even ground they take the gan output they blur it so as to compare to kind of make it fair to to dali and then they kind of calculate the fids and and is i mean this is kind of way to i guess measure the like how high quality the composition of the image is uh and put less focus on the actual details i guess that's the whole point of this on fid they show that the inception score is much lower which is a bad thing you want to have highest inception score and they showed that the fid is higher so that means it's having problems with the cop data set as i already mentioned cop is kind of different distribution of data is different there okay here pretty obvious um fact that's that when we increase the sample size for re-ranking so that's the clip part the more so if you have 512 uh generating images that's going to be way better than you generating just a single image as you can see here the inception score goes up the fid goes down and so that means there is a sweet spot maybe around 128 generated images where you can then automatically select the best one and this just improves the results on both metrics fid and is okay those were the results i guess i guess that's pretty much it let me just go back to the appendix and show you one more thing one more thing so yeah you can see the patterns in the transformer they have just different patterns they're not just simple causal masks and this was introduced in their sparse transformer paper so you can check that out if you if you wish so and i just want to show you some additional results on the image image translation somewhere here okay so the exact same cat on the top color on the top colored red on the bottom so you can see here again perfectly image to image translation even though this task so you now know how the dahle model is trained it's just trained to first train vq vae then you train the other aggressive transformer to just predict the next image token and you can see that uh out comes this this ability to do this kind of stuff as well as so here they mentioned two panel image of the exact same cat on the top a photo of the cat on the bottom the cat with sunglasses and you can see again like just placing some sunglasses onto the cat really awesome uh results i love this here with postcard uh the exact same cat on the top as a postage stamp on the bottom uh pretty awesome results and with that i'll leave you here okay that was it uh hopefully you like this video uh let me know what you think subscribe share out this this content if you like it that's the best way you can support me and until next time bye bye you
Original Description
❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany
In this video I cover DALL-E or "Zero-Shot Text-to-Image Generation" paper by OpenAI team.
They train a VQ-VAE to learn compressed image representations and then they train an autoregressive transformer on top of that discrete latent space and BPEd text.
The model learns to combine distinct concepts in a plausible way, image to image capabilities emerge, etc.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ Paper: https://arxiv.org/abs/2102.12092
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 What is DALL-E?
03:25 VQ-VAE blur problems
05:15 transformers, transformers, transformers!
07:10 Stage 1 and Stage 2 explained
07:30 Stage 1 VQ-VAE recap
10:00 Stage 2 autoregressive transformer
10:45 Some notes on ELBO
13:05 VQ-VAE modifications
17:20 Stage 2 in-depth
23:00 Results
24:25 Engineering, engineering, engineering
25:40 Automatic filtering via CLIP
27:40 More results
32:00 Additional image to image translation examples
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation:
https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️
Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković
Zvonimir Sabljic
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/
👨👩👧👦 JOIN OUR DISCORD COMMUNITY:
Discord ► https://discord.gg/peBrCpheKE
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Aleksa Gordić - The AI Epiphany · Aleksa Gordić - The AI Epiphany · 55 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
▶
56
57
58
59
60
Intro | Neural Style Transfer #1
Aleksa Gordić - The AI Epiphany
Basic Theory | Neural Style Transfer #2
Aleksa Gordić - The AI Epiphany
Optimization method | Neural Style Transfer #3
Aleksa Gordić - The AI Epiphany
Advanced Theory | Neural Style Transfer #4
Aleksa Gordić - The AI Epiphany
Anyone can make deepfakes now!
Aleksa Gordić - The AI Epiphany
What is Computer Vision? | The Art of Creating Seeing Machines
Aleksa Gordić - The AI Epiphany
Feed-forward method | Neural Style Transfer #5
Aleksa Gordić - The AI Epiphany
Alan Turing | Computing Machinery and Intelligence
Aleksa Gordić - The AI Epiphany
Feed-forward method (training) | Neural Style Transfer #6
Aleksa Gordić - The AI Epiphany
What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1
Aleksa Gordić - The AI Epiphany
Semantic Segmentation in PyTorch | Neural Style Transfer #7
Aleksa Gordić - The AI Epiphany
How to get started with Machine Learning
Aleksa Gordić - The AI Epiphany
How to learn PyTorch? (3 easy steps) | 2021
Aleksa Gordić - The AI Epiphany
PyTorch or TensorFlow?
Aleksa Gordić - The AI Epiphany
3 Machine Learning Projects For Beginners (Highly visual) | 2021
Aleksa Gordić - The AI Epiphany
Machine Learning Projects (Intermediate level) | 2021
Aleksa Gordić - The AI Epiphany
Cheapest (0$) Deep Learning Hardware Options | 2021
Aleksa Gordić - The AI Epiphany
How to learn deep learning? (Transformers Example)
Aleksa Gordić - The AI Epiphany
How do transformers work? (Attention is all you need)
Aleksa Gordić - The AI Epiphany
Developing a deep learning project (case study on transformer)
Aleksa Gordić - The AI Epiphany
Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained
Aleksa Gordić - The AI Epiphany
GPT-3 - Language Models are Few-Shot Learners | Paper Explained
Aleksa Gordić - The AI Epiphany
Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)
Aleksa Gordić - The AI Epiphany
Attention Is All You Need (Transformer) | Paper Explained
Aleksa Gordić - The AI Epiphany
Graph Attention Networks (GAT) | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
Graph Convolutional Networks (GCN) | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained
Aleksa Gordić - The AI Epiphany
OpenAI CLIP - Connecting Text and Images | Paper Explained
Aleksa Gordić - The AI Epiphany
Temporal Graph Networks (TGN) | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
Graph Neural Network Project Update! (I'm coding GAT from scratch)
Aleksa Gordić - The AI Epiphany
Graph Attention Network Project Walkthrough
Aleksa Gordić - The AI Epiphany
How to get started with Graph ML? (Blog walkthrough)
Aleksa Gordić - The AI Epiphany
DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained
Aleksa Gordić - The AI Epiphany
AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained
Aleksa Gordić - The AI Epiphany
DeepMind's AlphaGo Zero and AlphaZero | RL paper explained
Aleksa Gordić - The AI Epiphany
OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained
Aleksa Gordić - The AI Epiphany
MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained
Aleksa Gordić - The AI Epiphany
EfficientNetV2 - Smaller Models and Faster Training | Paper explained
Aleksa Gordić - The AI Epiphany
Implementing DeepMind's DQN from scratch! | Project Update
Aleksa Gordić - The AI Epiphany
MLP-Mixer: An all-MLP Architecture for Vision | Paper explained
Aleksa Gordić - The AI Epiphany
DeepMind's Android RL Environment - AndroidEnv
Aleksa Gordić - The AI Epiphany
When Vision Transformers Outperform ResNets without Pretraining | Paper Explained
Aleksa Gordić - The AI Epiphany
Non-Parametric Transformers | Paper explained
Aleksa Gordić - The AI Epiphany
Chip Placement with Deep Reinforcement Learning | Paper Explained
Aleksa Gordić - The AI Epiphany
Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained
Aleksa Gordić - The AI Epiphany
Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained
Aleksa Gordić - The AI Epiphany
GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained
Aleksa Gordić - The AI Epiphany
VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained
Aleksa Gordić - The AI Epiphany
VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained
Aleksa Gordić - The AI Epiphany
Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained
Aleksa Gordić - The AI Epiphany
Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers
Aleksa Gordić - The AI Epiphany
AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained
Aleksa Gordić - The AI Epiphany
RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained
Aleksa Gordić - The AI Epiphany
DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained
Aleksa Gordić - The AI Epiphany
DETR: End-to-End Object Detection with Transformers | Paper Explained
Aleksa Gordić - The AI Epiphany
DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!
Aleksa Gordić - The AI Epiphany
DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained
Aleksa Gordić - The AI Epiphany
Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained
Aleksa Gordić - The AI Epiphany
Fastformer: Additive Attention Can Be All You Need | Paper Explained
Aleksa Gordić - The AI Epiphany
More on: Reading ML Papers
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
I Spent Weeks Looking for a Research Gap Before I Realized I Was Searching the Wrong Way
Medium · AI
ICMI 2026 Reviews [D]
Reddit r/MachineLearning
Workshop submission for main conference paper under review [D]
Reddit r/MachineLearning
Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]
Reddit r/MachineLearning
Chapters (14)
What is DALL-E?
3:25
VQ-VAE blur problems
5:15
transformers, transformers, transformers!
7:10
Stage 1 and Stage 2 explained
7:30
Stage 1 VQ-VAE recap
10:00
Stage 2 autoregressive transformer
10:45
Some notes on ELBO
13:05
VQ-VAE modifications
17:20
Stage 2 in-depth
23:00
Results
24:25
Engineering, engineering, engineering
25:40
Automatic filtering via CLIP
27:40
More results
32:00
Additional image to image translation examples
🎓
Tutor Explanation
DeepCamp AI