Basic Theory | Neural Style Transfer #2
Key Takeaways
This video introduces the basic theory of neural style transfer: how a VGG network yields separate content and style representations of an image, and how the content and style losses built on them are combined.
Full Transcript
Welcome to the second video in this video series on neural style transfer, where you're gonna learn how to do this. Let's jump into the video. This video will give you a deeper understanding of the basic neural style transfer theory, but before I go there I'd like to give you an overview of the whole series. If you only came for this video, feel free to skip directly to it.

So let's start. The last video was more of a teaser, showing you all the things that neural style transfer can actually do. As I already mentioned, this one will be about basic theory, and the third one will be about static image neural style transfer using the optimization method, with the L-BFGS or Adam numerical optimizers. In general, the first part of the series will focus on static image style transfer, whereas the second part will focus on videos. The fourth will be pretty much the appendix to this video, on more advanced neural style transfer theory. The fifth one will focus on not using the optimization method but using CNNs: you just plug in an image as an input and you get a stylized image out. I'm also gonna teach you how to train your own models so that you can use different styles. Then we'll talk a little bit about segmentation, which will help you stylize only certain portions of the image. Then we'll jump into the videos part, starting with primitive video, where we're gonna learn how to apply it on a per-frame basis without using any temporal loss. But then we'll start including the temporal loss itself inside the models, and we'll get a much more stable model there, and the output also, of course. The tenth will focus on training those models, and the last one in this series will be about trying some other families of models, like MobileNets, EfficientNets, some state-of-the-art models, to see if that gives us better results in general.

Next up, I want to tell you more about what I want this series to actually
be. I want it to be code heavy: it's gonna be really practical, except for this video and the advanced theory one. And I want to keep it simple. I want to only use PyTorch as the framework and Python as the programming language. So no dual boots, no system-dependent scripts, no exotic languages such as Lua, no obsolete frameworks such as Torch or Caffe, and no TensorFlow, even though it's still relevant, especially with the 2.0 version. I just want to pick one, and I think PyTorch is winning the battle, and it's much nicer to write in. The code will be shared through my GitHub repo. I wanna make it really simple: you can just git clone my repo, create the environment using my environment file, and that's it, you can start playing straight away. So that's the end of the series overview.

Now let's jump straight into the video itself, and let me start off with defining what the actual task will be. We get one image as the first input, which contains the content that we want to preserve. We get a second image, which has the style that we want to transfer to this content image. We combine them, where the plus denotes the neural style transfer transform, and what we get out is a composite image that's a stylized version of the content image. And that's it, that's the task.

Next up, let's see two basic types of style transfer. The first one is the one I'll be showing you in this video series. It's artistic style transfer, where the style image we want to use is actually an artistic image; it can be a cartoon, a drawing, a painting, whatever. The second type of style transfer is photorealistic style transfer, where both of the images are actually real, and we just try to mimic the style of one of these onto the other and get a composite image out, as you can see here on the screen.

I thought it'd be worth including some history here. Basically, there's a difference between
style transfer and neural style transfer. Style transfer is something that's been going on for decades now already, and neural style transfer is just the same thing but using neural nets. It all started in the nineties, pretty much, where people were using simple signal processing techniques and filters to get stylized images out, like the one you see here. Then in the 2000s they started applying patch-based methods, like the one here called Image Analogies, where you need to have image pairs, so the content image and the stylized version, and then given a new content image you can stylize it the same way as the pair that was previously given. This method gave some decent results, but only in 2015 did we get to neural style transfer, i.e. applying convnets to do the same thing of transferring style, and it outperformed every other approach previously developed.

And now to the core NST algorithm itself. So where did it all start? It all started in 2015, when Leon Gatys and his colleagues published this research paper titled "A Neural Algorithm of Artistic Style". The key finding of the paper was that the content and the style representations can be decoupled inside a CNN architecture, and specifically the VGG net played a key role in this paper. You can see the architecture on the screen.

A bit more detail on the VGG network itself. It was trained on the ImageNet dataset for the tasks of image classification and object localization, but it actually wasn't the winner of that year's classification challenge. It was the first runner-up; the network that won the competition was GoogLeNet. But VGG did win the localization task.

Let's see what role VGG had in this NST paper. It helped create a rich and robust representation of the semantics of the input image. So how we find the content representation is the following: we take some image as an input, we feed it through the CNN, the VGG net here, and we take the feature maps from a certain
layer, let's say conv4_1, and those feature maps are what represents the content of the input image. It's really that easy. Just for the sake of making feature maps less abstract, let's see what they actually look like for this concrete image, for this lion image. You can see why they're called feature maps: they can basically be interpreted as images, and they contain either low-level details, such as edges and stuff like that, or high-level details, depending on which layer of the VGG net, or in general of a CNN, you extract them from.

Okay, and now for the fun part. Let's take a Gaussian noise image, which will eventually become the stylized image that we want, and feed it through the VGG. We'll get its content representation, which is currently rubbish, but let's see how we can drive it so that it has the same representation as the input image. So we take those two images, we feed them through the VGG, and we get their feature maps, which are, as I already mentioned, the current content representation of those two images. We can flatten those feature maps, so each feature map becomes a row in this output matrix. Now we need to drive those P and F matrices to be the same, and we accomplish that using this loss, which is a simple MSE loss, where you just take an element-wise subtraction and do element-wise squaring on those elements. We just try to drive that loss to zero, i.e. we try to minimize it.

And now for the really fun part: we'll now see what happens when we drive the loss down to zero. I'll just give you a couple of seconds to watch the animations. What you can see on the screen is, on the left side, that the F matrix is getting closer to the P matrix, which is equivalent to the loss getting down to zero. On the right you can see the optimization procedure itself, the noise image slowly becoming the input image. The bottom animation is just the whole
optimization procedure, whereas the upper animation is just the initial part of the optimization procedure in slow motion, because it happens really fast using the L-BFGS optimizer. What you can see on the next screen is that L-BFGS is much faster than the Adam optimizer: in only a hundred iterations L-BFGS already seems to be morphing this noise image into the content image, whereas Adam is only just beginning to do that.

Now for the second most important idea in this video, and that's how we capture the style of an image, so how we find its style representation. We have this input style image, we feed it through the VGG net, and we get a set of feature maps, this time taking those starting from layer conv1_1 and going through layer conv5_1. What we do is construct this feature space over these feature maps using something called the Gram transform. So we create Gram matrices out of those feature maps, and the set of those Gram matrices is what ultimately represents the style of the image, or let's call it the style representation.

Now you might ask: what's a Gram matrix? And that's a legit question. I took a style image as an input, I fed it through the VGG, and from one of those layers I constructed a Gram matrix, and this is exactly how it looks. It answers an important question, and that is: which feature maps tend to activate together? We already saw how the feature maps look in one of the previous slides, and now we have an answer to this question. So it's a simple covariance matrix between different feature maps, and the way we calculate an element in this matrix is by just doing a dot product between two feature maps. And that just captures the texture information, as it turns out.

Let me give you some more intuition behind why the Gram matrix actually works. Here's a hypothetical example where on the upper row we have a hypothetical output of three feature maps, and on the bottom row the
same thing, just for some other input image. If we took an element-wise subtraction between those two rows, we would get a nonzero output, which means we have a nonzero content loss, which means that the input images have different semantics, different content, right? The dog is upside down on the bottom row. But on the other hand, if you take a look at the right side, you'll see that the Gram matrices are actually the same, which means that the two input images are stylized in the same manner, which is kind of true if you take a look at it, and the style loss will be 0 because of that.

Let's make it more explicit how we calculate the style loss. We have the input style image, we have the input noise image, and we feed them through the VGG. We'll get a set of feature maps; here, for simplicity, I am only showing one set of feature maps. We construct Gram matrices over those feature maps, and what we do is just a simple MSE loss again, which is just an element-wise subtraction followed by element-wise squaring. The final style loss is actually just a weighted sum of those terms for every layer, and that's it.

Now I'll do the same thing as I did for the content image: let's see what happens when the style representation of the input noise image becomes the same as the style representation of the input style image. I'll give you a couple of seconds to just watch the animations. Watch closely, and you can see there's a spike there. What happens is, on the left side, G represents the set of Gram matrices of the input noise image and A represents the set of Gram matrices of the input style image, and they are getting closer to each other. On the right side you can see an animation where what is initially an input noise image is slowly becoming stylized. It's capturing the style of the input style image, although it's disregarding the semantics. It's just capturing the style, that's it.

And now putting it all together. The total loss is a weighted combination of a content
loss and the style loss, and what it basically says is the following: we want the input noise image to have the same style representation as the input style image and to have the same content representation as the input content image. That objective might not be fully minimized, because (a) there does not exist a solution, or (b) we cannot find a solution, but we'll still get the visually appealing image that we want. Just take a look at the animation here: the lion is slowly appearing in that style image, and we are getting the composite image out that we wanted, and that was the whole point of this video.

So that's it for the second video. If you like my content, consider subscribing, gently push that like button, and see you in the next video. [Music]
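
The losses described in the transcript can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the series (which uses PyTorch): the function names, the `alpha` and `beta` weights, and the tiny hand-made feature maps are all hypothetical, and the normalization constants used in the paper are omitted. `P` and `F` follow the video's notation for the flattened feature maps of the content image and the noise image. Note that the Gram matrix here is built from plain dot products, so strictly speaking it is an uncentered version of the covariance mentioned in the video.

```python
# Illustrative sketch: tiny hand-made "feature maps", not real VGG activations.
# All function names and the alpha/beta weights are assumptions for this example.

def mse(a, b):
    """Mean of element-wise squared differences of two same-shaped matrices."""
    squared = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(squared) / len(squared)

def gram_matrix(features):
    """Entry (i, j) is the dot product of flattened feature maps i and j."""
    return [[sum(x * y for x, y in zip(fi, fj)) for fj in features] for fi in features]

def content_loss(F, P):
    """MSE between the flattened feature maps of the two images."""
    return mse(F, P)

def style_loss(feats, target_feats, layer_weights):
    """Weighted sum over layers of the MSE between Gram matrices."""
    return sum(
        w * mse(gram_matrix(f), gram_matrix(t))
        for w, f, t in zip(layer_weights, feats, target_feats)
    )

def total_loss(F, P, feats, target_feats, layer_weights, alpha=1.0, beta=1000.0):
    """Weighted combination of the content loss and the style loss."""
    return alpha * content_loss(F, P) + beta * style_loss(feats, target_feats, layer_weights)

# Two feature maps (rows), each flattened to 4 spatial positions.
P = [[1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 3.0, 0.0]]
# The same activations with spatial positions reversed, like the flipped dog.
F = [row[::-1] for row in P]

print(content_loss(F, P))                 # 2.25
print(gram_matrix(P))                     # [[5.0, 2.0], [2.0, 10.0]]
print(gram_matrix(F) == gram_matrix(P))   # True
print(style_loss([F], [P], [1.0]))        # 0.0
print(total_loss(F, P, [F], [P], [1.0]))  # 2.25
```

Reversing the spatial positions changes the content loss but leaves the Gram matrix untouched, which mirrors the upside-down dog example: the Gram matrix records which feature maps activate together, not where they activate.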
Original Description
❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
The second video in the neural style transfer series! 🎨
You'll learn about:
✔️ The basic theory behind how neural style transfer works
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
I hope the video provides you with a strong basic intuition and understanding, but for those of you who want to take it further here are some additional materials relevant to this video:
papers ►
✔️(original NST paper, arxiv, old) https://arxiv.org/pdf/1508.06576.pdf
✔️(original NST paper, CVPR, new) https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf
blogs/articles ►
✔️ History of style transfer (3 part blog series by Adobe's Aaron Hertzmann) https://research.adobe.com/news/image-stylization-history-and-future/
✔️Nice overview of fast NST algorithms https://www.fritz.ai/style-transfer/
Note: It seems that YouTube's video transcoding process messed up the intro and outro NST clips - they look much nicer and higher quality on my machine.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 - intro & NST series overview
02:25 - what I want this series to be
03:30 - defining the task of NST
04:01 - 2 types of style transfer
04:43 - a glimpse of the image style transfer history
06:55 - explanation of the content representation
10:10 - explanation of the style representation
14:12 - putting it all together (animation)
[Credits] Music:
https://www.youtube.com/watch?v=J2X5mJ3HDYE [NCS]
[Credits] Images:
Found the useful Gram matrix intuition image in this blog: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-neural-style-transfer-ef88e46697ee
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donations: https://www.paypal