Developing a deep learning project (case study on transformer)

Aleksa Gordić - The AI Epiphany · Beginner · 🧠 Large Language Models · 5y ago

Skills: LLM Engineering 80% · LLM Foundations 70% · Fine-tuning LLMs 60%

Key Takeaways

The video discusses the development of a deep learning project, specifically a transformer model, and shares the creator's personal story and experiences. It covers the implementation of the transformer model, training, and optimization, as well as the importance of learning, applying, and being consistent for success.

Full Transcript

Finally, finally, I've open-sourced my implementation of the original transformer paper, Attention Is All You Need. It took me about three weeks of non-stop work, and I'm super happy: I learned a lot, and hopefully you'll find it a valuable resource as well. So let me just walk you through this README. This is the architecture, I have some nice visualizations here, and I have two pre-trained models which I've linked.

Before I get into the code itself, I want to tell you a couple of things about the whole journey I had learning about this. Those of you who have been following me over the last couple of weeks know that I've been working on transformers for more than a month now. I first started by just reading about the theory, reading research papers for more than two weeks, and after I finished that I started coding the implementation from scratch, which took me around three weeks, more or less.

So let me start with why I did it. I looked for good resources to understand transformers, not on the level of theory but to actually be able to understand the code itself, and I basically only found two good resources. I'm not counting Hugging Face, because they do have a really awesome transformers library, but it's more for black-box usage and not for going through the code and understanding how stuff works. The two resources I mentioned are The Annotated Transformer and PyTorch's official implementation. Now, the thing with PyTorch's official implementation is that their multi-head attention module is awesome, functionality-wise it works, but it's so generic, because they've got to cover a bunch of use cases, and so it's so complicated to
actually go through the code and understand how it works. So my idea was to just focus on the specific use case of this transformer model from 2017, and that's it. On the other hand, The Annotated Transformer blog is basically suited only for researchers, and it also had a couple of bugs. It's an overall really nice resource and it helped me a lot, but it's really only for researchers, and they also kind of assume you understand the source code of torchtext, which is PyTorch's module for NLP tasks, and that's usually not the case, right?

So I want to make a couple of points before we take an overview of the code and I explain how I struggled, which problems I had, and how I approached them. I won't get into every single detail, because the code base is already pretty big. I'll show you some details, but before that I want to tell you that making good projects takes time.

I had a couple of comments. I had Rahul commenting: "Thanks for telling us that understanding and coding a new AI algorithm takes a few months. I always felt like people are going too fast and doing things in just days and I cannot keep up, but thank you, it gives me motivation." So thank you, Rahul. It's funny: my first attempt at reconstructing a research paper took three months. I was coding for three months non-stop, and the reason was that the authors did not open-source their implementation. The paper wasn't as popular as the transformer, so I couldn't find snippets of code anywhere. What I ended up doing was trying to reconstruct the paper just by reading the text, which is super hard and takes time, and I had lots of back and forth with the authors to understand certain details. So the point I want to make here is: don't fall for that "learn huge topic X in five days". I mean that doesn't work like
that. And I think I know why that's the case. You have Siraj Raval, who had videos like "learn TensorFlow in five minutes", and I want to tell you something: if you believe that crap, you're in a really bad position, because that's going to set you up for failure. Why? Because it's basically setting you up for impatience, whereas you truly need patience in order to create anything great. The projects I'm doing at Microsoft are multi-team efforts that last at least six months, one year, two years, three years. Anything significant takes time, in artificial intelligence, in computer science, whatever the topic is. I skipped creating a video last weekend because I was so in the flow of coding this transformer thing up that I didn't want to break my flow to create a video. That's how things function: you've got to make trade-offs sometimes.

And since we're talking about time management, let me walk you through what my commit history looked like and how I organized my time during these three weeks. I've noticed a lot of comments from people thinking everybody is so much faster than they are, and that's so not true. It just takes so much time.

So if I open up my commit history and find some representative day, like November 5th: I usually start my day around 9:20, 9:30 a.m., and I code until maybe 11:15, so usually around two hours every morning. Then I take a short break, take a stroll, get something to eat, and I start working at Microsoft, where I also work full-time. So it's a really tight schedule. And after I finish my daily work at Microsoft
where, and this is the good thing about Microsoft and it's really awesome, it's truly a meritocracy: my manager doesn't care if I'm working ten hours or six hours, he cares that I get my work done, and that's super rewarding. I'll get fired because of this video. I'm joking. I do work around eight hours, but the good thing is they value what you do and how productive you are. How long you sit in the chair doesn't matter; you get it done and that's it.

So after I finish my daily work I usually take a power nap, and depending on how tired I am I may take some more time to rest. Then usually around 9 p.m., 8:30, whatever, I get to coding again, and I work until maybe 11 p.m., 11:30. So that's again two or three hours of quality coding. That accumulates: it's four or five hours every single work day, and then I have weekends, where I'm even more productive because I don't have to work at Microsoft.

At the end of the day during work days I usually take some time to chill. I either watch South Park just to relax my brain, or recently, over the last week, I've been watching Lex Fridman's podcast. There are just so many interesting people there, and it also relaxes me even though I'm learning by watching that podcast as well, so that's cool.

Now, if somebody had told me about this schedule even two years ago, I would have told that person "you're nuts". Depending on your mindset, your phase of life and your situation this may not sound feasible, but that's what I thought two years ago, and then things change, and it's totally fine. So don't feel any pressure for not working as hard as somebody else, because
everybody's got their own pace, and that's super fine. It took me 12 years to acquire the working habits I have right now. Initially I was just learning different stuff on my own and structuring my journeys, but they were not related to programming whatsoever. I was learning human languages, which I'm really passionate about; that was one of my main hobbies when I was in high school and at the beginning of university. I was training a bunch of sports, I tried so many sports out there. So I was just exploring, and only at 19 did I get exposed to programming, and that's totally fine. You can still achieve things even when you start a bit later, like when you're 19.

One more thing, and I know it's kind of a cliché, but life's not fair. Bill Gates back in the 60s had a computer when he was 12 or 13, and he started programming when he was 12. And I'll tell you something: it still doesn't matter. You can start when you're 19, 20, 25, 30 and it's still not too late. If you keep up the good work and stay consistent in whatever you do, you'll get better and you'll get where you want to be, and that's it. I mean, chillax.

Lastly, there is one more thing I want to mention, and that's failures. Look at the successful people around you: they all have failures, and if they're honest enough they'll tell you about them. I'll link this guy, he's a Stanford PhD, and he was open and honest enough to share his journey, and that's pretty much how everybody's journey looks. You get a lot of failures, in his case paper rejections, but then you have a couple of successes, and everybody knows you for your successes and that's it. Me personally, I had a bunch of failures: I was rejected by Facebook, I was rejected by NVIDIA, I
was rejected by Microsoft, and then on my second attempt I got accepted into Microsoft. But even if it had been the third or fourth attempt, it doesn't matter. As long as you keep learning and keep applying, you're set up for success. If you're just applying, then you're just spamming people. You've got to keep on learning and applying, and wherever you want to be, you'll get there. Sometimes when you get rejected it doesn't have to mean that you're bad. Hiring pipelines in big tech companies are not perfect; they're far from perfect.

Enough motivational speaking. I just wanted to share my personal story and some details. There are already so many high-quality educational resources on YouTube and too few personal stories, so I felt like sharing some details from my life and my workflow, and I hope you found this helpful. If you did, please leave a comment in the comment section, because I'll start creating more of these if you found it useful.

Okay, all of that being said, let me explain how I think about approaching a new problem, a new project. A good thing about the transformer project, and about many software projects in general, is that they can be modularized. What I mean by that is you can develop certain components orthogonally, i.e. independently, and not care about the others. This project was pretty much split into two parts. The first one takes care of training the model, and then you have the second part where, once you've trained the model, you want to do translations and some decoding with those models. Once you split the functionality like that, you can focus pretty much only on the training side, and I had only three files, let's say three main files,
which I had to develop in order to get the transformer model trained. The first one is obviously the transformer model .py file, which contains the architecture, the definition of the model itself. And the good thing, when I said it's orthogonal, is that I basically just developed this model and then created a small main function inside this Python file where I could test whether the model was working before moving ahead and developing the other parts of the project. It's never linear like that; you usually jump a little bit here, a little bit there. But let's say I first developed this model and then went and did the other things. I found some bugs doing this, which is a good thing, better to catch them sooner than later, and I'll get into the problems a bit later.

Now let me just give you an overview of the project. The second important thing is data: you want to load the data. I should have mentioned that the transformer model is trained for the machine translation task, so basically we want to translate from one language, which is called the source language, into another language, which is called the target language. The languages I used were English and German: English to German, or German to English, vice versa.

I had lots of problems with loading data in PyTorch. First of all because I'm used to computer vision projects and this was natural language processing, so it was a little bit of a paradigm shift. And second of all, it's nowhere near as good as torchvision is for computer vision. I stumbled across two significant problems. The first one was functionality-wise: the thing didn't work as I expected. Something called bucket iterators, which were supposed to batch the data in an optimal way, weren't working as I was expecting them to, and
the second problem is the performance problem. The tokenization procedure and everything took a lot of time. Every time I wanted to run the training loop it took like 45 seconds, and that's so annoying. I think I went from 66 seconds to about two and a half seconds, which was a massive optimization jump, and that helped me train much faster. That's super important in deep learning and machine learning in general: you want to have a really short iteration cycle, because that helps you build stuff better and faster. I'll tell you about some of the problems I had developing the data pipeline, but first, as I mentioned, let me quickly go through the other files.

Then we have the training loop, and the training loop is pretty simple once you take a look. Let me show you. I prepare the model, I prepare the data, then I prepare some things like label smoothing (I won't get into the details of what that is) and the optimizer, and then the loop is really simple. For the number of epochs I used 20, because just looking at the Attention Is All You Need paper and calculating, you can see they had approximately 19 epochs on the WMT-14 dataset, so that's what I did as well, and I didn't run into any saturation up to 20 epochs; it was totally fine. In each iteration I do the training loop and then the validation loop, because you want to see how the model generalizes to data it hasn't seen during training.

I'm already getting into too much detail. Basically, once I had set up the training loop I could run the whole thing and test it, and then what I did is quickly write a translation script to test whether the model was doing what it's supposed to be doing, and that's
translating from language to language. That's super important: you want to have an end-to-end system as soon as possible, and then you can keep on iterating and improving either the model or the data loading. That's how I did it. I first had the whole thing working end-to-end, and only then did I do those optimizations.

I also had to develop decoding algorithms. I developed a simple one called greedy decoding: once you get the output distribution for the next token, you figure out where the highest probability is, and that position corresponds to a certain token in your vocabulary. That's how you find the next token in your output, or target, sentence. Greedy is the simplest method. What they used in the paper was actually something called beam search, where you keep a number of hypotheses running in parallel, and at the end you're left with the hypothesis that's most probable, and you output that one as the output sentence.

Finally, I also have something called the playground file. I do that not only because I'm open-sourcing this and treating it as a learning resource for others; it's also a really good thing for me, because it helped me understand some things a lot better. I'm a highly visual person, so I like to visualize stuff. For example, if you're familiar with the transformer model, there is something called positional encodings, which you add to the token embedding vectors. The formula is kind of complicated, but once you visualize it, it's not as complicated. Here is how it looks: you take one row from this image, that's a vector of numbers, and you add that to the token at that particular position. So row
zero would be added to token representations at position zero, row one would go to token representations at position one, etc.

So that was the high-level overview of how I think about the project when I'm developing it. I like to separate functionality, I like to get an end-to-end system working as soon as possible, and then I go into depth. That's the same way I learn theory: I always start high level, first create a skeleton and then start filling in the gaps. I use pretty much the same strategy when I develop software projects.

Now I want to tell you about a couple of problems I struggled with. One of them came up when I was developing the transformer model itself. It's always a good thing to print out the layers you have in the model, to print out the shapes, and in general to see how many parameters the model has. Doing that, I figured out that I was referencing the same multi-head attention object in memory in multiple encoder layers. It was simply a bug, which I discovered because I printed out the layer names and saw there were some multi-head attention layers missing.

Also, the paper explained that they had the base model and the big model, so if I change this to true here to select the big one and run the script, I'm printing out the number of parameters, and you can see here 176 million parameters. That's also a good thing to do, because I knew that the big model had 175 million parameters, so this kind of confirmed that I have the same number of parameters as the paper stated. Those are just some of the things you want to do, because you want to make sure that something is correct. It's kind of a test, pretty
much. So the second fun experience I had was while developing the beam search functionality, and I actually still haven't finished it. Greedy decoding I have really tested, but beam search is still not working. While I was investigating how other people did it, I found in the tensor2tensor library a line like this: length penalty equals tf.pow of five plus something, divided by six, raised to the alpha power. And I was like, what the... So the thing is, I beg you, if you ever write code in a public open-source library, please leave a comment. It will make life so much easier. I went through that code and some other people's code, like OpenNMT, and I figured out there is a paper which introduced this esoteric heuristic, and it kind of just works. You can see it here: five plus blah blah, raised to the power of alpha. I just wanted to give you a glimpse of the things you encounter while you're working on these projects. It's super interesting, super funny stuff.

Okay, this video is already getting a bit too long. Let me show you one more interesting problem I stumbled upon while working on the transformer project. It has to do with torchtext and loading data. There is a class in torchtext called BucketIterator, which is something similar to DataLoader, if you're familiar with computer vision and the torchvision library. What happened is that the documentation said something totally different from the results I was getting. What the BucketIterator documentation said is: "minimizes amount of padding needed while producing freshly shuffled batches for each new epoch". And that's not true. I had to dig so deep into the
source code, and finally I saw there was already an issue pending, so I just commented and said yeah, that guy was right, he stumbled upon the same bug as I did. Basically you have to set sort_within_batch to true, otherwise you won't get the padding minimized. Once you do that, you get the functionality you actually wanted.

Then there was a second thing. Certain sentences are much longer than others, so if you set the batch size to a fixed number, for example always load eight sentences, sometimes the batch will be huge and sometimes it will be really small. That's a bad thing, because you want to maximize the amount of VRAM you're using on your GPU. So I had to write this custom function which makes sure that I always have approximately the same number of tokens in a batch. The Annotated Transformer had something similar implemented, although they didn't comment on why it was there, so I struggled a lot until I actually understood the source code, and then everything made sense. So now, hopefully, if you go through my code it will be a bit easier to understand what is happening here.

Just one more quick thing: I had to add this custom dataset, because with the spaCy tokenizers it took me 66 seconds, 70 seconds, whatever, to re-tokenize the source and target sentences every single time. So I decided: why not tokenize once, save the result as a simple .txt file, and then, instead of re-tokenizing everything again, just load the .txt file? That now takes about two and a half seconds. In a nutshell, that's what I did with this dataset wrapper and this fast translation dataset wrapper class. So that was a short story about
some problems I encountered. Now, for the end, let me show you how the thing functions and how the translation script actually works. I think it's really interesting, and kind of magical for me, because I was usually doing computer vision things, and once you see this thing actually working and translating from English to German and vice versa, it's pretty fascinating if you ask me.

You basically just have to input the sentence. I have an English sentence, "How are you doing today?", set as the source sentence, and we have the model that was trained on the IWSLT dataset and translates from English to German. The name is pretty indicative of what the model is doing and what it was trained on. You just have to keep these two in sync with the model: if you change the model to G2E, German to English, you would just have to set the enum here to G2E instead of E2G, and that's it.

Let me run it. It takes a little bit of time, about two and a half seconds, to load the vocabularies, and then it'll just translate. Why is it not working? Okay, here it is. It took 2.6 seconds to load the data, and then you can see the source sentence tokenized, and you see the output sentence. We get "Wie geht es Ihnen heute", which means pretty much the same thing, although "Ihnen" is the polite German way to say "you".

There's nothing stopping me from translating from German to English. Let me change this to German to English, and change the model to German to English, and now if I input, for example, "Wie geht es dir", which translates as "How are you" in English, let's see how the model translates it this time. It translated it as "how are you", which is pretty much something called the gold translation, which means the ground truth
translation. This looks pretty decent. The model is not perfect, though; you can find sentences which confuse it, even a simple sentence like "Ich bin ein Berliner", which I think Kennedy said when he was in Berlin back in the day. It means "I'm a Berliner", I'm a person living in Berlin. If I input this and run it, the thing is that the model doesn't have the token "Berliner" in its vocabulary, so that's why it will say "Berlin" instead. I mean, I'm anthropomorphizing the model here. So it outputs "I am a Berlin". If the model had "Berliner", it would output "Berliner". That's another point I want to make: if I had something called byte pair encoding, the model would be much more expressive in creating these words than it is at this point. But that's future work, and I'll add it hopefully next week or something.

There are many other things I could show you, but for the sake of this video not being one hour long, let me just show you what hardware you need in order to run translations. Not to train the model: you would need a lot of GPUs to train the model. I have some comments about that in the README, please go ahead and read that part. But just for translations, let me open the task manager, and if I run the script now we can follow how much GPU is needed to get the translation. It's basically only a small spike, so you won't need a really fancy GPU if you're just running translations. On the other hand, if you're training this model you'll need a lot of power.

That pretty much wraps it up. I hope you liked this video; there was a lot of information in here. My next step is to start reading research papers again. I want to explore what's
state-of-the-art in natural language processing now, in 2020. I'll be reading maybe one paper in the morning and one paper in the evening, which amounts to at least 10 papers a week, and that's good enough to get me familiar with the field in a period of 10 days to two weeks.

So yeah, that was pretty much it. If you liked this video, go ahead and subscribe to this channel and hit the bell icon to get notified when I upload a new video. Until next time, keep learning.
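
The positional-encoding step described in the transcript ("row zero would be added to token representations at position zero") can be sketched in a few lines. This is a minimal illustration in plain Python using the sinusoidal formula from the Attention Is All You Need paper, not the code from the repository; the function names and the tiny sizes are assumptions chosen for readability.

```python
import math


def positional_encodings(max_len, d_model):
    """Build a max_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
            freq = 1.0 / (10000 ** ((i - i % 2) / d_model))
            angle = pos * freq
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(token_embeddings, table):
    """Add row p of the table to the embedding of the token at position p."""
    return [
        [e + p for e, p in zip(embedding, table[pos])]
        for pos, embedding in enumerate(token_embeddings)
    ]


# Hypothetical toy sizes: 4 positions, 6-dimensional embeddings.
table = positional_encodings(max_len=4, d_model=6)
embeddings = [[0.0] * 6 for _ in range(3)]  # three dummy token embeddings
encoded = add_positions(embeddings, table)
```

Each row of `table` is one row of the image the transcript refers to; plotting the table as a heat map gives the kind of visualization mentioned there.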

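The length-penalty line quoted in the beam search part of the transcript ("five plus something divided by six, raised to the alpha power") matches the heuristic from Google's neural machine translation paper (Wu et al., 2016), lp(Y) = ((5 + |Y|) / 6)^alpha. The sketch below assumes that is the formula meant. It is not the tensor2tensor or repository code: the function names and the toy hypotheses are hypothetical, and alpha = 0.6 is the value reported in the Attention Is All You Need paper.

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def pick_best(hypotheses, alpha=0.6):
    """Return the hypothesis with the highest length-normalized log-probability.

    Each hypothesis is a (tokens, log_prob) pair. Dividing by the penalty
    stops beam search from always preferring shorter outputs.
    """
    return max(
        hypotheses,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
    )


# Toy finished hypotheses with made-up log-probabilities.
hypotheses = [
    (["wie", "geht", "es", "dir"], -1.8),
    (["hallo"], -1.5),
]
best = pick_best(hypotheses)
```

With these made-up scores the shorter hypothesis has the higher raw log-probability, but after normalization the four-token hypothesis is selected, which is the effect the heuristic is meant to have.
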
Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany

In this video, I talk about how I work on my deep learning projects on the example of the transformer which I've recently open-sourced.

You'll learn about:
✔️ Snippets of my personal story
✔️ How I think about approaching a new project
✔️ Problems I encountered developing the transformer

✅ My transformer implementation: https://github.com/gordicaleksa/pytorch-original-transformer
✅ The Annotated Transformer blog: http://nlp.seas.harvard.edu/2018/04/03/attention.html
✅ Original paper: https://arxiv.org/abs/1706.03762

⌚️ Timetable:
0:00 Open-sourcing original transformer and why
3:00 Creating great projects takes time
5:40 My time management
9:00 My story and embracing failures
12:05 Overview of the transformer project
14:03 Data and task definition
15:57 Training loop
20:10 Problems I encountered
21:44 Beam search fun
23:10 BucketIterator fun
25:35 Optimizing things to speed up the loop
26:25 Translating from English to German and vice versa
30:00 Hardware requirements
30:47 Wrapping things up

💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️

💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".

👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/



Aleksa Gordić - The AI Epiphany

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

Aleksa Gordić - The AI Epiphany

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Aleksa Gordić - The AI Epiphany

Attention Is All You Need (Transformer) | Paper Explained

Attention Is All You Need (Transformer) | Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Attention Networks (GAT) | GNN Paper Explained

Graph Attention Networks (GAT) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Convolutional Networks (GCN) | GNN Paper Explained

Graph Convolutional Networks (GCN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

Aleksa Gordić - The AI Epiphany

OpenAI CLIP - Connecting Text and Images | Paper Explained

OpenAI CLIP - Connecting Text and Images | Paper Explained

Aleksa Gordić - The AI Epiphany

Temporal Graph Networks (TGN) | GNN Paper Explained

Temporal Graph Networks (TGN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Aleksa Gordić - The AI Epiphany

Graph Attention Network Project Walkthrough

Graph Attention Network Project Walkthrough

Aleksa Gordić - The AI Epiphany

How to get started with Graph ML? (Blog walkthrough)

How to get started with Graph ML? (Blog walkthrough)

Aleksa Gordić - The AI Epiphany

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

Aleksa Gordić - The AI Epiphany

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

Aleksa Gordić - The AI Epiphany

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

Aleksa Gordić - The AI Epiphany

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

Aleksa Gordić - The AI Epiphany

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

Aleksa Gordić - The AI Epiphany

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

Aleksa Gordić - The AI Epiphany

Implementing DeepMind's DQN from scratch! | Project Update

Implementing DeepMind's DQN from scratch! | Project Update

Aleksa Gordić - The AI Epiphany

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

Aleksa Gordić - The AI Epiphany

DeepMind's Android RL Environment - AndroidEnv

DeepMind's Android RL Environment - AndroidEnv

Aleksa Gordić - The AI Epiphany

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

Aleksa Gordić - The AI Epiphany

Non-Parametric Transformers | Paper explained

Non-Parametric Transformers | Paper explained

Aleksa Gordić - The AI Epiphany

Chip Placement with Deep Reinforcement Learning | Paper Explained

Chip Placement with Deep Reinforcement Learning | Paper Explained

Aleksa Gordić - The AI Epiphany

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Aleksa Gordić - The AI Epiphany

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Aleksa Gordić - The AI Epiphany

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

Aleksa Gordić - The AI Epiphany

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

Aleksa Gordić - The AI Epiphany

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

Aleksa Gordić - The AI Epiphany

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Aleksa Gordić - The AI Epiphany

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

Aleksa Gordić - The AI Epiphany

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

Aleksa Gordić - The AI Epiphany

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

Aleksa Gordić - The AI Epiphany

DETR: End-to-End Object Detection with Transformers | Paper Explained

DETR: End-to-End Object Detection with Transformers | Paper Explained

Aleksa Gordić - The AI Epiphany

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

Aleksa Gordić - The AI Epiphany

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

Aleksa Gordić - The AI Epiphany

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Aleksa Gordić - The AI Epiphany

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Aleksa Gordić - The AI Epiphany

The video shows how to develop a deep learning project around a transformer model, covering implementation, training, and optimization. It also shares the creator's personal story and experiences, emphasizing the importance of learning, applying, and being consistent for success. The project uses PyTorch, torchtext, and other tools for machine translation.

Key Takeaways

Develop the transformer model architecture
Create a main function to test the model
Optimize the tokenization procedure
Develop a data pipeline for loading and batching data
Configure the training loop
Run the training loop
Configure the validation loop
Write a custom function to keep approximately the same number of tokens in each batch
Use a custom dataset to speed up tokenization
Translate English to German and vice versa using a trained model
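
The step about keeping roughly the same number of tokens in each batch can be sketched as follows. This is a minimal illustration in plain Python, not the code from the repository: the function name `batch_by_token_count` and the `max_tokens` parameter are hypothetical, and the sketch assumes that the cost of a batch is the number of sentences times the length of its longest sentence (that is, the size after padding).

```python
from typing import List, Sequence


def batch_by_token_count(
    examples: Sequence[Sequence[str]], max_tokens: int
) -> List[List[Sequence[str]]]:
    """Group tokenized sentences so each padded batch holds at most max_tokens.

    Hypothetical helper for illustration; a sentence longer than max_tokens
    still gets a batch of its own rather than being dropped.
    """
    batches: List[List[Sequence[str]]] = []
    current: List[Sequence[str]] = []
    longest = 0
    # Sorting by length keeps similar-length sentences together,
    # which reduces the amount of padding per batch.
    for example in sorted(examples, key=len):
        new_longest = max(longest, len(example))
        # Padded size of the batch if this example were added.
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], len(example)
        current.append(example)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


sentences = [["tok"] * n for n in (5, 1, 8, 2, 3, 2)]
batches = batch_by_token_count(sentences, max_tokens=8)
batch_lengths = [[len(s) for s in batch] for batch in batches]
print(batch_lengths)
```

Short sentences end up in larger batches and long sentences in smaller ones, so every training step processes a similar amount of data regardless of sentence length.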

💡 Techniques used to get the transformer working well on machine translation include label smoothing during training and greedy or beam-search decoding at inference time.
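
To make label smoothing concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions rather than the project's implementation: the helper names `smoothed_target` and `cross_entropy` are hypothetical, and the sketch spreads the smoothing mass evenly over all non-target tokens (other variants also exclude the padding token). The default of 0.1 matches the value reported in the original paper.

```python
import math
from typing import List


def smoothed_target(vocab_size: int, target_index: int, smoothing: float = 0.1) -> List[float]:
    """Return a target distribution with 1 - smoothing on the true token.

    The remaining probability mass is shared equally by all other tokens
    (an assumption of this sketch).
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    off_value = smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_index] = 1.0 - smoothing
    return dist


def cross_entropy(target_dist: List[float], predicted_probs: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)


target = smoothed_target(vocab_size=5, target_index=2, smoothing=0.1)

# An over-confident prediction is penalized more than a calibrated one
# once the target is smoothed.
overconfident = [0.0001, 0.0001, 0.9996, 0.0001, 0.0001]
calibrated = [0.025, 0.025, 0.9, 0.025, 0.025]
loss_overconfident = cross_entropy(target, overconfident)
loss_calibrated = cross_entropy(target, calibrated)
```

With a one-hot target the over-confident prediction would have the lower loss; with the smoothed target the calibrated prediction wins, which is the regularizing effect label smoothing is meant to have.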
