Sequence Models with Pujaa Rajan

Weights & Biases · Beginner · 🏭 MLOps & LLMOps · 6y ago

Skills: ML Maths Basics 80% · Supervised Learning 70% · Unsupervised Learning 60% · ML Pipelines 50%

Key Takeaways

The video discusses sequence models, including RNNs, GRUs, LSTMs, and BiRNNs, with a focus on their applications and limitations. Pujaa Rajan, a deep learning engineer, presents the concepts and techniques used in sequence models, including named entity recognition and model interpretability.

Full Transcript

Hello everyone, welcome to my technical talk today about sequence models. Here's a preview of what's to come. First, we'll talk about how a recurrent neural network, also known as an RNN, works. Second, we'll zoom in (saying "zoom" on Zoom feels a little bit awkward, but we'll zoom in) on the units of RNNs, LSTMs, and GRUs. Last, we'll look at more advanced RNNs like the bidirectional RNN and the deep RNN.

I'm feeling funny, so let's start with a joke explaining why sequence models are useful. A human asks, "What do we want?" and a computer answers, "Natural language processing!" I don't know about you, but I don't think natural language processing is in the top three things I want right now, maybe not even my top ten, but anyhow, let's go with it. The human next asks, "When do we want it?" This is obviously a trick question, because humans, even with our relatively long attention spans, want everything immediately. By this point in the conversation, the computer has already forgotten what we were talking about and asks, "What do we want? What, again?" So that's the end of the joke. I wish Zoom had a laugh track or something to play here. The takeaway is that a traditional neural network model can't follow dialogue or sequences of text well, which is why sequence models were invented.

The basic sequence model is a recurrent neural network, so we'll start with that one. The GRU, LSTM, BiRNN, and deep RNN, all of which we will be learning about next, are extensions of this basic RNN architecture.

Let's start with an example. Imagine using an RNN for a named entity recognition problem, like recognizing whether a word in a sentence is a person's name. In this example, the first word in the sentence or dataset would be the input to the RNN. This first word is then passed to a hidden layer. We also pass in an initial vector that's all zeros or randomly initialized to the same hidden layer. Inside the hidden layer, math happens and outputs an activation to pass on to the next hidden layer. We'll zoom into
the math later, when we learn about the recurrent neural unit. Next, multiple parameters are calculated. Then more math happens, and the first word, the initial vector of zeros, and the parameters are used to output a prediction. In the example of named entity recognition, the output would be 0 or 1: 0 if the word was not a person's name and 1 if it was. The activation parameters, which include information about the first word, are then passed into the next hidden layer and are used for predicting whether the second word is a person's name. We repeat this pattern for the third word, and as you can see, a sequence is starting to form. We repeat this pattern for the number of words in the sentence. Alternatively, you can also forcefully stop the sequence model by adding a maximum number of time steps. T can be the number of words or the maximum number of time steps allowed. This is the basic architecture of a recurrent neural network.

Now we will see how backpropagation and forward propagation work. For every prediction, a loss function calculates the loss. The purpose of this is to optimize the parameter values in this sequence. These losses are propagated forward to inform future output predictions. The losses are not only propagated forwards but also propagated backwards, like this. Now, this here is a unit, which is also known as a neuron, in the hidden layer.

All right, you made it to the first Star Wars GIF. I hope you guys are Star Wars fans, because there are more Star Wars GIFs coming up.

Next, we'll look at what the math looks like inside one of these recurrent neural units. Here's what the architecture of a unit, or neuron, inside the hidden layer of an RNN looks like. The activation value from the previous time step and the word vector at the current time step are inputs to the activation function, which is then mapped to a zero or one output using softmax before being output as the prediction. Here's the math showing how the activation at time t is calculated before it's passed on to the next hidden layer. g is the tanh activation function, and W is the parameter matrix of the previous activation value and the current input value. Lastly, a bias b is added to the output.

I know that was a lot of technical detail in the last few slides. We're about halfway done. The next half will still be technical, but we'll be talking about higher-level differences in the RNN architecture for different use cases.

Let's start with an example here. Imagine using an RNN for a sentence generation problem, maybe like the classic "write like Shakespeare" one, if you're familiar with that. We need to remember the subject of the sentence to decide whether to generate a plural or singular verb next. GRUs are better at this than RNNs. GRUs are used in place of RNN units in RNNs. This makes GRUs not only better at understanding longer-range dependencies but also solves the RNN's vanishing gradient problem, which I'll talk more about at the end of this webinar. The GRU is made up of a memory cell, which memorizes relevant information, and two gates. First, the update gate decides what information should be memorized or forgotten, and second, the reset gate decides how much of the past information to forget. That was the way a GRU works.

The same idea of gates from GRUs is also used in LSTMs, which we'll be learning about next. LSTMs, like GRUs, learn long-term dependencies in a sequence. LSTMs have three gates, so that's one more than a GRU. The LSTM is made up of a memory cell, which memorizes relevant information, and three gates. First, the forget gate decides what information should be memorized or forgotten. Second, the input gate updates the memory cell if the information is relevant. And third, the output gate decides what the next hidden state should be. Using an LSTM, you can develop a neural network that understands more complex sequences of text. It's hard to predict whether a GRU or an LSTM will perform better, so it's often best to try both. Coming
up next are bidirectional RNNs, which are bidirectional like this lightsaber here. Bidirectional RNNs let you use information from the beginning and the end of a sequence. Let's think back to the example of named entity recognition. Sometimes you need context from not only before but also after the word to decide whether the word is a person's name. The unidirectional RNN only has forward recurrent layers and reads the sentence from left to right. A bidirectional RNN has forward and backward recurrent layers, allowing it to read from left to right and right to left. As a result, this model uses past, present, and future information when making a prediction. By the way, the hidden layers that you see here can include traditional RNN, GRU, and LSTM units. That was a BiRNN, thinking forwards and backwards.

Now let's talk about how to take RNNs, LSTMs, and GRUs and construct deep versions of them. The RNN, GRU, and LSTM you've learned about so far already work well as is, but sometimes stacking multiple layers of RNNs together to build deeper versions of these models performs better. Here's the standard RNN that you've seen so far. Then we can just stack more layers on top. Now this is a neural network with three hidden layers. By the way, these hidden layers don't need to use the simple RNN units we saw at the beginning of today's talk. They can also use GRU and LSTM units. And if you were wondering, it's possible to build a deep version of the bidirectional RNN too.

Now that you know how RNNs, GRUs, and LSTMs work, let's look at the advantages and disadvantages of these options. The traditional RNN, also known as a vanilla RNN, is a good model to start with because it gives you a good baseline to compare the other models with. However, RNNs often face the vanishing gradients problem. This is when the gradient diminishes dramatically as it's propagated backwards. The error might be so small that it has little effect by the time it reaches the
layers close to the input of the model, which is why it's aptly named the vanishing gradients problem. The GRU fixes this problem because its gates control the flow of information inside the network more effectively. But there's a trade-off between the speed and power of the network for GRUs and LSTMs. While the LSTM can be better at longer sequences because of its additional gate, it's still slower than the GRU.

Because BiRNNs use information from the past, present, and future, which is a good thing because it gives the model more context, you need access to the whole sequence of data before you can make predictions anywhere, so it's a double-edged sword. This can be inconvenient, for example, when you're building a speech recognition system, since you will have to wait for the person to stop talking before you can make a prediction. It's still a good option for most natural language processing applications where you have access to the whole sentence at once. A deep RNN's hierarchy of hidden layers enables more complex understanding of the data, but it's also more computationally expensive than other options.

So that's all the pros and cons, folks. Congratulations! You've now added the RNN, recurrent neural unit, GRU, LSTM, BiRNN, and deep RNN to your toolbox to use when creating sequence models. I'm looking forward to seeing what you build using these new tools. Thank you all for listening. My name is Pujaa Rajan. I'm a deep learning engineer at Node and the USA ambassador for Women in AI. Follow me on Twitter, check out my website, and feel free to follow up and say hi.

Thanks, Pujaa, that was great. I'm going to drop your Twitter in the chat just in case people want to follow you.

Okay, I'll stop sharing then.

Cool. We have some questions. Someone asked: how do you think about how many layers to build?

Can you repeat the question?

How do you think about how many layers to build or use in your network?

Yeah, that's a good question.
It's honestly trial and error. I start with the simplest just because I want the fastest output, right? If you start with something that's unnecessarily a hundred layers deep, it just takes you longer to iterate. So I usually start with one or two and go up from there. Just a sidebar, though: I also look to see whether someone else has created something similar online, because you can learn a lot from other people's experiments, depending on what type of model you're trying to build. At that point, you don't always want to rerun something that someone's already done. So if you know that you'll need at least ten layers for your particular problem, then you can start there based on previous research.

That's a great suggestion. Someone else asked: can you share some links where you can begin to learn about NLP, so blogs or anything else? Also, I'm going to drop this. This post is by Andrej Karpathy, and he talks about how to start from a really small model and build it out, so how to add layers and stuff. I highly recommend reading this, so definitely check it out, guys. And sorry, the question was: what are your favorite resources to learn about NLP?

Okay, a good one is this blog called ruder.io, so R-U-D-E-R dot I-O. It's not my blog, I wish it was, but it's Sebastian Ruder's blog. He's an NLP scientist and he has some really good writing on natural language processing. But just to get started, I recommend checking out the video on sequence models; there's definitely one by Andrew Ng. I really recommend that one. It should be on Coursera.

Okay, Charles, if you can drop that link for us, that would be great. Thanks. So someone else asked: when would you implement a model from scratch? Would you ever do that?

Yeah, again, it depends on your use case. My philosophy is usually that you can use something that already exists to get you
maybe 80 to 90 percent of the way. It's definitely a lot cheaper time-wise because you can get it going much faster. But let's say you need to add some customization to it. Let's say you're working in a really novel space where it's a really specific problem that a generalized model won't work for. Or another example that I faced more recently: I was trying to add multiple interpretability capabilities, like integrated gradients. Check out PyTorch's Captum if you haven't heard of it. Basically, it's difficult to make your models interpretable if you're using someone else's architecture and you can't really add in the extra math of the extra explainability features.

So actually, this is a good time to tell you guys that Pujaa is actually going to do a whole series. This is her first talk, but she's going to give two more talks at the next two salons about machine learning explainability. So if you're interested in that, you should definitely come to the next one. I'm trying to see if there are more questions. What loss functions are effective for these kinds of networks?

The loss function also kind of depends on your use case. It's a bit tough to say. For a classic starter one, if you're not familiar with what to use at all, I guess it really depends on your use case, but maybe cross-entropy is a good one for some things. It really depends on what you're trying to penalize and how much you're trying to penalize it. So that's a really broad question. I'm not sure. Lavanya, do you know any resources off the top of your head for comparing and contrasting loss functions?

Yes, I will drop it in [unclear]. This guy does the deepest dives on these basic parts of your networks that most people don't pay attention to. I dropped it. I think those are all the questions. We'll see you again next week, I guess. I'm super excited.

Yeah, thanks for having me here, guys.
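
The activation update that the talk describes for a single recurrent unit can be written out as follows. This is a sketch in one common notation, not a reproduction of the slides: the bracket denotes stacking the previous activation and the current word vector, and the output parameters W_y and b_y are assumed names, since the slides are not part of the transcript.

```latex
\begin{aligned}
a^{\langle t \rangle} &= g\!\left(W_a \left[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}\right] + b_a\right), \qquad g = \tanh \\
\hat{y}^{\langle t \rangle} &= \operatorname{softmax}\!\left(W_y\, a^{\langle t \rangle} + b_y\right)
\end{aligned}
```

Here the initial activation (time step 0) is the all-zeros or randomly initialized vector mentioned in the talk, and the same parameters are reused at every time step from 1 to T.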

Original Description

Pujaa Rajan's presentation is on sequence models: RNNs, GRUs, LSTMs, BiRNNs, and Deep RNNs. Come to learn about sequence models and stay for the Star Wars gifs. Pujaa is a deep learning engineer at http://Node.io, and the USA Ambassador at WomeinAI.co. She previously worked at BlackRock and graduated from Cornell. 👩🏼‍🚀Weights and Biases: We’re always free for academics and open source projects. Email carey@wandb.com with any questions or feature suggestions. - Blog: https://www.wandb.com/articles - Gallery: See what you can create with W&B - https://wandb.ai/fully-connected - Continue the conversation on our slack community - http://wandb.me/fs



This video teaches the basics of sequence models, including RNNs, GRUs, and LSTMs, and their applications in NLP tasks. It also covers model interpretability and provides resources for further learning. By watching this video, viewers can gain a solid understanding of sequence models and how to apply them to real-world problems.

Key Takeaways

Pass the first word in the sentence, along with an initial activation vector (zeros or random), to a hidden layer
Compute an activation inside the hidden layer and pass it on to the next time step
Output a prediction from the activation and the learned parameters
Repeat the pattern for each word in the sentence
Optionally stop the sequence model early by setting a maximum number of time steps
Use GRU or LSTM units to mitigate the vanishing gradient problem
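
The steps above can be sketched as a small forward pass in plain Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the names `rnn_forward`, `W_aa`, `W_ax`, `b_a`, `w_y`, and `b_y` are hypothetical, the weights are random rather than trained, and the word vectors are made up. It uses a sigmoid for the single 0/1 output where the talk mentions softmax; for two classes a softmax over two scores reduces to a sigmoid of their difference.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_forward(word_vectors, W_aa, W_ax, b_a, w_y, b_y, max_steps=None):
    """Run a vanilla RNN over a sentence; return one probability per word.

    Parameter names are illustrative assumptions, not taken from the talk.
    """
    activation = [0.0] * len(b_a)  # initial activation vector: all zeros
    # Optionally stop early after a maximum number of time steps.
    steps = word_vectors if max_steps is None else word_vectors[:max_steps]
    predictions = []
    for x_t in steps:
        # a_t = tanh(W_aa a_{t-1} + W_ax x_t + b_a)
        recurrent = matvec(W_aa, activation)
        from_input = matvec(W_ax, x_t)
        activation = [math.tanh(r + i + b)
                      for r, i, b in zip(recurrent, from_input, b_a)]
        # Probability that the current word is a person's name.
        logit = sum(w * a for w, a in zip(w_y, activation)) + b_y
        predictions.append(sigmoid(logit))
    return predictions


random.seed(0)
hidden, embed = 4, 3  # hypothetical sizes
W_aa = [[random.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
W_ax = [[random.uniform(-0.5, 0.5) for _ in range(embed)] for _ in range(hidden)]
b_a = [0.0] * hidden
w_y = [random.uniform(-0.5, 0.5) for _ in range(hidden)]
b_y = 0.0

# Five made-up word vectors standing in for a five-word sentence.
sentence = [[random.uniform(-1, 1) for _ in range(embed)] for _ in range(5)]
probs = rnn_forward(sentence, W_aa, W_ax, b_a, w_y, b_y)
labels = [1 if p >= 0.5 else 0 for p in probs]  # 1 = person's name, 0 = not
```

Because the activation is carried from one word to the next, each prediction can depend on every earlier word. Passing `max_steps=3` would stop after the third word, matching the "maximum number of time steps" step.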

💡 Sequence models are useful for sequence-based tasks, but vanilla RNNs can suffer from vanishing gradients. GRUs and LSTMs can help mitigate this issue.
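
To make the vanishing gradient concrete, here is a small numerical sketch for a one-unit RNN. It is an assumption-laden toy, not something shown in the talk: the function name `gradient_through_time`, the weight of 0.5, and the constant input of 0.1 are all hypothetical choices. It tracks how much the final activation depends on the initial one, which by the chain rule is a product of one factor per time step.

```python
import math


def gradient_through_time(weight, inputs, initial_activation=0.0):
    """Track d a_t / d a_0 for a one-unit RNN a_t = tanh(weight * a_{t-1} + x_t).

    By the chain rule the gradient is the running product of
    weight * (1 - a_t ** 2) over the time steps seen so far.
    """
    activation = initial_activation
    gradient = 1.0
    history = []
    for x_t in inputs:
        activation = math.tanh(weight * activation + x_t)
        gradient *= weight * (1.0 - activation ** 2)  # this step's local derivative
        history.append(gradient)
    return history


history = gradient_through_time(weight=0.5, inputs=[0.1] * 30)
print(f"after  1 step : {history[0]:.3e}")
print(f"after 10 steps: {history[9]:.3e}")
print(f"after 30 steps: {history[29]:.3e}")
```

With a weight of 0.5, every factor is below 0.5 in magnitude, so the gradient shrinks at least geometrically with sequence length. That is the effect the gates in GRUs and LSTMs are meant to soften.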
