Sequence Models with Pujaa Rajan

Weights & Biases · Beginner ·🏭 MLOps & LLMOps ·6y ago

Key Takeaways

The video discusses sequence models, including RNNs, GRUs, LSTMs, and BiRNNs, with a focus on their applications and limitations. Pujaa Rajan, a deep learning engineer, presents the concepts and techniques used in sequence models, including named entity recognition and model interpretability.

Full Transcript

Hello everyone, welcome to my technical talk today about sequence models. Here's a preview of what's to come. First, we'll talk about how a recurrent neural network, also known as an RNN, works. Second, we'll zoom in (saying "zoom" feels a little bit awkward, but we'll zoom in) on the units of RNNs, LSTMs, and GRUs. Last, we'll look at more advanced RNNs like the bidirectional RNN and deep RNN.

So, I'm feeling funny, so let's start with a joke explaining why sequence models are useful. A human asks, "What do we want?" and a computer answers, "Natural language processing!" I don't know about you, but I don't think natural language processing is in the top three things I want right now, maybe not even my top ten, but anyhow, let's go with it. The human next asks, "When do we want it?" This is obviously a trick question, because humans, even with our relatively long attention spans, want everything immediately. By this point in the conversation, the computer has already forgotten what we were talking about and asks, "What do we want? What, again?" So that's the end of the joke. I wish Zoom had a laugh track or something to play here. The takeaway here is that a traditional neural network model can't follow dialogue or sequences of text well, which is why sequence models were invented.

The basic sequence model is a recurrent neural network, so we'll start with that one. The GRU, LSTM, BiRNN, and deep RNN, all of which we will be learning about next, are extensions of this basic RNN architecture. Let's start with an example. Imagine using an RNN for a named entity recognition problem, like recognizing whether a word in a sentence is a person's name. In this example, the first word in the sentence or dataset would be the input to an RNN. This first word is then passed to a hidden layer. We also pass in an initial vector that's all zeros or randomly initialized to the same hidden layer. Inside the hidden layer, math happens and outputs an activation to pass on to the next hidden layer. We'll zoom into
the math later, when we learn about the recurrent neural unit. Next, multiple parameters are calculated. Then more math happens, and the first word, the input vector of zeros, and the parameters are used to output a prediction. In the example of named entity recognition, the output would be 0 or 1: 0 if the word was not a person's name and 1 if it was. The activation and parameters, which include information about the first word, are then passed into the next hidden layer and are used for predicting whether the second word is a person's name. We repeat this pattern for the third word and, as you can see, a sequence is starting to form. We repeat this pattern for the number of words in the sentence. Alternatively, you can also forcefully stop the sequence model by adding a maximum number of time steps. T can be the number of words or the maximum number of time steps allowed. This is the basic architecture of a recurrent neural network.

Now we will see how backpropagation and forward propagation work. For every prediction, a loss function calculates the loss. The purpose of this is to optimize the parameter values in this sequence. These losses are propagated forward to inform future output predictions. The losses are not only propagated forwards but also propagated backwards, like this. Now, this here is a unit, which is also known as a neuron, in the hidden layer.

All right, you made it to the first Star Wars GIF. I hope you guys are Star Wars fans, because there are more Star Wars GIFs coming up. Next, we'll look at what the math looks like inside one of these recurrent neural units. Here's what the architecture of a unit, or neuron, inside the hidden layer of an RNN looks like. The activation value from the previous time step and the word vector at the current time step are inputs to the activation function, whose result is then mapped to a zero or one output using softmax before being output as the prediction. Here's the math showing how the activation at time step t is calculated before it's passed on to the next hidden layer: g is the tanh activation function, and W is the parameter matrix for the previous activation value and the current input value. Lastly, a bias b is added to the output.

I know that was a lot of technical detail in the last few slides. We're about halfway done. The next half will still be technical, but we'll be talking about higher-level differences in the architecture for different use cases.

Let's start with an example here. Imagine using an RNN for a sentence generation problem, maybe like the classic "write like Shakespeare" one, if you're familiar with that. We need to remember the subject of the sentence to decide whether to generate a plural or singular verb. Next, GRUs, or "grus," are better at this than RNNs. GRUs are used in place of RNN units in RNNs. This makes GRUs not only better at understanding longer-range dependencies, but also solves the RNN's vanishing gradient problem, which I'll talk more about at the end of this video, or at the end of this webinar. The GRU is made up of the memory cell, which memorizes relevant information, and two gates. First, the update gate decides what information should be memorized or forgotten, and second, the reset gate decides how much of the past information to forget. That was the way a GRU works.

The same idea of gates from GRUs is also used in LSTMs, which we'll be learning about next. LSTMs, like GRUs, learn long-term dependencies in a sequence. LSTMs have three gates, so that's one more than a GRU. The LSTM is made up of a memory cell, which memorizes relevant information, and three gates. First, the forget gate decides what information should be memorized or forgotten. Second, the input gate updates the memory cell if the information is relevant. And third, the output gate decides what the next hidden state should be. Using an LSTM, you can develop a neural network that understands more complex sequences of text. It's hard to predict whether a GRU or LSTM will perform better, so it's often best to try both. Coming
up next are bidirectional RNNs, which are bidirectional like this lightsaber here. Bidirectional RNNs let you use information from the beginning and the end of a sequence. Let's think back to the example of named entity recognition. Sometimes you need context from not only before but also after the word to decide whether the word is a person's name. The one-directional RNN only has forward recurrent layers and reads the sentence from left to right. A bidirectional RNN has forward and backward recurrent layers, allowing it to read from left to right and right to left. As a result, this model uses the past, present, and future information when making a prediction. By the way, the hidden layers that you see here can include traditional RNN, GRU, and LSTM units. That was a BiRNN, thinking forwards and backwards.

Now let's talk about how to take all RNNs, LSTMs, and GRUs and construct deep versions of them. The RNN, GRU, and LSTM you've learned about so far already work well as is, but sometimes stacking multiple layers of RNNs together to build deeper versions of these models performs better. Here's the standard RNN that you've seen so far. Then we can just stack more layers on top. Now this is a new network with three hidden layers. By the way, these hidden layers don't need to use the simple RNN units we saw at the beginning of today's talk. They can also use GRU and LSTM units. And if you were wondering, it's possible to build a deep version of the bidirectional RNN too.

Now that you know how RNNs, GRUs, and LSTMs work, let's look at the advantages and disadvantages of these options. The traditional RNN, also known as a vanilla RNN, is a good model to start with because it gives you a good baseline to compare the other models with. However, RNNs often face the vanishing gradients problem. This is when the gradient diminishes dramatically as it's propagated backwards. The error might be so small that it might have little effect by the time it reaches the
layers close to the input of the model, which is why it's aptly named the vanishing gradients problem. The GRU fixes this problem because its gates control the flow of information inside the network more effectively. But there's a trade-off between the speed and power of the network for GRUs and LSTMs. While the LSTM can be better at longer sequences because of its additional gate, it's still slower than the GRU.

Because BiRNNs use information from the past, present, and future, which is a good thing because it gives the model more context, you need access to the whole sequence of data before you can make predictions anywhere, so it's a double-edged sword. This can be inconvenient, for example, when you're building a speech recognition system, since you will have to wait for the person to stop talking before you can make a prediction. It's still a good option for most natural language processing applications where you have access to the whole sentence at once. The deep RNN's hierarchy of hidden layers enables more complex understanding of the data, but it's also more computationally expensive than other options.

So that's all the pros and cons, folks. Congratulations, you've now added the RNN, recurrent neural unit, GRU, LSTM, BiRNN, and deep RNN to your toolbox to use when creating sequence models. I'm looking forward to seeing what you build using these new tools. Thank you all for listening. My name is Pujaa, Pujaa Rajan. I'm a deep learning engineer at Node and the USA ambassador for Women in AI. Follow me on Twitter and check out my website, and feel free to follow up and say hi.

Thanks, Pujaa, that was great. Oh my god. I'm gonna drop your Twitter in the chat just in case people wanna follow you.

Okay, I'll stop sharing then.

Cool. We have some questions, too. Someone asked: how do you think about how many layers to build?

Can you repeat the question?

How do you think about how many layers to build, or to use, in your network?

Yeah, that's a good question. Um,
it's honestly trial and error. I start with the simplest, just because I want the fastest output, right? If you start with something that's unnecessarily, like, a hundred layers deep, it's just taking you longer to iterate. So I usually start with one or two and kind of go up from there. Just a sidebar, though: I also look to see whether someone else has created something similar online, because you can learn a lot from other people's experiments, depending on what type of model you're trying to build. At that point, you don't always want to rerun something that someone's already done. So if you know that you'll need at least ten layers for your particular problem, then you can start there based on previous research.

That's a great suggestion. So, someone asked: can you share some links where you can begin to learn about NLP, so blogs or anything else? Also, I'm gonna drop this. This doc, of course, is by Andrej Karpathy, and he talks about how to start from a really small model and build it out, so then add layers and stuff. I highly recommend reading this. Definitely check it out, guys. And sorry, the question was: what are your favorite resources to learn about NLP?

Mmm, okay. Ooh, a good one is this blog called ruder.io, so R-U-D-E-R dot I-O. It's not my blog, I wish it was, but it's Sebastian Ruder's blog. He's an NLP scientist and he has some really good writing on natural language processing. But just to get started, I recommend checking out an NLP video; there's definitely one by Andrew Ng on sequence models. I really recommend that one. It should be on Coursera.

Okay. Charles, if you can drop that link for us, that would be great. Thanks. So someone else asked: when would you implement a model from scratch? Would you ever do that?

Yeah, again, it depends on your use case. My philosophy is usually you can use something that already exists to get you
maybe like 80 to 90 percent of the way. It's definitely a lot cheaper time-wise because you can get it going much faster. But let's say you need to add some customization to it, let's say you're working in a really novel space where it's a really specific problem that a generalized model won't work for. Or another example that I faced more recently: I was trying to add multiple interpretability capabilities, so that's like integrated gradients (check out PyTorch's Captum if you haven't heard of it). Basically, it's difficult to make your models interpretable if you're using someone else's architecture and you can't really add in the extra math of extra explainability features.

So actually, this is a good time to tell you guys: Pujaa is actually gonna do a whole series. This is her first time, but she's gonna give two more talks at the next two salons about machine learning explainability. So if you're interested in that, you should definitely come to the next one. Trying to see if there are more questions. What loss functions are effective for these kinds of networks?

The loss function also kind of depends on your use case. It's a bit tough to say. A classic starter one, if you're not familiar with what to use at all... I guess it really depends on your use case, but maybe cross-entropy is a good one for some things. It really depends on what you're trying to penalize and how much you're trying to penalize it. So that's a really broad question. I'm not sure. Lavanya, do you know any resources off the top of your head for just comparing and contrasting loss functions?

Yes, I will drop it in, and add it to the notes. This guy does the deepest dives on these basic parts of your networks that most people don't pay attention to. I dropped it. I think those are all the questions. We'll see you again next week, I guess. I'm super excited.

Yeah, thanks for having me here, guys.
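
As a rough illustration of the GRU described in the talk (a memory cell plus an update gate and a reset gate), here is a minimal pure-Python sketch of a single GRU time step. The talk does not show its equations, so this assumes the commonly used formulation; the names `gru_step`, `affine`, and `zero_params`, the parameter dictionary, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def gru_step(p, h_prev, x_t):
    """One GRU time step (assumed standard formulation, not taken from the talk)."""
    stacked = h_prev + x_t  # list concatenation: [h_prev, x_t]
    # Update gate: how much of the memory to overwrite with the new candidate.
    z = [sigmoid(v) for v in affine(p["W_z"], p["b_z"], stacked)]
    # Reset gate: how much past information the candidate is allowed to see.
    r = [sigmoid(v) for v in affine(p["W_r"], p["b_r"], stacked)]
    reset_h = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    candidate = [math.tanh(v) for v in affine(p["W_h"], p["b_h"], reset_h + x_t)]
    # New memory: blend of the old memory and the candidate, weighted by z.
    return [(1.0 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, candidate)]


def zero_params(hidden_size, input_size):
    """All-zero parameters, used here only to make the gate behavior easy to see."""
    def zeros():
        return [[0.0] * (hidden_size + input_size) for _ in range(hidden_size)]
    return {
        "W_z": zeros(), "b_z": [0.0] * hidden_size,
        "W_r": zeros(), "b_r": [0.0] * hidden_size,
        "W_h": zeros(), "b_h": [0.0] * hidden_size,
    }


params = zero_params(hidden_size=2, input_size=2)
h_prev = [0.4, -0.2]
x_t = [1.0, 0.0]

# With zero weights the update gate is 0.5 and the candidate is 0,
# so half of the old memory is kept.
h_half = gru_step(params, h_prev, x_t)

# A strongly negative update-gate bias closes the gate: the memory is kept as is.
params["b_z"] = [-50.0, -50.0]
h_kept = gru_step(params, h_prev, x_t)
print(h_half, h_kept)
```

When the update gate stays close to zero, the memory is copied almost unchanged from one time step to the next, which is one way to see why gated units hold on to long-range information better than a plain RNN unit.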

Original Description

Pujaa Rajan's presentation is on sequence models: RNNs, GRUs, LSTMs, BiRNNs, and Deep RNNs. Come to learn about sequence models and stay for the Star Wars gifs. Pujaa is a deep learning engineer at http://Node.io, and the USA Ambassador at WomeinAI.co. She previously worked at BlackRock and graduated from Cornell. 👩🏼‍🚀Weights and Biases: We’re always free for academics and open source projects. Email carey@wandb.com with any questions or feature suggestions. - Blog: https://www.wandb.com/articles - Gallery: See what you can create with W&B - https://wandb.ai/fully-connected - Continue the conversation on our slack community - http://wandb.me/fs

This video teaches the basics of sequence models, including RNNs, GRUs, and LSTMs, and their applications in NLP tasks. It also covers model interpretability and provides resources for further learning. By watching this video, viewers can gain a solid understanding of sequence models and how to apply them to real-world problems.

Key Takeaways
  1. Pass the first word in the sentence to a hidden layer
  2. Calculate multiple parameters inside the hidden layer
  3. Output a prediction using the activation function and parameters
  4. Repeat the pattern for the number of words in the sentence
  5. Forcefully stop the sequence model by adding a maximum number of time steps
  6. Use GRU and LSTM units to mitigate the vanishing gradient problem
💡 Sequence models such as RNNs are useful for sequence-based tasks, but vanilla RNNs can suffer from vanishing gradients. Gated units such as GRUs and LSTMs help mitigate this issue.
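
The steps above can be sketched in a few lines of pure Python. This is only an illustrative sketch under stated assumptions: the activation follows the form described in the talk, a_t = tanh(W [a_prev, x_t] + b), while the output uses a sigmoid (the two-class equivalent of the softmax mentioned in the talk) to produce a 0-or-1 "is this word a name?" prediction. The function names, the toy word vectors, and the random, untrained parameters are all hypothetical.

```python
import math
import random


def rnn_step(W, b, w_y, b_y, a_prev, x_t):
    """One time step of a vanilla RNN unit.

    a_t = tanh(W [a_prev, x_t] + b)   activation passed on to the next step
    y_t = sigmoid(w_y . a_t + b_y)    probability that the word is a name
    """
    stacked = a_prev + x_t  # list concatenation: [a_prev, x_t]
    a_t = [math.tanh(sum(w * v for w, v in zip(row, stacked)) + b_i)
           for row, b_i in zip(W, b)]
    logit = sum(w * a for w, a in zip(w_y, a_t)) + b_y
    y_t = 1.0 / (1.0 + math.exp(-logit))
    return a_t, y_t


def rnn_forward(params, word_vectors, max_steps=None):
    """Run the unit over a sentence, optionally stopping after max_steps words."""
    W, b, w_y, b_y = params
    a = [0.0] * len(b)  # initial activation: a vector of zeros
    if max_steps is not None:
        word_vectors = word_vectors[:max_steps]
    predictions = []
    for x_t in word_vectors:
        a, y_t = rnn_step(W, b, w_y, b_y, a, x_t)
        predictions.append(y_t)
    return predictions, a


def init_params(hidden_size, input_size, seed=0):
    """Small random parameters (untrained, for illustration only)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
         for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    w_y = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
    return W, b, w_y, 0.0


# Toy 2-dimensional "word vectors" for a three-word sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
params = init_params(hidden_size=3, input_size=2)
predictions, final_activation = rnn_forward(params, sentence)
labels = [1 if p >= 0.5 else 0 for p in predictions]  # arbitrary until trained
print(labels)
```

Because the parameters are untrained, the printed labels carry no meaning; the point is only the data flow, where each step reuses the previous activation and the same parameters.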
