Stochastic RNNs without Teacher-Forcing

Yannic Kilcher · Advanced ·📄 Research Papers Explained ·7y ago

Key Takeaways

Presents a stochastic, non-autoregressive RNN that is trained without teacher forcing

Full Transcript

Hi everybody, my name is Florian, and Yannic was nice enough to host me here as a guest to talk about stochastic RNNs without teacher forcing. This is based on recent work, "Deep State Space Models for Unconditional Word Generation", which we presented at this year's NeurIPS, and if you'd like any more details, please check out the paper.

We focus on a de facto standard training hack for any RNNs that generate text. It's called teacher forcing, and it's used in any model, whether unconditional or conditional, such as a sentence autoencoder or a translation model. To understand where teacher forcing comes from, we first need to understand where text generation comes from, for the good or the bad, and here we will focus on the bad.

Text generation has its roots in language modeling. Language modeling is the problem of predicting the next word given all the previous words. People used to use n-gram models for this, but today people use recurrent neural networks. Such recurrent neural networks, or RNNs, factorize the joint observation probability of a sequence, here denoted w, into independent softmax distributions over individual tokens. So for every time step there's a softmax function, and the softmax is conditioned on a hidden state. All the magic of the RNN goes into the function that gives you the new state given the old hidden state. Usually this is called a transition function F, and as input it gets the last state and the last word. So F could be a GRU function or an LSTM function.

Just like any other language model, you can turn this into a generative model of text. Let's look at the dependencies that you would have at test time. There's an initial hidden state h_1. We sample a new word, use our transition function F, and it gives us the new state h_2. Then we can sample a new word w_2, feed it back, get a new state, sample a new word, feed it back. It's important to note that all the stochasticity in the output is solely due to the stochasticity in the
sampling process, because the transition function is deterministic.

So far there's nothing to complain about, but so far I've only talked about test time. At training time there is a catch: this is where teacher forcing kicks in. It turns out that you can't learn this model by basing the evolution of the hidden states on your own predictions. You have to use teacher forcing, and that means you substitute your own predictions with the ground truth. So at training time there's no sampling loop; you just take the ground-truth token and feed it into your state transition function. That feels unintuitive, because at test time we do something different from what we do at training time. It has also been known in the literature for a few years to cause biases.

So why is that problematic? Remember, we come from language modeling. In language modeling we could argue that if our only goal is to predict one word given the previous words, then of course we can use the ground-truth context, the ground-truth previous words. But if we're interested in generating longer sequences, then we need to learn what to memorize, and in particular we need to become robust against our own predictions, because we might make mistakes at test time and there's no ground truth at test time.

Just to get this confirmed by somebody who has worked in the field for years: at the NeurIPS representation learning workshop, Alex Graves mentioned teacher forcing as one of the big three problems for autoregressive models. In his own words, teacher forcing might lead to predicting one step ahead, not many, and to potentially brittle generation and myopic representations.

How have people addressed teacher forcing so far? There are approaches that try to mitigate the problem, for example by blending together these two views, training time and test time, so that sometimes you use your own prediction during training, but sometimes you use the ground truth. We believe that for a rigorous model of text generation we need a rigorous model of uncertainty. This should be an integral part
of any generative model, and therefore it should be the same model at both training time and test time, without any hacks.

We propose a fundamentally different approach by proposing a new transition function. The new transition function is non-autoregressive: it depends on the last state h_{t-1}, but it doesn't depend on the last word. That means teacher forcing is not an option anymore, but it also means teacher forcing is not a problem anymore. Instead, the transition function accepts a white noise vector as its second input. Now you might wonder, why do we need noise at all as an input to the transition function? Well, for a given prefix there might be different continuations, so we need some source of entropy to model the entropy in the different continuations.

The rest of the paper pretty much focuses on the following two questions. First, which function F is powerful enough to turn the simplest noise source, just a standard Gaussian vector, into something that's powerful enough to replace the autoregressive feedback mechanism of a standard RNN? The second question is of course how do we train this, and what framework do we train it in? It will turn out that variational flows are suitable functions F, and variational inference is the right framework to train them.

So here's the roadmap to complete the model. First, we need to cast the generative model as a probabilistic model, because so far I've only sketched a procedure that involves sampling some noise, then applying some function, and then predicting observations. Then we need to propose a variational inference model so that we can do maximum likelihood training. We will derive an ELBO, which is our objective. In the paper we also describe how the tightness of the ELBO can be improved, and here I will finish by talking a bit about evaluation and what we do to inspect the model.

Since this work is based a lot on variational flows, let me give you a quick summary of variational flows. A variational flow is a diffeomorphism F which maps from what I would call a simple noise space ξ to a complex noise space h, and here I'm already using the notation for a sequence model. Simply by the change of variables formula, we know that the probability of an event h in the complex space is the probability of the event in the simple space ξ, as given by the inverse of F, times a Jacobian term with respect to F evaluated at ξ.

How can we use this in our sequential setting? First let me fix some notation, because sequential models are pretty prone to overloaded notation. I write time as t, running from 1 to capital T, and whenever I talk about a sequence of variables like w, I don't index them; I just write w without an index, and only when I need a specific element do I write it as w_t.

Let's formalize the generative model. We start out with the probability of observing a sequence w, and since we use a latent variable model, we marginalize out the latent variables h. Then we assume that the overall dependencies between hidden states h and observations w follow an HMM-type dependency: the new state only depends on the last state, and the current observation only depends on the current state. Now the question is, how do we model these transitions? I've so far pitched the idea of sampling noise and then using some transition function F, and we've seen flows already. Now we are ready to combine the two. We propose a transition function F_g which has the signature I mentioned before: it gets a hidden state and a noise vector as input and gives you a new state as output. This can be seen as a conditional flow, because any last state h_{t-1} inserted as the first argument into F_g induces a flow which maps from the simple noise distribution to the space of new hidden states. And as I've said before, for the prior distribution in the simple noise space we simply assume a standard Gaussian.

Let's look at this graphically, because in the end this is
a graphical model. I copied over the formulas from the last slide, and at the bottom you see the graphical model. First we have a sequence of stochastic variables ξ. Via the transition function F_g, which is a flow, those deterministically induce a sequence of hidden states, and those independently predict the observations.

All the magic is in the transition, so let me sketch this process here in the big circle. How do we get from the last state h_2 to the new state h_3? Let's say h_2 encodes the prefix and there are two possible continuations that are equally likely in the corpus. So there are two potential new states, the blue state h_3 and the yellow state h_3. I've sketched the standard Gaussian noise distribution at the top; there are yellow samples and there are blue samples. The flow realizes a mapping that takes any yellow sample and maps it to the yellow hidden state, and it maps any blue sample to the blue hidden state. So with probability 1/2 in this situation we get either a blue or a yellow sample from the simple noise distribution, and it will induce the new state, the blue h_3 or the yellow h_3.

So far we have proposed the generative model. Now the question is, how do we train it if we don't know the hidden states? The answer is variational inference, and in particular amortized variational inference. The key idea of variational inference is to introduce a parametrized approximate inference model. How do we propose such a model? A good recipe is to first look at the true posterior, the probability of a state sequence given an observation sequence. The true posterior turns out to factorize into individual components, which give us the probability of a state given the last state and the future observations. It turns out that we can formulate this inference model using two ingredients that should be familiar. First, we use a transition function F_q which induces a flow; it has the same signature as F_g for the generative model. Second, we use a noise source q, but now the noise source isn't uninformative anymore. In
variational inference, the inference network is informed about the data, so there's a base distribution q(ξ_t) which is allowed to look at the data w_t. Now compare this to teacher forcing. In teacher forcing we substitute our own predictions by inserting ground-truth information into the generative model. Here it's very clear how to use the data: the data enters through the inference model. It enters in the form of future observations, because the past observations are what we want to store in the hidden state.

It remains to derive an ELBO, which is the usual evidence lower bound objective used for variational inference. Any ELBO, whether it's in a sequential setting or not, factorizes into two parts: a reconstruction loss and a model mismatch term. Here reconstruction loss means the probability of an observation given a state, and the model mismatch is between the generative model p and the inference model q. This is what is usually written as a KL divergence.

To derive our ELBO, we follow the literature on flows. In a first step we introduce a flow for the inference model, F_q. We turn the expectation with respect to the complex state space h into an expectation with respect to the simple noise distribution; then of course the flow appears inside the expectation, and we get the log-determinant terms that I've mentioned before. In a second step we introduce the generative flow F_g. Using the same change of variables technique, it's possible to write out the ELBO so that it has only one Jacobian term for both flows, and so that the generative flow always appears as the inverse, concatenated with the inference flow. In a second I'll show you what the interpretation of that is.

Let's quickly recap what we've seen so far. There's a generative model; it consists of a generative flow F_g and an uninformed noise source. There's an inference model, which contains an inference flow F_q and a simple base distribution across the noise variables, q(ξ). In the ELBO the two flows appear concatenated, and
we can interpret this in the following way. The inference model q proposes a noise vector ξ_t that is informed about the future. The inference flow maps this to a hidden state. The hidden state is where the reconstruction loss lives; this is where we pay a price for making a bad prediction. However, the inference model cannot encode all the possible information about the future into the hidden state h_t, because the mapping continues to the simple noise space of the generative model, and the inference model must make sure that the proposal also covers significant probability mass under the uninformed prior. This trade-off between reconstruction and model mismatch is common to all ELBOs, but here we highlight a special situation where we have two flows, one for the inference model and one for the generative model. In our paper we also show how we can use the recently proposed importance weighted autoencoder to improve the tightness of our bound, but I'll skip those steps here.

Instead, let's quickly talk about evaluation. We apply our model to unconditional generation. So why in hell would somebody look into unconditional generation? Well, it actually turns out it's harder than conditional generation: if you know what the French sentence looks like, it's much easier to continue a partial English translation. But it's not only harder, it's also more interesting for inspecting which information a sequence model needs to store and which information it cannot forget.

We use two metrics to evaluate our model. First we look at sequence cross-entropy, so we compare the model's sequence distribution to the data sequence distribution. Usually, estimating the data distribution is impossible: you don't want to say that the probability of a sentence is how many times the sentence has appeared in the training data. However, for words, we can use the unigram frequencies of words in a corpus as a pretty reliable estimate. We can also get an estimate of the probability our model assigns to a sequence by using MC sampling. We take the marginal
likelihood, sample K trajectories, and assess the probability that the trajectories assign to the given sequence. Since our model is not autoregressive, the state sequence isn't tied to an observation, so we can actually use the same sequences of hidden states to evaluate probabilities for all the words in the vocabulary.

Since we've pitched our noise model as the key contribution to our generative model, we want to empirically verify that it is being used. Working with a clean probabilistic model allows us to use tools from probability theory to assess that. We use the mutual information between the noise vector at time t and the observation at time t, so this measures how much information in the output is actually due to the noise model.

Before showing you the numbers, let's quickly go over the parametrization of our model. For the flows, we look at shift-and-scale transformations; if the scaling is lower triangular, we can compute the Jacobian determinant efficiently. We also look at Real NVP, and we compose flows by concatenation. The base distribution of our inference model depends on the future observations, which we summarize using a GRU RNN. The base distribution itself is a diagonal Gaussian. We use a state size of 8 and also run some experiments with 16 and 32.

All the numbers are in the paper, so here are just two take-home messages. We are on par with or better than a deterministic RNN trained with teacher forcing at the same state size. We also observe that a powerful generative flow is essential to achieve good performance. Furthermore, we can confirm that the importance-weighted ELBO improves the results.

This is the first model employing generative flows for sequence modeling, so naturally we are interested in comparing the expressiveness of F_g and F_q. Our paper has a table that compares four choices for both flows. Our findings are that the generative flow should be powerful and the inference flow should be slightly less powerful.

To understand our noise model, we look at the mutual information
at every time step and show a box plot for all of them. Initially the mutual information is highest, which means the initial character is the most important to remember. The noise model is never ignored, and we see increased variance in the remaining time steps because we are averaging across different sequences.

The non-autoregressive model needs to have lower entropy in the observation model, because any entropy under the observation model is forgotten, since there's no feedback. The purple line shows you the observation model entropy during training. The dashed red line shows you the entropy of the observation model of a baseline. So indeed we have lower entropy in the observation model, and at the same time, in green, the mutual information is increasing.

Let's summarize our findings. Using variational flows, non-autoregressive modeling of sequences is possible, and teacher forcing is not necessary. At the same time, we get a noise model that is a driving factor of the sequence model and is easy to interpret. For any details, please check out the paper, and if you have any questions, shoot me an email.
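
The generative model and the change of variables step described in the talk can be written out as follows. This is a sketch reconstructed from the spoken description, not copied from the paper: the fixed initial state h_0, the symbol J_t for the Jacobian, and writing the inverse of F_g in its noise argument as F_g^{-1}(h_{t-1}, h_t) are notational assumptions made here.

```latex
\begin{align}
p(w) &= \int p(w \mid h)\, p(h)\, \mathrm{d}h, \\
p(w \mid h) &= \prod_{t=1}^{T} p(w_t \mid h_t), \qquad
p(h) = \prod_{t=1}^{T} p(h_t \mid h_{t-1}), \\
h_t &= F_g(h_{t-1}, \xi_t), \qquad \xi_t \sim \mathcal{N}(0, I), \\
p(h_t \mid h_{t-1}) &= \mathcal{N}(\xi_t;\, 0, I)\, \left|\det J_t\right|^{-1},
\qquad \xi_t = F_g^{-1}(h_{t-1}, h_t), \quad
J_t = \frac{\partial F_g(h_{t-1}, \xi)}{\partial \xi}\Big|_{\xi = \xi_t}.
\end{align}
```

The last line is the standard change of variables formula applied to the conditional flow: the density of a new state is the Gaussian density of the noise vector that produces it, corrected by the Jacobian determinant of the flow.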

Original Description

We present a stochastic non-autoregressive RNN that does not require teacher-forcing for training. The content is based on our 2018 NeurIPS paper: Deep State Space Models for Unconditional Word Generation https://arxiv.org/abs/1806.04550
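
As a rough illustration of what such a non-autoregressive RNN looks like at generation time, here is a minimal NumPy sketch. It is not the authors' implementation: the parameter names (`W_shift`, `W_log_scale`, `W_out`), the vocabulary size, the zero initial state, and the random untrained weights are all assumptions made for illustration, and the transition uses a diagonal shift-and-scale flow, which is only a simple special case of the flows mentioned in the talk.

```python
import numpy as np

STATE_SIZE, VOCAB_SIZE = 8, 27  # state size 8 as in the talk; vocabulary size is made up

# Hypothetical, randomly initialised parameters (a trained model would learn these).
_init = np.random.default_rng(0)
W_shift = _init.normal(scale=0.5, size=(STATE_SIZE, STATE_SIZE))
W_log_scale = _init.normal(scale=0.1, size=(STATE_SIZE, STATE_SIZE))
W_out = _init.normal(scale=0.5, size=(VOCAB_SIZE, STATE_SIZE))


def transition(h_prev, xi):
    """Shift-and-scale flow conditioned on h_prev; invertible in xi."""
    shift = np.tanh(W_shift @ h_prev)
    log_scale = np.tanh(W_log_scale @ h_prev)
    h_new = shift + np.exp(log_scale) * xi
    log_det = log_scale.sum()  # log |det d h_new / d xi| of the diagonal Jacobian
    return h_new, log_det


def inverse_transition(h_prev, h_new):
    """Recover the noise vector xi that maps h_prev to h_new."""
    shift = np.tanh(W_shift @ h_prev)
    log_scale = np.tanh(W_log_scale @ h_prev)
    return (h_new - shift) * np.exp(-log_scale)


def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def generate(num_steps, rng):
    """Sample tokens; the sampled token is never fed back into the state."""
    h = np.zeros(STATE_SIZE)  # assumed fixed initial state
    tokens = []
    for _ in range(num_steps):
        xi = rng.standard_normal(STATE_SIZE)  # white noise input
        h, _ = transition(h, xi)              # new state from old state and noise only
        probs = softmax(W_out @ h)            # observation model p(w_t | h_t)
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens


if __name__ == "__main__":
    print(generate(5, np.random.default_rng(1)))
```

Because `generate` never passes a sampled token back into `transition`, there is nothing to replace with ground truth during training, which is why teacher forcing does not apply to this kind of model.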
