RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yannic Kilcher · Advanced ·📄 Research Papers Explained ·6y ago

Key Takeaways

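The video walks through RoBERTa, a robustly optimized BERT pretraining approach, and compares its performance with other methods when pretrained on the original BERT data and on larger corpora such as the CC-News dataset drawn from Common Crawl. It also covers the design choices the paper explores: Adam optimizer settings, masking strategy, the next sentence prediction loss, batch size, and text encoding.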

Full Transcript

Hello everyone. Today we're looking at RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu et al., mainly of Facebook Research. This is a pretty short, pretty simple paper, and the main premise is this: we've seen a number of improvements over the initial BERT paper, where different pre-trainings of the transformer architecture, or extensions of the architecture, have been shown to perform better than the original BERT model. This paper basically says that if you get the design choices right, BERT is on par with or exceeds all of these other methods so far. So they're exploring design choices in the pre-training and training of BERT.

If you don't know what BERT is, by the way, I have made a video about BERT, and I've also made a video about transformers. In very quick terms, BERT is a neural network architecture for language that takes text as input, such as the kind of thing you see here, and encodes it. It can then do various things, for example classify the text into certain categories, segment it, extract answers to questions, and so on. The whole thing is pre-trained with what's called a masked language model objective, for which you don't need labels. In a masked language model objective you mask out certain words during training and then ask BERT to reconstruct these words from the surrounding information. That gave some improvements in the original BERT paper, but subsequent papers, such as XLNet, have claimed that you can improve even more by using different pre-training objectives and so on.

Here, these researchers explore different things. They use a regular BERT architecture, which is what they describe here: both the 12-layer BERT base and the 24-layer BERT that was originally described. They use masked language modeling as a pre-training objective, and they explore the necessity of the next sentence prediction (NSP) loss that has been part of BERT. Along with masked language modeling, BERT has also had an objective where you input two pieces of text, two sentences, and BERT has to decide whether the second sentence follows the first in the corpus; in 50% of the cases the second sentence is sampled from a different document. The original paper argued this is necessary to incorporate long-distance relationships between texts. As it says here, the NSP objective was designed to improve performance on downstream tasks such as natural language inference, and this paper explores the necessity of that loss.

In terms of optimization, there is of course a pre-training scheme and then a training scheme, using Adam here with certain parameters, and this paper also explores the choice of these parameters. Lastly, you have data. These models are sometimes trained on different data, which makes them harder to compare, because the pre-training is done on differently sized and differently structured data. This paper also investigates the influence of the training data, and especially what happens if we keep the training data constant.

All right, so they reimplement BERT, and then they fix some hyperparameters while they tune others. First of all, the datasets. The original BERT was trained on BookCorpus plus English Wikipedia, which is 16 gigabytes large. This paper collects the CC-News dataset, which is a subset of the Common Crawl news dataset; the subset is the English portion, and that's 76 gigabytes, which I believe is on par with, for example, what GPT-2 used. So this is a very large training set, and comparing the original data to this large corpus should make very clear what the influence of more pre-training data is. They also have a number of other corpora, OpenWebText as well as, I believe, one more, Stories. These are also pretty sizable, but they have very specific schemas to them.

The evaluation happens on several different downstream tasks. The idea is that you first pre-train this BERT model with masked language modeling and so on, and then you have the GLUE benchmark, which is actually a collection of nine tasks, and some other tasks such as SQuAD, which is a question answering task, and RACE, which I don't know in particular. Suffice it to say these are downstream NLP tasks. The paper isn't about these downstream tasks; they are just a way to measure how well your pre-training worked, by fine-tuning on such a task and seeing whether you get good performance. What the tasks are in particular isn't too important.

Here we get into the meat of the paper. First, they look at what they call static versus dynamic masking. In the original BERT paper, whenever they do masked language modeling, they take a piece of text and replicate it a bunch of times, because they want to iterate through the training data a bunch of times, and in each replica they mask out different tokens. That is static masking. They compare this to what's called dynamic masking, where you generate your mask on the fly; you don't precompute it and save it. This allows you to go through more or less of the data as you want. With static masking, even though you replicate each sample, you could still encounter the same sample again if you train for longer than the number of replications, and then you see the exact same mask again. Dynamic masking is much more useful and much more ad hoc: each time you see a sample, you generate the mask on the fly. They compare the two here, where higher is better, and they see a marginal improvement on two tasks and a small decrease in performance on one task. So they decide that dynamic masking is of use.

The second thing they investigate is the input format and next sentence prediction. As I already said, the original BERT training objective always gets two sentences next to each other and has to decide if the second one follows from the first. Actually, it doesn't: it observes two concatenated document segments, which are either sampled contiguously from the same document or from distinct documents, half and half. In addition to masked language modeling, the model is trained to predict whether the observed document segments come from the same or distinct documents via an auxiliary next sentence prediction loss. They investigate different ways of including or excluding this loss.

First, if a variant here says +NSP, that means it includes the next sentence, or next segment, prediction loss. They have SEGMENT-PAIR+NSP, which means that each input has a pair of segments. The distinction between a segment and a sentence is important: a sentence is really a natural sentence, whereas a segment can be multiple natural sentences, which is what the original BERT does. As long as the combined length is less than 512 tokens, there can be multiple sentences, but there are clearly two segments, and you have to decide whether they follow each other or not. The second thing they try is the same thing, the next segment prediction, but now with just two natural sentences: one sentence, a period, and then the next sentence, a period, and you have to decide whether they follow each other or not.

Then they investigate FULL-SENTENCES, where they leave out the next segment prediction loss and simply fill up the 512 tokens with text from the corpus. Each input is packed with full sentences sampled contiguously from one or more documents. "One or more documents" means that if you sample text, put all of it into the input, and reach the end of a document, you simply continue with the next one and go on until you have 512 tokens. So you fill and fill until you have 512 tokens. In the last variant, called DOC-SENTENCES, you do the same thing, but you stop at the end of the document. Even if the input isn't full, you stop there and have to be content with padding the rest of the 512 tokens, or something like this. You don't have as much data, but all the text in one sample is contiguous text from the same document.

They pit these four variants against each other in this table, and as you can see, the best is DOC-SENTENCES, followed by the FULL-SENTENCES encoding. There are some ambiguities here, but in general you can rank them as best, second best, third best and fourth best. They conclude that the next segment, or next sentence, prediction loss is more hurtful than helpful in the ways we see here. And even though DOC-SENTENCES is the most effective, in their case they'd rather go with FULL-SENTENCES, because, I guess, it's easier to implement, you get more data through the model in the same time, and the performance decrease isn't that large. It's pretty interesting to see that this next sentence prediction isn't super helpful in actuality. So removing the NSP loss matches or slightly improves the downstream task performance. This is in contrast to what the original BERT authors found, but you have to keep in mind that this setup also has a bunch of other changes in it.

The next thing they investigate is batch size. Batch size seems to be pretty interesting for these large models, in that they love large batch sizes. They explore batch sizes from 512 as the smallest one up to 8,000. They do this in a data-parallel way, where there are many machines with many GPUs; they parallelize the data and then accumulate the gradients of all of these different samples, so they can go up to a batch size of about 8K. They find generally that the 2,000 batch size, as you can see here, where lower is better for perplexity and higher is better for the other numbers, helps to improve performance if you control for dataset size. That is, the number of times you go through the dataset is the same, but going through it with a larger batch size seems to help, up to a point: 2,000 seems to be the best they found. So again, this is a marginal improvement you can make by training with larger batch sizes.

The last thing they look at is text encoding: how do you encode text? The choice here is basically between byte pair encoding and WordPiece encoding, which decides how large your vocabulary is. As I understand it, they didn't find much of a difference between the different implementations of the text encoding. They decide to go with one, I don't even remember which one; I think they decided to go with byte pair encoding instead of WordPiece.

All right, so they combine all of this into RoBERTa, the robustly optimized BERT approach. They say RoBERTa is trained with dynamic masking, which is what they showed first, FULL-SENTENCES without the next segment prediction loss, large mini-batches, and a larger byte-level byte pair encoding, as well as, of course, their collection of training data. Here they also investigate how long to pre-train. If you look at the original BERT models or the XLNet models and compare them to RoBERTa: with the original data, RoBERTa already beats BERT, but does not yet beat XLNet. If they add data, they get even better, mostly on par with XLNet. If they pre-train longer, they get even better. And if they pre-train even longer, so that the number of steps matches the number of steps that XLNet uses, with their additional data, then they outperform XLNet as well.

So this is just an overview. They also evaluate on other downstream tasks, and they show that on most of them they can reach or exceed state-of-the-art performance with their approach. In conclusion, they basically say this shows that the gains these other models make, and the reasons why they make gains, may be questionable: if you simply pre-train BERT in a better way, you can reach the same performance. So I think the end is not reached yet. Most of all, they publish their code and their data, I believe; I have not looked into this, but definitely check out their repository where this is implemented. It seems pretty easy and pretty straightforward. And that was it for me. Bye-bye.
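The static versus dynamic masking comparison described in the transcript can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `n_copies` parameter, and the simplification of always substituting a `[MASK]` token are assumptions made for illustration, and the 15% masking rate is the value reported in the original BERT paper rather than anything stated in the video.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere.
    Simplification (assumption): every selected token becomes [MASK].
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def static_masking(tokens, n_copies, seed=0):
    """Precompute n_copies masked versions of a sequence once, before training."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(n_copies)]

def static_epoch_view(precomputed, epoch):
    """Static masking cycles through stored copies, so masks repeat after n_copies epochs."""
    return precomputed[epoch % len(precomputed)]

def dynamic_epoch_view(tokens, rng):
    """Dynamic masking draws a fresh mask every time the sequence is fed to the model."""
    return mask_tokens(tokens, rng)

if __name__ == "__main__":
    sentence = "the model has to reconstruct the masked words from context".split()
    copies = static_masking(sentence, n_copies=2)
    for epoch in range(4):
        print("static ", epoch, static_epoch_view(copies, epoch)[0])
    rng = random.Random(42)
    for epoch in range(4):
        print("dynamic", epoch, dynamic_epoch_view(sentence, rng)[0])
```

With static masking, epoch 2 shows the same mask as epoch 0 because only two copies were stored; with dynamic masking nothing is stored, so training can run for any number of passes without reusing a precomputed mask.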

Original Description

This paper shows that the original BERT model, if trained correctly, can outperform all of the improvements that have been proposed lately, raising questions about the necessity and reasoning behind these.

Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

https://arxiv.org/abs/1907.11692

YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Minds: https://www.minds.com/ykilcher
BitChute: https://www.bitchute.com/channel/10a5ui845DOJ/

The video discusses RoBERTa, a robustly optimized BERT pretraining approach, and its performance compared to other methods on different datasets. The paper explores the influence of training data and hyperparameters on BERT pretraining and finds that the next sentence prediction (NSP) loss hurts more than it helps. Of the batch sizes compared, about 2,000 gave the best results: larger batches help up to a point, but the largest batch size tested was not the best.
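The two input formats that drop this loss, which the video calls FULL-SENTENCES and DOC-SENTENCES, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: documents are lists of already tokenized sentences, the function names are hypothetical, over-long sentences are simply truncated, and no padding or special tokens are added.

```python
def pack_full_sentences(documents, max_tokens=512):
    """Fill each input with contiguous sentences, continuing into the next
    document when one ends (the FULL-SENTENCES idea from the video)."""
    inputs, current = [], []
    for doc in documents:
        for sent in doc:
            # Start a new input when the next sentence would not fit.
            if current and len(current) + len(sent) > max_tokens:
                inputs.append(current)
                current = []
            # Assumption: a sentence longer than the limit is truncated.
            current = current + sent[:max_tokens]
    if current:
        inputs.append(current)
    return inputs

def pack_doc_sentences(documents, max_tokens=512):
    """Same packing, but never cross a document boundary (DOC-SENTENCES),
    so inputs near the end of a document may be shorter than max_tokens."""
    inputs = []
    for doc in documents:
        inputs.extend(pack_full_sentences([doc], max_tokens))
    return inputs

if __name__ == "__main__":
    docs = [
        [["a", "b"]],           # document 1: one sentence
        [["c"], ["d", "e"]],    # document 2: two sentences
    ]
    print(pack_full_sentences(docs, max_tokens=4))
    print(pack_doc_sentences(docs, max_tokens=4))
```

In the toy example the first format lets tokens from document 2 share an input with document 1, while the second keeps every input inside a single document at the cost of a shorter first input.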

Key Takeaways

Pretrain a language model with masked language modeling, and test whether the next sentence prediction loss is actually needed
Explore the influence of training data and hyperparameters on language model performance
Evaluate the performance of language models on different datasets
Use dynamic masking and byte pair encoding to improve language model performance
Experiment with different batch sizes to find the optimal value
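On the batch-size point, the video mentions that very large batches are reached by splitting the data across many GPUs and accumulating the gradients. A minimal sketch of why that works is below, using a toy one-parameter least-squares model rather than anything from the paper; the function names and the equal-shard assumption are illustrative only.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, n_shards):
    """Split a batch into equal shards (one per hypothetical worker),
    compute each shard's gradient, and average the results."""
    if len(batch) % n_shards != 0:
        raise ValueError("this sketch assumes the batch divides evenly into shards")
    size = len(batch) // n_shards
    shards = [batch[i * size:(i + 1) * size] for i in range(n_shards)]
    return sum(grad_mse(w, shard) for shard in shards) / n_shards

if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
    print(grad_mse(0.5, data))             # gradient of the full batch
    print(accumulated_grad(0.5, data, 2))  # same value, computed in two shards
```

Because the loss is a mean over samples, averaging the gradients of equal-sized shards gives the same update direction as one large batch, which is what lets the effective batch size grow beyond what a single device can hold.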

💡 The next sentence prediction (NSP) loss hurts more than it helps, and the best batch size tested for pretraining was around 2,000.
