Scaling Laws for Language Transfer Learning | Christina Kim | OpenAI Scholars Demo Day 2021

Key Takeaways

The video discusses scaling laws for language transfer learning, focusing on English language models pre-trained on OpenWebText2 and then fine-tuned on Chinese, Spanish, and German. The speaker, an OpenAI Scholars participant, examines how pre-training, fine-tuning data set size, and model size affect transfer across languages, using decoder-only transformers and the GPT-2 tokenizer.

Full Transcript

Hi everyone, I'm Christina Kim, and I'm really excited to present my Scholars project on the scaling laws for language transfer learning. Throughout the OpenAI Scholars program, I was really interested in questions around data: what characteristics and attributes are there, and how does that impact model performance? So for my project, I looked at how the scaling laws look for pre-trained English language models as we transfer to other languages.

Historically, the advancement of deep learning capabilities has been centered around three different levers: better algorithms, faster and cheaper compute, and larger, high-quality data sets. Given machine learning's potential significant impact on society, deepening our general understanding of machine learning and how certain factors improve models is critical for making better predictions of which capabilities are going to develop next, and when. Further, the exploration of scaling laws evidenced across these three factors has created a way to measure the impact of these three as they interact and limit each other.

My project's framework is inspired by the work on scaling laws which was published by OpenAI in the past year. Scaling laws predict machine learning performance, as I said, as a function of model size, data set size, and the amount of compute used for training. You can think of compute, data set size, and model size as different limiting factors that you can change to get better performance. Recently, scaling relationships were found for transfer learning from pre-trained English text models to Python. Scaling laws for transfer are important because the scaling relationships can help explain how to work in a limited data regime.

In an ideal world, you're going to have an infinite amount of data for your models to learn from, and by that I mean that you're only limited by the other two factors, compute and model size. But getting a large quantity of high-quality data is a non-trivial task, and
it's oftentimes near impossible. As a result, most problems that we want to study are actually in this low-data regime. Before the Scholars program, I was a machine learning engineer, and I saw firsthand how costly it is in both time and money to get good-quality data. Evaluating these trade-offs is a pretty important and practical question that many researchers and practitioners have to handle. So, building upon the work from Scaling Laws for Transfer, my experiments try to answer the question: how much does pre-training actually help when we're transferring across different languages, those being Chinese, Spanish, and German, and what does that look like as we vary the data set size and model size?

For my experiments, I first had to pre-train English language models. I pre-trained decoder-only transformers ranging from 124 million non-embedding parameters down to my smallest model size, which was 3.3 million non-embedding parameters. I trained these all on OpenWebText2, which is an open-source version of WebText, which was used to train GPT-2. I used the same hyperparameters from the original Scaling Laws for Neural Language Models paper, except I used a 500-step warmup with a cosine decay to 10% of the max learning rate. The text was encoded with the same GPT-2 tokenizer, which is a byte-level byte pair encoding with a vocab size of 50,000.
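
The learning rate schedule mentioned here (a 500-step warmup followed by a cosine decay to 10% of the maximum learning rate) can be sketched as below. This is a minimal illustration and not the speaker's code: the linear shape of the warmup, the function name `learning_rate`, and the `max_lr` and `total_steps` arguments are assumptions, since the talk does not give those details.

```python
import math


def learning_rate(step, max_lr, total_steps, warmup_steps=500, final_fraction=0.1):
    """Warmup followed by cosine decay down to final_fraction * max_lr.

    Assumption: the warmup is linear. The talk only states its length (500 steps).
    max_lr and total_steps are hypothetical inputs chosen by the caller.
    """
    if step < warmup_steps:
        # Ramp up linearly so that the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)

    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_fraction * max_lr
    return min_lr + (max_lr - min_lr) * cosine
```

Calling `learning_rate(step, max_lr, total_steps)` once per optimizer step gives the value to apply at that step; the floor at 10% keeps late updates from shrinking to zero.
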
All the models were trained to about 26 billion tokens. As you can see here, my models exhibit scaling laws similar to what was found in Scaling Laws for Neural Language Models, except this line isn't quite linear here, and that kind of indicates that maybe my largest models are undertrained a bit.

After getting my pre-trained models, I next set up my fine-tuning experiments. For my fine-tuning experiments, I wanted to focus on changing the number of tokens in the data while holding performance, which in our case was cross-entropy loss, and model size constant. For these experiments, the data set size spanned six orders of magnitude, while the model sizes spanned two orders of magnitude. I trained on three different languages, which were Chinese, Spanish, and German. For the Chinese data set, I used a data set called Community QA, which is similar to the WebText corpus, and for German and Spanish I got the data from OSCAR, which is a multilingual corpus obtained by classifying the Common Crawl corpus.

In my experiments, the thing that I really wanted to measure was the effective data transfer: what does that look like when we are transferring from English text to Chinese, Spanish, and German text? The effective data transfer can be measured like this. This is the amount of fine-tuning data needed to get to this loss when we're using a pre-trained model, and this purple dotted line is the amount of additional data that we would need to get to that same loss when we're training from scratch on this data set size. It's important to note here that, as you can see, the amount of data transferred from pre-training gets smaller as we increase the number of tokens in the data set size that we're looking at, and eventually, for this model, it converges around 10 million tokens for the data set size.

I wanted to show you what it looks like when we actually compare these three languages, so this is the exciting bit. You can see that for the pre-trained
English models, they help the most when we're learning German versus Spanish and Chinese. That kind of makes sense, because I think these results reflect a lot about the linguistic similarities between English and these other languages. English and German are both derived from Proto-Germanic and are linguistically most similar. Although Spanish shares many of the same symbols as the English alphabet, it's actually in a different family of languages. And then obviously Chinese has a very, very different alphabet than the English alphabet, and it's very distinct there.

Another thing I want to highlight here is a bit about the shape of the lines and the distance between them. As you can see, the effective data transfer for Spanish and Chinese is not too different at this initial point here, for a data set size of 8,000 tokens. However, as we increase the data set size, we can see that pre-training continues to help Spanish for another order of magnitude compared to Chinese.

Another way to think about how much data is actually useful from pre-training is to think about the fraction of effective data from fine-tuning. The smaller this fraction is, the more pre-training has helped us. As you can see in these graphs, as the model size increases, this fraction decreases across all languages, which means that pre-training has become more effective. But as we increase the data set size, this fraction increases across model sizes, and that means pre-training has become less effective. A lot of the results on this graph show the same points that I brought up on the previous slide about how far apart these distributions may be from each other. As you can see, the German graph has steeper curves compared to the Spanish and Chinese, and I think that indicates that there's more transfer happening for German compared to the other two languages.

Another interesting thing that we found was that pre-training helps
most in low-data regimes. In a low-data regime, pre-training is most helpful across model sizes, but especially in the smaller model sizes. You can see here that, as I increase the model size with a fixed data set size of Chinese text to fine-tune on, models trained from scratch on Chinese did not improve, while the models pre-trained on English continued to achieve better performance. These flat lines here are where we're data limited in the setup, versus when we start to see an increase in the slope, where we're now parameter limited. Another important thing to note is that using pre-trained models is way more compute efficient than training from scratch, and you can see this here for this one model size and this one data set size.

I want to talk about some limitations that my experiments had. The first one is that I used the same tokenizer for all languages. This is an issue because, as I mentioned before, the tokenizer had a 50k vocab size, and Chinese has over 50,000 characters in its language, so a lot of the tokenization is probably quite inefficient, and this could impact model performance quite a bit. For future work, you'd want to train your own tokenizers and then transfer learn from there. Another point is that it looks like, from my original plots for the pre-training, maybe I could have been pre-training for longer; then I think I could have gotten a more linear line for some of the scaling laws that I saw for the OpenWebText models. Another thing that I would want to do is a more thorough hyperparameter sweep and learning rate sweep, as I believe that both of these limitations would cause very, very different results, and I believe the numbers that I've gotten in the previous slides would be very different had I found the optimal learning rates for the different data set sizes and model
sizes. One other note is that the data for the languages came from different sources, so I think this experiment could be more thorough if I had used the same data set source for all three of the languages.

I want to talk about some future work that I'm really excited about after this project. One thing that could be really interesting is to compare the effective data transfer as we use pre-trained models of a different language back to English. Then you can maybe create some kind of mapping of how far apart distributions are from each other: is there some kind of symmetry in the data transfer, and what does that actually look like? Another obvious next step would be to use the setup to do work in low-resource languages or other tasks and distributions that are quite different from English. Another thing that would be very cool to do based on this work would be to predict the ideal ratio of pre-training versus fine-tuning for any given problem, for some compute budget that you would have. Another thing that I think would be interesting in the same format of experimentation would be studying the forgetting problem in transfer learning and seeing what that effective data transfer looks like as we approach this problem.

Before I answer questions, I want to give some thanks to folks. I want to thank JT for sharing his wisdom with me throughout the program, for keeping our project on track, and for staying up late now from Poland to hear this; my fellow scholars, especially Danielle and Kudzo, for sharing compute with me; and everyone that gave me feedback throughout the process and program, especially Danny. A shout-out to OpenAI for making all this possible.

Great, so now I'll answer some questions that I have here. I have a question that says: which model architecture was used for transfer learning across models, and which one was trained from scratch? The model architecture that I used is the same GPT
transformer, which is a decoder-only transformer.

I have a question that says: how would you extrapolate what kinds of gains from pre-training you'd get from models smaller or larger than you've been training, or from smaller and larger data sets? I think you would just be able to see similar trends to what we saw in my previous slides for the different data set sizes. As I saw, the main takeaway is that if you have a large fine-tuning data set, you're not going to get as many gains as you would get from a much smaller fine-tuning data set.

Another question is: how is my setup related to the Scaling Laws for Transfer paper by Danny Hernandez from earlier this year? A lot of my work is super inspired by Danny's experiments there, so I did the same type of experimentation, where I was changing the data set size as I was varying the model sizes and comparing the loss between those.

This last question says: did you consider transfer between other types of languages, say programming languages? I would actually say that you should check out the Scaling Laws for Transfer paper, because that actually does look into how English transfers to Python.

I got another question that says: did you get a chance to study performance on metrics other than loss? I didn't, but I'd be kind of curious to see how you could characterize this on downstream tasks, and I think that's a pretty big thing to look at for transfer learning in particular.

There's a question that says: would you like to use a different tokenizer in the future? Yeah, definitely. I think using a tokenizer trained on the specific languages would get you much better results, and therefore probably much cleaner graphs.

And then a question that says: was there any reason you decided not to train models smaller than two million parameters? Not particularly. I just thought much
much smaller models than that would result in losses that weren't maybe that interesting to look at, since they would be parameter limited very quickly.

Awesome, so I think that's all my time, so I'm going to pass it off to Danielle, who's going to be presenting her project.

Original Description

Learn more: https://openai.com/blog/openai-scholars-2021-final-projects#christina

This video teaches the importance of scaling laws in language transfer learning, highlighting the effectiveness of pre-training and fine-tuning for low data regimes. The speaker explores the relationships between model size, data set size, and compute used for training, providing insights into transfer learning across languages.
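
The relationship between loss and model size mentioned in the summary is usually written as a power law, which is why the talk expects a straight line on log-log axes. The sketch below assumes the form loss = (scale / size) ** alpha and fits it by ordinary least squares in log space. The function name `fit_power_law` and its arguments are hypothetical, and no numbers from the talk are used.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = (scale / size) ** alpha by least squares on log-log axes.

    Assumption: the data follow a single power law. Curvature in the log-log
    plot (as the speaker saw for her largest models) means this form fits poorly.
    Returns (alpha, scale).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Slope and intercept of the best-fit line log(loss) = intercept + slope * log(size).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # log(loss) = alpha * log(scale) - alpha * log(size)
    alpha = -slope
    scale = math.exp(intercept / alpha)
    return alpha, scale
```

Comparing the residuals of the largest models against the fitted line is one simple way to check for the undertraining the speaker describes.
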

Key Takeaways

Pre-train English language models (decoder-only transformers using the GPT-2 tokenizer)
Fine-tune pre-trained models on target languages
Measure effective data transfer from pre-training to fine-tuning
Explore scaling laws for transfer learning
Investigate performance on metrics other than loss (not studied in the talk; suggested as a follow-up)
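
The effective data transfer measurement listed above can be sketched as follows. The idea from the talk is to take the loss a pre-trained model reaches after fine-tuning on a given number of tokens, find how many tokens a same-size model trained from scratch would need to reach that loss, and treat the difference as data transferred from pre-training. The function names, the log-linear interpolation between measured points, and the input format are assumptions made for illustration.

```python
import math


def effective_data(scratch_curve, target_loss):
    """Tokens a from-scratch model needs to reach target_loss.

    scratch_curve: (tokens, loss) pairs for one model size, with loss
    decreasing as tokens grow. Assumption: interpolate linearly in log10(tokens).
    """
    points = sorted(scratch_curve)
    for (d_lo, loss_lo), (d_hi, loss_hi) in zip(points, points[1:]):
        if loss_lo > loss_hi and loss_hi <= target_loss <= loss_lo:
            t = (loss_lo - target_loss) / (loss_lo - loss_hi)
            log_d = math.log10(d_lo) + t * (math.log10(d_hi) - math.log10(d_lo))
            return 10 ** log_d
    raise ValueError("target_loss is outside the range of the from-scratch curve")


def effective_data_transferred(scratch_curve, finetune_tokens, finetune_loss):
    """Return (tokens transferred from pre-training, fraction of effective data from fine-tuning)."""
    total_effective = effective_data(scratch_curve, finetune_loss)
    transferred = total_effective - finetune_tokens
    fraction_from_finetuning = finetune_tokens / total_effective
    return transferred, fraction_from_finetuning
```

A smaller returned fraction means pre-training contributed more, which matches how the talk reads its fraction plots.
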

💡 Pre-training helps most in low data regimes and with smaller model sizes, but its effectiveness decreases as data set size increases.
