Scaling Laws for Language Transfer Learning | Christina Kim | OpenAI Scholars Demo Day 2021
Key Takeaways
The video discusses scaling laws for language transfer learning, focusing on pre-trained English language models transferred to other languages (Chinese, Spanish, and German). The project comes from the OpenAI Scholars program and uses GPT-2-style decoder-only transformers. The speaker explores how pre-training, fine-tuning dataset size, and model size affect transfer learning across languages.
Full Transcript
Hi everyone, I'm Christina Kim, and I'm really excited to present my Scholars project on scaling laws for language transfer learning. Throughout the OpenAI Scholars program, I was really interested in questions around data: what characteristics and attributes does it have, and how do they impact model performance? So for my project, I looked at how the scaling laws look for pre-trained English language models as we transfer to other languages.

Historically, the advancement of deep learning capabilities has centered around three levers: better algorithms, faster and cheaper compute, and larger, higher-quality datasets. Given machine learning's potentially significant impact on society, deepening our general understanding of machine learning, and of how certain factors improve models, is critical for making better predictions of which capabilities are going to develop next, and when. Further, the exploration of scaling laws across these three factors has created a way to measure their impact as they interact with and limit each other.

My project's framework is inspired by the work on scaling laws that OpenAI published in the past year. Scaling laws predict machine learning performance as a function of model size, dataset size, and the amount of compute used for training. You can think of compute, dataset size, and model size as different limiting factors that you can change to get better performance. Recently, scaling relationships were also found for transfer learning from pre-trained English text models to Python.

Scaling laws for transfer are important because the scaling relationships can help explain how to work in a limited data regime. In an ideal world, you have an infinite amount of data for your models to learn from, by which I mean that you're only limited by the other two factors, compute and model size. But getting a large quantity of high-quality data is a non-trivial task, and
it's oftentimes near impossible. As a result, most problems that we want to study are actually in this low data regime. Before the Scholars program, I was a machine learning engineer, and I saw firsthand how costly it is, in both time and money, to get good quality data. Evaluating these trade-offs is a pretty important and practical question that many researchers and practitioners have to handle.

So, building upon the work from Scaling Laws for Transfer, my experiments try to answer the question: how much does pre-training actually help when we're transferring across different languages, those being Chinese, Spanish, and German, and what does that look like as we vary the dataset size and model size?

For my experiments, I first had to pre-train English language models. I pre-trained decoder-only transformers ranging from 124 million non-embedding parameters down to my smallest model size, which was 3.3 million non-embedding parameters. I trained these all on OpenWebText2, which is an open-source version of WebText, the dataset used to train GPT-2. I used the same hyperparameters as the original Scaling Laws for Neural Language Models paper, except I used a 500-step warmup and a cosine decay to 10% of the max learning rate. The text was encoded with the same GPT-2 tokenizer, which is a byte-level byte pair encoding with a vocab size of 50,000.
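The learning-rate schedule described above (a 500-step warmup followed by cosine decay to 10% of the max learning rate) could be sketched roughly as follows. This is an illustrative sketch, not the code used in the project: the linear shape of the warmup, the `max_lr` value, and `total_steps` are assumptions, since the talk only gives the warmup length and the decay floor.

```python
import math


def lr_at_step(step, max_lr=1e-3, warmup_steps=500,
               total_steps=10_000, final_frac=0.1):
    """Learning rate at a given optimizer step (0-indexed).

    max_lr and total_steps are hypothetical placeholder values; only the
    500-step warmup and the 10% floor come from the talk.
    """
    if step < warmup_steps:
        # Assumed linear warmup: ramps up and reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from max_lr down to final_frac * max_lr rather than to zero.
    return max_lr * (final_frac + (1.0 - final_frac) * cosine)
```

With these placeholder values, the rate peaks at `max_lr` at the end of warmup and settles at one tenth of that value once `total_steps` is reached.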
All the models were trained to about 26 billion tokens. As you can see here, my models exhibit scaling laws similar to what was found in Scaling Laws for Neural Language Models, except this line isn't quite linear here, and that kind of indicates that maybe my largest models are a bit undertrained.

After getting my pre-trained models, I next set up my fine-tuning experiments. For these, I wanted to focus on changing the number of tokens of data while holding performance, which in our case was cross-entropy loss, and model size constant. The dataset size spanned six orders of magnitude while the model sizes spanned two orders of magnitude, and I trained on three different languages: Chinese, Spanish, and German. For the Chinese dataset, I used a dataset called community QA, which is similar to the WebText corpus, and for German and Spanish, I got the data from OSCAR, which is a multilingual corpus obtained by classifying the Common Crawl corpus.

In my experiments, the thing that I really wanted to measure was the effective data transfer: what does that look like when we go from training on English text to Chinese, Spanish, and German text? The effective data transfer can be measured like this: this is the amount of fine-tuning data needed to get to this loss when we're using a pre-trained model, and then this purple dotted line is the amount of additional data that we would need to get to that same loss when we're training from scratch at this dataset size. It's important to note here that, as you can see, the amount of data transferred from pre-training gets smaller as we increase the number of tokens in the fine-tuning dataset, and eventually, for this model, it converges around 10 million tokens.

So I wanted to show you what it looks like when we actually compare these three languages. This is the exciting bit here, and you can see that for the pre-trained
English models, they help the most when we're learning German, versus Spanish and Chinese. That kind of makes sense, because I think these results reflect a lot about the linguistic similarities between English and these other languages. English and German are both derived from Proto-Germanic and are linguistically most similar. Although Spanish shares many of the same symbols as the English alphabet, it's actually in a different family of languages. And then, obviously, Chinese has a very, very different alphabet than the English alphabet, and it's very distinct there.

Another thing I want to highlight here is the shape of the lines and the distance between them. As you can see, the effective data transfer for Spanish and Chinese is not too different at this initial point here, for a dataset size of 8,000 tokens. However, as we increase the dataset size, we can see that pre-training continues to help Spanish for another order of magnitude compared to Chinese.

Another way to think about how much data from pre-training is actually useful is to think about the fraction of effective data from fine-tuning. The smaller this fraction is, the more pre-training has helped us. As you can see in these graphs, as the model size increases, this fraction decreases across all languages, which means that pre-training has become more effective. But as we increase the dataset size, this fraction increases across model sizes, and that means pre-training has become less effective. A lot of the results on this graph show the same points that I brought up on the previous slide about how far apart these distributions may be from each other. You can see that the German graph has steeper curves compared to Spanish and Chinese, and I think that indicates that there's more transfer happening for German compared to the other two languages.

Another interesting thing that we found was that pre-training helps
most in low data regimes. In a low data regime, pre-training is most helpful across model sizes, but especially in the smaller model sizes. You can see here that, as I increase the model size with a fixed dataset size of Chinese text to fine-tune on, models trained from scratch on Chinese did not improve, while the models pre-trained on English continued to achieve better performance. These flat lines here are where we're data limited in this setup, versus when we start to see an increase in the slope, where we're now parameter limited.

Another important thing to note is that using pre-trained models is way more compute efficient than training from scratch, and you can see this here for this one model size and this one dataset size.

I want to talk about some limitations that my experiments had. The first one is that I used the same tokenizer for all languages. This is an issue because, as I mentioned before, the tokenizer had a 50k vocab size, and Chinese has over 50,000 characters in its language, so that means a lot of the tokenization is probably quite inefficient, and this could impact model performance quite a bit. For future work, you'd want to train your own tokenizers and then do transfer learning from there.

Another point is that it looks like, from my original plots for the pre-training, maybe I could have been pre-training for longer; then I think I could have gotten a more linear line for some of the scaling laws that I saw for the OpenWebText models. Another thing that I would want to do is a more thorough hyperparameter sweep and learning rate sweep, as I believe that both of these limitations would cause very, very different results, and I believe the numbers that I've gotten in the previous slides would be very different had I found the optimal learning rates for the different dataset sizes and model
sizes. One other note is that the data for the languages came from different sources, so I think this experiment could be more thorough if I had used the same dataset source for all three of the languages.

I want to talk about some future work that I'm really excited about after this project. One thing that could be really interesting is to compare the effective data transfer when we use pre-trained models of a different language to transfer back to English. Then you could maybe create some kind of mapping of how far apart the distributions are from each other: is there some kind of symmetry in the data transfer, and what does that actually look like? Another obvious next step would be to use this setup to do work in low-resource languages, or other tasks and distributions that are quite different from English. Another thing that would be very cool to do based on this work would be to predict the ideal ratio of pre-training versus fine-tuning for any given problem, for some compute budget that you have. Another thing that I think would be interesting, in the same format of experimentation, would be studying the forgetting problem in transfer learning and seeing what the effective data transfer looks like as we approach this problem.

Before I answer questions, I want to give some thanks. I want to thank JT for sharing his wisdom with me throughout the program, keeping our project on track, and staying up late now from Poland to hear this; my fellow scholars, especially Danielle and Kudzo, for sharing compute with me; and everyone that gave me feedback throughout the process and program, especially Danny. A shout-out to OpenAI for making all this possible.

Great, so now I'll answer some questions that I have here. I have a question that says: which model architecture was used for transfer learning across models, and which one was trained from scratch? The model architecture that I used is the same as the GPT
transformer, which is a decoder-only transformer.

I have a question that says: how would you extrapolate what kinds of gains from pre-training you'd get from models smaller or larger than you've been training, or from smaller and larger datasets? I think you would see trends similar to the ones in my previous slides for the different dataset sizes. The main takeaway is that if you have a large fine-tuning dataset, you're not going to get as many gains as you would get from a much smaller fine-tuning dataset.

Another question is: how is my setup related to the Scaling Laws for Transfer paper by Danny Hernandez from earlier this year? A lot of my work is super inspired by Danny's experiments there. I did the same type of experimentation, where I was changing the dataset size as I was varying the model sizes and comparing the loss between those.

This last question says: did you consider transfer between other types of languages, say programming languages? I would actually say that you should check out the Scaling Laws for Transfer paper, because that does look into how English transfers to Python.

I got another question that says: did you get a chance to study performance on metrics other than loss? I didn't, but I'd be kind of curious to see how you could characterize this on downstream tasks, and I think that's a pretty big thing to look at for transfer learning in particular.

There's a question that says: would you like to use a different tokenizer in the future? Yeah, definitely. I think using a tokenizer trained on the specific languages would get you much better results, and therefore probably much cleaner graphs.

And then a question that says: was there any reason you decided not to train models smaller than two million parameters? Not particularly. I just thought much
much smaller models than that would result in losses that weren't maybe that interesting to look at, since they would be parameter limited very quickly.

Awesome, so I think that's all my time, so I'm going to pass it off to Danielle, who's going to be presenting her project.
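The effective data transfer measurement described in the talk (how much extra data a from-scratch model would need to match the loss of a fine-tuned pre-trained model) could be computed along these lines. This is a minimal sketch under stated assumptions: the function names, the log-linear interpolation between measured points, and all the numbers in the usage example are hypothetical and are not taken from the talk's experiments.

```python
import math


def effective_data(scratch_curve, target_loss):
    """Tokens a from-scratch model would need to reach target_loss.

    scratch_curve: list of (tokens, loss) pairs sorted by tokens, with
    strictly decreasing loss. Interpolating linearly in log(tokens)
    between measured points is an assumption of this sketch.
    """
    for (d0, l0), (d1, l1) in zip(scratch_curve, scratch_curve[1:]):
        if l1 <= target_loss <= l0:
            t = (l0 - target_loss) / (l0 - l1)
            log_d = math.log(d0) + t * (math.log(d1) - math.log(d0))
            return math.exp(log_d)
    raise ValueError("target loss is outside the measured from-scratch curve")


def transfer_summary(scratch_curve, finetune_tokens, finetune_loss):
    """Effective data, data transferred, and the fine-tuning fraction."""
    d_e = effective_data(scratch_curve, finetune_loss)
    return {
        "effective": d_e,
        # Extra tokens that pre-training was "worth" at this loss.
        "transferred": d_e - finetune_tokens,
        # Smaller fraction means pre-training helped more.
        "fraction_from_finetuning": finetune_tokens / d_e,
    }


# Hypothetical from-scratch losses at four dataset sizes (made-up values).
scratch = [(1e4, 6.0), (1e5, 5.0), (1e6, 4.0), (1e7, 3.0)]
# Hypothetical pre-trained model fine-tuned on 1e4 tokens reaching loss 4.5.
summary = transfer_summary(scratch, finetune_tokens=1e4, finetune_loss=4.5)
```

In this made-up example the from-scratch model would need about 3.2e5 tokens to reach the same loss, so the fraction of effective data from fine-tuning is roughly 0.03.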
Original Description
Learn more: https://openai.com/blog/openai-scholars-2021-final-projects#christina