Universal Adversarial Perturbations and Language M… | Pamela Mishkin | OpenAI Scholars Demo Day 2020
Key Takeaways
The video discusses Universal Adversarial Perturbations (UAPs) and their application to language models, demonstrating how UAPs can be used to create triggers that affect model predictions and exploring the implications for model interpretability and robustness. The presentation covers techniques such as the hot clip attack and the use of word vectors to represent language in a continuous space.
Full Transcript
great thank you hi everyone I'm Pamela welcome to this production of Hamilton on Disney plus today we'll be talking about that is the laugh line no I got we're talking about adversarial attacks on NLP models so a lot of my work in the program has been about critically thinking about how we motivate this work this talk will mostly be a commentary and analysis about the state of the literature on adversarial attacks in NLP so as some background the image space has robust literature I'm using targeted manipulations and model interpret ability to understand and things like adversarial attacks controllable generation and intersectional bias these push there's also push from policy makers to understand how marvelous work and what failure states exist models exist in the wild we're beholdin for their bias and their failure States everyday and so I sort of wanted to strengthen my understanding of how we quantify those failure States so a lot of this work is motivated from a 20-19 paper from Alan AI led by a team with Eric Wallace where they define they call a universal adversarial trigger which is a short phrase that can cause a specific model prediction and when concatenated to any input from a dataset and the kind of productive fresh your slides so that we get the most updated one as it seems to be stuck stop is that there there we go okay well if you try so the paper demonstrates the triggers transfer between models they're both model agnostic and inferred agnostic what does this look like so let's say we have a sentiment classifier can you see the diagram okay um so let's say we have a sentiment classifier um given the input the movie was awful we'd expected to classify that as negative um but a trigger would look like something like and this is a trigger we found when we ran this some of you uh invigorating captivating um so that appended to the movie was awful would flip the resultant classifier from negative to positive it seems like we're still having a little bit of trouble with your slides I'm not showing a blank it takes a little bit to show now you're finally know I'm fine okay can you see them now still okay I might have to do you might have to see my whole browser I guess is the punchline does that is that right yeah slide and if it doesn't refresh really quickly we might move your slides for you okay so so this would flip the output from negative to positive and they sort of clear why this is a failure in the case of the classifier right we know what the output should be this is this is changing the output to the wrong answer in a language model it's a bit less clear so what a language model does is given an input like the movie was we'd expect it to complete that text in a way that makes sense given the contact so the movie was a great film yeah okay response from a language model what a trigger would do here is when we append some tokens wrap you know fun basketball football all sports words to the movie was the language model would instead complete that with text about sports and it's kind of less clear whether we should consider this a failure this trigger in particular prepended or appended to the input may not be content preserving on a simply this many words about sports might change the meaning or intent of the original input that's another example adding a six token string to the end of any Shakespeare play shouldn't result in hate speech even if the play is the Merchant of Venice so we want to sort of understand how stealthy we need to make these triggers to make the the language model qualify as a failure um and that question is clearer in other spaces like audio or vision we can use whether a trigger is perceptible or imperceptible by humans as a guide we don't have a tool to easily assess in perceptibility for language um but one thing we may want to try is making trigger stealthier so both as short as possible and constrain to language that makes semantic or natural sense if I saw this trigger in the wild I'd say something's up you don't just throw a bunch of sports words together um whereas well it's difficult so where it's difficult to say whether the behavior in this slide should be considered a mistake or not something like this we're just dependent cats - the movie was and language models except dogs we could probably say is wrong so how do we find these triggers great questions um so we can't apply the techniques develop in the vision space directly to this problem for one language is discrete whereas images can be continuous as a way of seeing that think about a rainbow goes from red to orange to yellow we touch every color in between we don't have the language to describe all of those colors in between so we approximate this is the hot clip attack and this slide is which you can hopefully I'll see it um it's taken from the original paper so we come up with a neutral trigger in this case the the and we'll append that to a batch of examples we then back prop on the gradient maximized and we'll likelihood at the class we're trying to flip - so here you see a bunch of positive examples of that film and we're trying to flip those to negative and we do this for some number of iterations we go from the two movie Apollo spider to zoning tapping themes in the end so we do it for some of our iterations or until we don't see any changes in the loss and just to note in the language model case our loss might maximize for example the likelihood of the target outputs we're trying to find a trigger for so one condition on it any user input we should reach those target outlets and using this we replicated the results in the original paper across of tasks so sentiment analysis natural language inference squad and GBD two attacks on these tasks we see that accuracy drops close to zero and you can see also that we when we allow the trigger length to increase attack becomes more potent so the less selphie the trigger the better it is a viewing this increasingly the results on the right is with random attacks rather than hop flip and it also works pretty well so it's unclear whether we need to do all of this maximizing the feature space great so why would we want them attack like this that's a good question um I think this is where my work deviates from Wallace is a bit the motives for an adversary to engage in this kind of attack or weak the clearest use of universal trigger is that we may not have access to the target model or the particular input at runtime because Universal attacks do not require a white box access and work on any input they can be easily distributed even without technical knowledge but it's still kind of unclear how you to use them so for example the threat model is posed by a very deliberate and unlikely attack you come up with the universal adversarial trigger you hack into a GPT to server you append the trigger to all inputs come and go for the server and you kind of watch a wreak havoc but if your real goal is to wreak havoc you could also just write some hate speech and post it online which is far easier in a car as far less technical know-how and if we look at how G adversaries use gbg to it in the model in the wild not that many people were really engaging in text like this just as one example if you search for GPT 2 on YouTube some of the first results are people being like how do I use this to boost my channel but with largely neutral comments not with how do I attack someone else's channel with hate comments so as another motivation we can sort of think about what triggers as examples of failure states of our model so we've approached if not reached human level accuracy on a number of tasks returning to the sentiment analysis we were talking about before here you see that we're approaching 100% accuracy honest - um what a few questions remain in this like 3% that we're missing what are we missing how robust are these models and what has the model really learned and how generalize generalizable is it so - dick on on the first question our robust er models in real life language undergoes perturbations all the time you might say your friend wow that movie was so good and they might miss the sarcasm and check someone's game she said the movie was really good and the classifier will then say that that was a positive review these deviations could take multiple forms they can be adversarial triggers they could be random noise a typo they could be structured nuanced tone is an example sarcasm obfuscation hedging a lot of questions along the way or sort of data set bias second question how generalizable our models how prone are they to memorization we know that Elam's learn from any potential data sources the model design also influences how likely they are to generalize from that data or memorize that data so we see in low resource languages on google translate given kind of nonsense triggers the language model will devolve into sort of satanic verses from the Bible and this can be weird when it comes to Bible verses but also creepy when it means to memorize personal data as it came from Berkeley showed so we want to sort of like see triggers as examples or ways to sort of get at answering these questions so I tried to do that first I looked at the stealthiness question of just like how good can we make the trigger given this threat model and how and what happens when we try and do it so we were able to replicate the results of the original paper the decreased accuracy on classification tasks top tasks and random attacks tended to work as well when we tried to force the triggers to be more stealthy by sampling directly from GPT - instead of using hot flip the results were less promising so I don't want to say definitively that there don't exist stealthy triggers that will flip SSD - for example but we weren't able to find them in the few techniques we tried as another one movie sort of looked at these results in the paper that were forcing triggers to get G B D G to devolve into generating hate speech in our experiments coming up with triggers to generate hate speech wound up with triggers that were largely hate speech in and of themselves notably these triggers transferred to GPT three as well you saw that there are twenty percent of cases they also create a cape speech on GP g3 and they also regularly highlight particular people so I'm not going to show examples of hate speech right now but at a lot of the racism triggers we saw the word Coulter presumably my friend and Poulter you saw Hannity presumably referring to Sean Hannity and so while these are public figures and we don't necessarily just be concerned about that being private information about them it is interesting to note there's sort of aspects of those particular people but G PDQ has learned in Sanko sucked we also saw the hate speech triggers target at a particular protected class produced outputs against other classes as well so racism and sexism triggers on GPT three produce ablest at homophobic text so we also were considered how they applied to charge but slightly less polarizing text so we showed that triggers exist for other top fixed looked at vaccines and brexit um though our evaluation suggested that they're slightly less potent so here are two of the triggers we found for vaccinations and and we appended them to we used generate on their own we use this first seven lines of the Merchant of Venice and the first one at the Merchant of Venice prepended and appendage to trigger and here are some of the results so you can see that the first one it has to do with vaccines which is expected but in and of itself is not an SI backs whereas all of the text in the target text was an tee box um the second appended to the merchants in Venice you sort of lose the meaning of the trigger altogether um we see that this is not Shakespearean language but is a key PDQ approximation of it but a penalty just one line we do sort of still see a mixing up the queue so this isn't the best example but we saw a lot of examples of kind of Shakespearean talk about sickness or illness for the things you expect vaccines to be associated with so whereas a racism trigger would always produce racist content an auntie backs trigger won't always please auntie backs on it so in conclusion um targeted perturbations are a rich area in the image space and if you used to create generative models and as a Terp rebill 'ti tools they are to apply this language this method to language we need clear normative goals for LMS and and all these systems we did show that models are still brittle even some of the best trigger is transferred to gbg3 in more study of the triggers we find would be an interesting direction to take this so why do they behave in this way and why do they tell us about how models learn in general in this one way one direction we might take this for gbg3 and few shot learning you're given a task description and a number of examples but the class description is written by humans with an idea of how they would bring that task we can instead consider the task description as a trigger and back up on these examples to find sort of maximize the likelihood that the model will be able to perform this task um so my thinks everyone particularly my mentor Alec who I'm sure is did not want me to thank him it's great and my fellow scholars were great and Christina Mariah and everyone I chatted with on slack throughout the day really fun program so you and I time Thanks it is there a way to represent languages continuous I mean we sort of cast language into a continuous space using word vectors but it's and we can sort of see a continuous like aspects learned of that language so for example we take one of these vectors for king - Queen and we might get man - woman um I was representing the same thing but that's all what approximation so yeah if you think of a language while senator Specter couldn't attacker be motivated by changing the sentiment um I think they could it's just again and like if you were to knock into a system and upend this trigger to any task by the time you're in assist them doing that you could just force the output you want um I don't know why it's lower I have a sense of why it worked which is we were using a very brittle classifier or just looking at simple lsdm um sorry I guess I should read the questions why do you think the rate at which the accuracy drops off in terms of the trigger length is lower for random than non-random fares um I don't have a sense of why it's lower I think it works because we were using a pretty brittle classifier we're tagging sentiment on pretty short sentences anyway so do like the classifier is just getting distracted by the additional words regardless of what those words are seemed to be uncommon or even nonsensical phrases on the inputs every means of trigger or out of distribution detection I think that's definitely a reasonable question I think that's why that's all another reason I think it's that you can to frame this work as these triggers aren't going to be seen in the wild though absolutely if you did encounter one in the wild you automatically be able to talk to them I'm wary of anything that sort of like suggests we just build it attack the detector on top of a bottle because if you look at the image space every defense you come up with for one of these attacks another attack just emerges in its place I think we can instead see the full class of things we consider attacks or defenses and sort of say that actually tells us something interesting about how these models work and we can recache the problem in that way hope there's nothing okay could you experiment with the granularity of the triggers um did a little I also tried to do more or and it didn't work so we did I did for the GPD to trigger is try and sample directly from GPD to unique language that didn't feel jarring to encounter in the wild and they don't work nearly as well and when they do it sort of makes sense that they work um we think it's an example of the language model doing what we'd want them to be wouldn't what we'd want it to do
Original Description
Learn more: https://openai.com/blog/openai-scholars-2020-final-projects#pamela
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from OpenAI · OpenAI · 45 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
▶
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Robots that Learn
OpenAI
Emergence of Grounded Compositional Language in Multi-Agent Populations
OpenAI
OpenAI + Dota 2
OpenAI
Dendi vs. OpenAI at The International 2017
OpenAI
Competitive Self-Play
OpenAI
Learning a Hierarchy
OpenAI
Physical Spam Detection
OpenAI
Ingredients for Robotics Research
OpenAI
OpenAI Five
OpenAI
OpenAI Five: Dota Gameplay
OpenAI
Learning Dexterity
OpenAI
Learning Dexterity: Uncut
OpenAI
OpenAI Five Benchmark: Post-Game Analysis
OpenAI
Investigating Model Based RL for Continuous Control | Alex Botev | 2018 Summer Intern Open House
OpenAI
Generative Modelling | Sadhika Malladi | 2018 Summer Intern Open House
OpenAI
A pathway to more efficient generative models | Will Grathwohl | 2018 Summer Intern Open House
OpenAI
Learning Dexterity | Alex Ray | 2018 Summer Intern Open House
OpenAI
Robust Vision-Based State Estimation | Hsiao-Yu 'Fish' Tung | 2018 Summer Intern Open House
OpenAI
Using Semantic Trees In Place of Sentences | Munashe Shumba | OpenAI Scholars Demo Day 2018
OpenAI
Reinforcement Learning with Prediction-Based Rewards
OpenAI
OpenAI Spinning Up in Deep RL Workshop
OpenAI
Arena Announcement and Closing | OpenAI Five Finals (6/6)
OpenAI
Co-Op Match | OpenAI Five Finals (5/6)
OpenAI
OpenAI Five vs. OG, Game 2 | OpenAI Five Finals (4/6)
OpenAI
OpenAI Five vs. OG, Game 1 | OpenAI Five Finals (3/6)
OpenAI
Pre-Match Panel Discussion | OpenAI Five Finals (2/6)
OpenAI
Opening Keynote | OpenAI Five Finals (1/6)
OpenAI
OpenAI Robotics Symposium 2019
OpenAI
OpenAI Scholars Demo Day 2019
OpenAI
Multi-Agent Hide and Seek
OpenAI
Solving Rubik’s Cube with a Robot Hand: Uncut
OpenAI
Solving Rubik’s Cube with a Robot Hand: Perturbations
OpenAI
Solving Rubik’s Cube with a Robot Hand
OpenAI
Music Generation | Christine Payne | OpenAI Scholars Demo Day 2018
OpenAI
Deephypebot | Nadja Rhodes | OpenAI Scholars Demo Day 2018
OpenAI
Physics Net | Ifu Aniemeka | OpenAI Scholars Demo Day 2018
OpenAI
Art Composition Attributes + CycleGAN | Holly Grimm | OpenAI Scholars Demo Day 2018
OpenAI
Generating Emotional Landscapes | Hannah Davis | OpenAI Scholars Demo Day 2018
OpenAI
Looking For Grammar In All The Right Places | Alethea Power | OpenAI Scholars Demo Day 2020
OpenAI
Semantic Parsing English to GraphQL | Andre Carerra | OpenAI Scholars Demo Day 2020
OpenAI
Long term credit assignment with temporal reward transp… | Cathy Yeh | OpenAI Scholars Demo Day 2020
OpenAI
Social learning in independent multi-agent reinfor… | Kamal N’dousse | OpenAI Scholars Demo Day 2020
OpenAI
Quantifying Interpretability of Models Trained on Coi… | Jorge Orbay | OpenAI Scholars Demo Day 2020
OpenAI
Towards Epileptic Seizure Prediction with Deep Network | Kata Slama | OpenAI Scholars Demo Day 2020
OpenAI
Universal Adversarial Perturbations and Language M… | Pamela Mishkin | OpenAI Scholars Demo Day 2020
OpenAI
Introductions by Sam Altman & Greg Brockman | OpenAI Scholars Demo Day 2020
OpenAI
Introduction by Sam Altman | OpenAI Scholars Demo Day 2021
OpenAI
Breaking Contrastive Models with the SET Card Game | Legg Yeung | OpenAI Scholars Demo Day 2021
OpenAI
Large Scale Reward Modeling | Jonathan Ward | OpenAI Scholars Demo Day 2021
OpenAI
Words to Bytes: Exploring Language Tokenizations | Sam Gbafa | OpenAI Scholars Demo Day 2021
OpenAI
Learning Multiple Modes of Behavior in a Continuous… | Tyna Eloundou | OpenAI Scholars Demo Day 2021
OpenAI
Scaling Laws for Language Transfer Learning | Christina Kim | OpenAI Scholars Demo Day 2021
OpenAI
Contrastive Language Encoding | Ellie Kitanidis | OpenAI Scholars Demo Day 2021
OpenAI
Characterizing Test Time Compute on Graph Structur… | Kudzo Ahegbebu | OpenAI Scholars Demo Day 2021
OpenAI
Studying Scaling Laws for Transformer Architecture … | Shola Oyedele | OpenAI Scholars Demo Day 2021
OpenAI
Feedback Loops in Opinion Modeling | Danielle Ensign | OpenAI Scholars Demo Day 2021
OpenAI
Creating a Space Game with OpenAI Codex
OpenAI
“Hello World” with OpenAI Codex
OpenAI
Talking to Your Computer with OpenAI Codex
OpenAI
Data Science with OpenAI Codex
OpenAI
More on: LLM Foundations
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
The AI Hype Cycle: Calm Before the Next Breakthrough?
Medium · Programming
AI won’t replace scientists. It will make the current model of science obsolete
Medium · Data Science
The End of Knowledge: Why Artificial Intelligence Is Changing Not Only What We Know, but What It…
Medium · AI
Japan Gave the World Robots, Bullet Trains, and PlayStation. So Why Is It Losing the AI Race?
Medium · AI
🎓
Tutor Explanation
DeepCamp AI