Frontiers in ML: Learning from Limited Labeled Data: Challenges and Opportunities for NLP

Microsoft Research · Advanced ·📐 ML Fundamentals ·5y ago

Key Takeaways

The video discusses challenges and opportunities in learning from limited labeled data for NLP, covering topics such as weak supervision, semi-supervised learning, cross-lingual transfer learning, and domain adaptation, with a focus on retrieval augmented generation and fine-tuning techniques using pre-trained language models such as BERT, Turing, and GPT.

Full Transcript

[Music] Hi everyone, and welcome to the Learning from Limited Labeled Data session in the Frontiers of Machine Learning event. My name is Ahmed Awadallah, I'm from Microsoft Research AI, and today I'm going to be talking to you about how we try to bring AI experiences to everyone by overcoming the challenges of limited labeled data.

As an information worker, you have to deal with a lot of sources of information in order to be productive, from your documents and your slides to your email and calendar. If you are a developer, you also have your code, your pull requests, and bugs. If you are a salesperson, you also have your customer data and leads, and so on and so forth. At Microsoft we have been working very hard on trying to leverage AI to help everyone be more productive, whether by understanding the content of your email messages to better prioritize them or recommend actions for you to take on them, or by better recommendations that try to predict what you are trying to accomplish right now and make the right information available to you at the right time. We strive to build intelligent experiences quickly and efficiently in a way that allows us to reach more markets, languages, domains, and tasks.

However, recent machine learning techniques, specifically deep learning, require large amounts of training data. If you look at the figure I'm showing on the screen, you will see the performance of a BERT-based model on one of the GLUE benchmark datasets as more and more training data becomes available. We have seen that curve over and over again: even with large-scale pre-trained language models, we still need a lot of training data for the task at hand in order to achieve the desired performance. So if we want to reach more languages, markets, and domains, what would be the best way around that? Can we just annotate all the data that we need? If we are only interested in 100 tasks, and each task requires a
moderate amount of training data, on the order of tens of thousands of examples, but we also want to support 100 languages for 100 different organizations, the numbers add up pretty quickly and we find ourselves faced with the need to collect hundreds of billions of annotated examples. This is not only hard, time-consuming, and expensive, but in many cases it is also not feasible, because the private nature of the data will not allow us to create manual annotations for it.

So let's think about the different phases we go through as we are developing an AI-based experience. At the beginning we typically have a limited amount of annotated data for the task we are interested in, but we might have large amounts of eyes-off data and associated metadata. Eyes-off data is data that is accessible to us but that we cannot see or annotate, so it is always unlabeled, and the associated metadata could be metadata related to the task, such as user behavior signals. In that phase, techniques like weak supervision, where we leverage low-quality labels based on user behavior or other metadata as sources of supervision for machine learning, can be very helpful. Additionally, techniques like semi-supervised learning, which leverage unlabeled data by using structural assumptions about the data itself, could also be of significant help.

Once we have a version of our experience working in a limited set of languages or domains, we can start thinking about expanding to new low-resource languages or domains. In that phase, cross-lingual transfer learning techniques, which try to leverage knowledge available in resource-rich languages to help low-resource ones, are very useful, as are domain adaptation techniques that try to do the same for new domains.

When our experiences are deployed and users are interacting with them, that provides us with a very valuable source of knowledge: user interaction data, which comes in the form of implicit and explicit feedback about how humans
experience our systems. This data could be leveraged to learn directly from implicit and explicit feedback to improve the system, and also to build learning-to-correct models that are able to recover from mistakes our models make. Finally, all of these techniques are built on top of, and leverage, recent advances in pre-trained language models such as BERT, Turing, and GPT.

So now let's double-click on one of them and talk a little bit more about weak supervision. I'm specifically interested in the case where we can leverage user behavior data that might be readily available to us as a source of weak supervision. This real-world data beyond content can still be very helpful as we think about building natural language processing models. Many such applications have a lot of information like user actions, current context, and so on. If we take Office 365 as an example, we have hundreds of millions of active users creating and interacting with trillions of entities like emails and events. Just as modern web search engines have been very successful at efficiently leveraging user behavior data such as queries, clicks, and sessions, we think we can leverage user behavior signals to build better machine learning models operating on user content.

Going back to the assumptions I was describing earlier, we assume that we have a limited amount of clean annotated data, but we can use user behavior and interactions to generate a much larger amount of weakly annotated data, and our objective is to leverage both sources to train task-specific models. Specifically, we have been finding a lot of success in using a meta-learning framework to concurrently leverage the noisy data and the clean data to improve our models. More specifically, you can imagine a setup where we are co-training two models at the same time: our main model, which is trying to solve the task at hand, and another model, a meta-model
that's trying to take in the noisy data and correct or reweight it in a way that makes it more useful for the main model. When we co-train these two models concurrently, we are able to significantly improve the way we leverage the noisy data to improve our systems.

We have applied these techniques to multiple applications. For example, for the task of email intent identification, we can collect a small amount of annotated data, on the order of hundreds or a few thousand examples, but we can collect a much larger amount of weakly labeled data derived from behavior signals such as reading an email, flagging an email, creating a calendar item, attaching a document, and so on. When we leverage the clean and the weak data together, you can see that we achieve much better performance than using either of them in isolation. The weak data tends to be noisy, and deep learning methods tend to be very good at modeling the data, including the noise, so co-training the models on both clean and weak data at the same time allows us to extract the signal from the noisy data in a way that improves our overall performance without being significantly affected by the noise. You can see that we achieve pretty good gains in different scenarios, even when the clean data is as small as one percent of the total data, and with larger amounts of clean data as well.

There are some scenarios where you don't have a single source of weak supervision but multiple sources providing multiple labels for the same instance. We applied that to a dataset for fake news detection, where weak labels are derived from behavior signals such as sharing, commenting, sentiment, and so on, and where we have multiple weak supervision signals for the same instance. You can see that we can also effectively leverage the multiple sources of supervision, and that having multiple sources of supervision can help further.

Now, going back to our picture, I also wanted
to spend some time talking about another scenario: learning to correct. Machine learning models will always make mistakes, and they will never be perfect. If we are able to build collaborative AI systems that enable fast and progressive interaction between the human and the model, allowing them to collaborate to improve and solve more complex tasks, then we can work around these limitations. One way to do that is to allow the human to interact with the model to identify and correct mistakes.

We studied this problem in the context of semantic parsing, where you are trying to ask questions against structured data, such as data in a database. The semantic parser's job is to take in a natural language query and generate a parse in SQL that it can execute against the database. What we see here is a case where the user asked a question and the system came back with an interpretation of the question, but the system made a mistake in its interpretation. If the user is able to provide open-form natural language feedback describing what the mistake is and how it can be fixed, it can significantly help the model narrow down where to look in order to fix the mistake. We have found that in this application many of the mistakes can be addressed and corrected by seeking feedback from the human. Many of the mistakes were conditions that were dropped or added by mistake, information that should have been included but was not, or mistakes in deciding how to order the results or how to aggregate over them.

In order to do that, we had to construct a dataset that allows us to go beyond the parsing problem, where we take in a question and generate a parse, and think about the correction problem, where you take a question specifying an intent from the user, an incorrect interpretation generated by a system, natural language feedback provided by the human trying to correct the mistake, and finally
the correct parse that we would like to get to. We can see that even with simple methods we can correct more than 25 percent of the mistakes using just one round of feedback. Our model here leverages existing models in semantic parsing: we basically take the semantic parse that has been generated and try to edit it to correct it, leveraging the feedback that we have collected from the human. Correcting 25 percent of the mistakes is very valuable given that it takes only one round of feedback, but we still have a long way to go because, as estimated by human performance, up to 80 percent of the mistakes can be corrected by leveraging the open-form feedback.

Going back to this picture, you see that there are different ways that could allow us to alleviate the challenge of labeled data scarcity. We just briefly talked about two of them and how we are applying them to some of the problems we are interested in at Microsoft, such as productivity applications that help you with task completion, or question answering and semantic parsing systems that help you quickly interact with data and find the right information. But there are many other interesting directions that we can pursue, and I'm very excited about the three invited talks that we will have in the session, where we will hear from experts in the field about different directions related to the theme of our session. Dr. Marti Hearst from UC Berkeley will start by describing some of her lab's work on building text summarization systems without text summaries, leveraging techniques like reinforcement learning that can learn directly from external reward signals. Dr. Graham Neubig from Carnegie Mellon will then talk about how we can expand our models to the long tail of the next 1000 languages. Finally, Dr. Alex Ratner from the University of Washington will describe some of the work that he has been doing over the years on building machine learning models with weak supervision and applying it to a diverse
set of domains, such as medical applications or knowledge base construction. We're very excited about this session, and we hope that you will enjoy the three invited talks and participate in the discussion. Thank you so much.

Well, thank you, Ahmed, for that wonderful talk, and thank you everyone for joining the session. As a reminder, if you aren't paying attention to the chat window, we will actually be available in the chat window to answer questions. I'm Paul Bennett, a researcher at Microsoft Research, co-hosting the session with Ahmed. Next we're going to hear from Marti Hearst. Marti is a professor in the School of Information and the EECS department at Berkeley. Her primary research interests are in search engines, information visualization, natural language processing, and MOOCs. She has been very active in writing on such topics as search user interfaces, is a fellow of the ACM, and has received many awards throughout her career. Today she's going to be talking, on this theme of limited data, about self-supervision and summarization, and she is also going to provide some context about summarization challenges in general. So with that, let's go to Marti's talk.

I'm Marti Hearst, and I'm very happy to talk about some research we've recently completed that we're calling summarization without summaries. This work was done primarily by PhD student Philippe Laban from UC Berkeley, with two other collaborators, Andrew Hsi from Bloomberg and John Canny from UC Berkeley. Here's a brief outline of what I'm going to talk about. First, the umbrella project is the NewsLens project, of which the summarization was one piece. Then we'll talk about automated summarization in general, and then about our particular approach to abstractive unsupervised summarization.

The overarching project is called NewsLens, and the goal is a better news reader. It's also a platform, or jumping-off place, for natural language processing research, including summarization like
I'll talk about today. NewsLens is available online for you to try, and it currently consists of a chatbot in a mobile app as well as a web-based interface that incorporates event-driven stories over time, and this incorporates the automated summaries that I'm going to talk about today. The NewsLens dataset is really quite large: it has been gathered over more than 10 years, and it includes more than 40 sources from around the world and over 7 million articles. We use this dataset for part of the research I'll talk about today.

Next I'll talk about automated summarization, in particular extractive versus abstractive summarization, then prior approaches to abstractive summarization, and what our particular definition of a good summary is. If you're doing summarization, typically your goal is to take a long document and shorten it in some way while still retaining important information from that document. Consider the following news article, which I'm going to read through a bit since it is our running example, and say we want to reduce it to 20 to 25 words in length. The article is about Chilean president Sebastián Piñera, who announced on Wednesday that his country, which has been paralyzed by protests over the last two weeks, will no longer host two major international summits. Clashes at demonstrations in the capital, Santiago, have left at least 20 people dead and led to the resignation of eight key ministers from Piñera's cabinet. The president has now canceled the hosting of the economic APEC forum and the COP25 environmental summit, which were both due to take place later this year. This was in 2019.

So if you're going to make a 20-to-25-word summary, what would you want to include in it? If you're doing extractive summarization, you have to pull out text verbatim, so you pull out one or two sentences; that's pretty much all you can do. You have to pick among these three sentences and decide which one would make the best summary. So it's
extractive summarization: you pull out a sentence and that's your summary. "The president has now canceled the hosting of the economic APEC forum": maybe not the best summary. With abstractive summarization, by contrast, you identify key concepts and keywords that you want to include in the summary, and then you create a brand new summary by taking those terms and interweaving them with other glue words that make for a fluid, flowing, comprehensible summary: "Chilean president announced his country will not host the APEC forum and the COP25 anymore due to protests in Santiago." This was actually generated by our system. It's not perfect, but I think you might argue that it gives more information than the extractive summary while still being brief.

Abstractive summarization is quite appealing. You can tailor the length, and abstractive summaries are often shorter than extractive ones: because you don't have to pull out existing sentences verbatim, you can pack more key content into a short space. It can also count as a derived work, which is helpful for intellectual property issues. But the challenges are that it's much harder to automate abstractive summarization, and only recently has there been progress in this area. Furthermore, it's quite subject to error, and especially when summarizing news we don't want to make false statements about what happened.

So what are current approaches to abstractive summarization? The standard approach is to use a seq2seq model, where you encode the document and decode out a summary. What people do for abstractive summarization is take an existing, very large dataset of documents and their summaries and train a seq2seq model with teacher forcing. This reference, See et al., uses a pointer-generator network to do this. The benefit is that the model learns what's in the data, so they don't actually have to focus on summarization as a task; it's just a
standard kind of approach. But the limitations are that the results tend to be more extractive than abstractive when you actually look at them, and you can't really control the length: you can't say you want a 25-word summary. The output you get is based on the input that you trained on, and because these are supervised methods you need very large collections for training.

Another approach came out in 2017, by Paulus et al., which is to use the ROUGE metric, a standard evaluation metric for summarization, and actually optimize on that evaluation metric. ROUGE is easy to compute because it's just n-gram overlap between the summary and the reference document, or between the summary and reference summaries: how much does your summary overlap with the reference summaries? So the idea was, what happens if we directly optimize our summarizer on the ROUGE score? What happened was that it got a very high ROUGE score, so it was successful on that metric, but unfortunately, since ROUGE is only an approximation of what a good summary is, the summaries were poorly rated by people. Here's an example output from that work, where the first sentences read fairly well but are extractive, and the text highlighted in red is really not fluent and doesn't make a lot of sense. So these got poor ratings from people.

So we propose an alternative. We extend Paulus et al.'s work by building a better evaluation metric and optimizing for that instead of ROUGE. Our approach is to define what we think a good summary is and then train a reinforcement learning algorithm to optimize on those metrics. So what are our metrics? What is our definition of a good summary? We define a good summary as a brief, fluent text that covers the main points of the original document, and we'll be emphasizing these three points in the remainder of this talk.

So now I'll talk about how to summarize
without summaries: how we get coverage via masking and how we retain fluency, all within a reinforcement learning loop, and then we'll present some results.

Step one is that we need to compute what we call coverage. Summaries must contain keywords in order to be a good summary of a news article, so what we do is identify keywords from the document. Here I've highlighted some important keywords that we want to appear in a summary. We use a TF-IDF-type measure to select terms, and notice that all forms of the same word are identified: if "host" occurs in the original document, we also want to identify "hosting" and all occurrences of that term. We also want to identify entity names. We identify about 30 percent of the document's terms, although this is a tuned hyperparameter. Then we mask out those selected keywords, creating a version of the document with those terms masked out. The algorithm must then figure out which keywords were in the original document, but it has to extract them from the generated summary. This is different from how masking is typically used: we generate a summary, and then the algorithm has to figure out what the blanks in the original document are from the summary. This incentivizes the algorithm to put keywords into the summary. So if we have this generated summary, it fills in 10 of the 15 slots that we blanked out, highlighted in green. There are not 10 words highlighted in green in the summary, but that's because "host" covers both "host" and "hosting" among the masked-out keywords.

Now I'm going to start building up a diagram of the overall architecture of the algorithm. First of all, our goal is to have a brief, fluent text that covers the main points of the original document. So we give the summarizer as input a target length, which helps us enforce brevity, as well as the original document, and the summarizer generates a summary. Then we mask the document in
the manner I just described, and we feed the masked document into a coverage model. The coverage model then generates a document that has been filled in with its best guesses as to what the blanks should be, and those blanks are assigned a coverage score.

A little more detail about that. Those of you who know about BERT and its masked language model might see some similarities between that and the coverage model I'm talking about. The BERT masked language model blanks out a random percentage of tokens, usually 15 percent, and fills in the blanks using the rest of the unchanged tokens from the same document. What we do instead is blank out all occurrences of a set of key tokens: it's not random, it's motivated, and it's every occurrence. So with BERT you might blank out one occurrence of "Paris" and not another, but we blank out all occurrences of "Paris", and then we fill in the blanks using both the unchanged tokens from that document and the unmasked summary. Our algorithm does use a BERT model, and we fine-tune it on our coverage scores. Here's that in a bit more detail: the input is a summary, followed by a separator, followed by the masked document, and it is fed into a fine-tuned BERT whose goal is to identify the fill-ins for the blanks. You can see here that "Chile" was the wrong guess for the first mask, "president" was the right guess for the second mask, and so on. In this case, say the algorithm gets 33 percent of these right, so it gets a coverage score of 0.33.

The next step is to retain fluency. We've talked about coverage, and what does that do? It incentivizes finding keywords or content words, but this can lead to generating just a list of keywords, which isn't very appropriate for reading. So our goal is to balance content and fluency, and our approach is to optimize for both simultaneously. So we add to this
architecture a fluency model, which generates a fluency score. Finally, this is all incorporated into a training loop, in particular a reinforcement learning training loop. We use the SCST optimization procedure, self-critical sequence training, which was originally applied to image captioning, and we directly optimize the summary score, which is a combination of the fluency score and the coverage score with two parameters that are learned. The fluency model only sees the summary, and to obtain a score we use a language model that is fine-tuned on news, on the large news collection from NewsLens. As you see here, we do some normalization to put the fluency score within a certain range, so the final summary score is a weighted sum of coverage and fluency.

In more detail, the way training works is that we generate two candidate summaries, S1 and S2, with two different sampling methods; the details are in the paper. We compute a summary score for each of these, and then the gradients for the update are based on the REINFORCE algorithm, using the difference between these two scores, R1 and R2. There are some details about how, if one is lower than it should have been by default, that causes a change in the expectations. Essentially, we are increasing the model likelihood of the summary with the higher reward, which increases the expected reward.

Here is one screenshot of example training runs. What we're seeing is a trade-off between the fluency score and the coverage score, along with the overall summary score, trained over several days. At the very beginning we can get pretty high fluency using a language model, with very low coverage. There's a big spike in fluency, which then rapidly drops off as the coverage increases, so you very much see a trade-off between the two. The summary score shows the two being balanced against each other and then
very, very slowly increasing over time.

Now I want to show the effect of varying the target length: what kind of summaries you get depending on the length of summary you are outputting. These are summaries generated by the model. If the target length is 10, we get "Piñera canceled the APEC summit at Santiago." If we make the length 24, we get "Piñera said Chileans have been canceling the hosting of the APEC summit, which was scheduled to take place in November." If we give it more space, 45 words, we get "Sebastián Piñera announced Wednesday that his country will not hold the APEC summit, which was scheduled to take place in Santiago. Piñera said that Chileans have been paralyzed by protests over the last two weeks." That's a much better summary; it has more space. You see the coverage score increase as the length gets longer, which makes sense: there's more room for content words while retaining fluency. And you can see the dynamic nature: the algorithm is able to generate qualitatively different summaries depending on the length.

So let's look at some results and compare this algorithm to others using the standard measure, ROUGE. We first show supervised methods: the top ones, Pointer-Generator, Pointer-Generator plus coverage, and the Bottom-Up algorithm, get ROUGE-1 and ROUGE-L scores as shown here. Then come the unsupervised methods, of which ours is one: TextRank, which is extractive, GPT-2 zero-shot, and Summary Loop with length-45 summaries. We're doing better than the other unsupervised methods, and actually better than Pointer-Generator as well, and this is with no training data. We can also combine our algorithm with supervised data, and we get even better results. If you initialize a supervised algorithm with the Summary Loop model and then train on only 10 percent of the data, we do as well as GPT-2 on 100 percent of the data for the CNN/Daily Mail dataset. If you give Summary Loop 100 percent of the data, we actually do better than all of the other approaches.

Now, if we want to see how abstractive are the
summaries generated, some of these algorithms generate rather extractive summaries, so let's compare them. We look at how many errors are made and of what sort, whether inaccurate or ungrammatical (these are manually assessed), and then at how many abstraction techniques are used. These include compressing sentences, merging sentences, novel sentences, and entity manipulation. Bottom-Up has some more errors, but it also uses more abstraction techniques. The Summary Loop has a few more errors, so there's a trade-off, but it uses far more abstraction techniques than the other two, with a 57 percent technique-application success rate, much higher than the other two, and with far more abstraction. Here's an example of an abstractive summary it generated, where the red shows sentence merging and the blue shows sentence simplification. Sentence merging is taking the words on the left in red and making a new sentence, and the blue is a shortened sentence. You can see it's doing a lot of abstraction to make a shorter document.

To wrap up, we have some next steps. We're extending and adapting the approach to other text generation tasks, including text simplification and summary style adaptation. This is in collaboration with Microsoft Research, and Philippe Laban is doing an internship this summer at MSR. We also have the chatbot that I mentioned; this is also a paper in the ACL 2020 demo track, and I encourage you to check out the YouTube video. It uses question answering and generated abstracts.

So in summary, our contributions are summaries that do not require training examples, are highly abstractive, especially compared to the state of the art, have configurable length, and incorporate key content from the articles; and a new approach to reinforcement learning using fill-in-the-blank with motivated choices for terms, which balances coverage and fluency and makes use
of special techniques to fortify against degenerate cases, which I did not talk about here but which are in the paper. There's code available on GitHub if you'd like to try it out. So thank you for your attention. We want to thank our sponsors, Bloomberg, Amazon, NVIDIA, and now Microsoft Research, for supporting our work going forward, and I hope you check out the research in more detail.

Thank you so much for the very interesting talk, Marti. A reminder to everyone that Marti is available online, as are all the other speakers, so please feel free to submit questions in the live chat. Our next speaker is Graham Neubig. Graham is an associate professor at the Language Technologies Institute at Carnegie Mellon University. His work focuses on natural language processing, specifically multilingual models and models that allow us to build natural language interfaces for humans to communicate with computers in their own language. He publishes regularly in top venues in natural language processing and machine learning, and his work has won several awards, including at EMNLP, ACL, NAACL, and others. Most NLP work focuses on a few resource-rich languages such as English and French. In his talk today, Graham will talk to us about how we can expand that to the long tail of the next 1000 languages.

Hello, my name is Graham Neubig from Carnegie Mellon University, and I'm very happy to present today on lessons from the long tail: methods for NLP in the next 1000 languages. As we certainly know, natural language processing techniques have made great progress in the past several years, especially for English and other high-resource languages like Chinese and French. However, there are over 6,000 languages in the world, and for the great majority of them we have very little to none of the great language technology that exists for other languages. So why is this? The reason is that the machine learning techniques that have led to great strides on English or Chinese or these
other high-resource languages rely on large amounts of data for training, and if we look at the amount of data available for most of the languages in the world, we don't have anywhere near the amount that we would need. This is a graph of all the articles in Wikipedia, and we can see that the number of articles drops off very quickly, with the top 30 languages having many more articles than the remaining ones. Once we get down to 300 languages, we see that all the languages after that have no articles at all. So this is a dire situation, and it's even more dire if we look all over the internet, where more than half of the articles are in English.

So why should we worry about this long tail of languages that don't have very much data? One very important reason is that language is an inextricable part of our culture, and preserving these languages is very important to preserving and legitimizing the culture. Language technology can be a tool for this and a signal of the importance of the culture. In addition, for humanitarian aid, even rudimentary natural language processing can help us understand things in crisis situations. For example, in the current coronavirus crisis we are working on creating translation systems that would allow people to understand related information in the languages that they speak. Finally, I think it's just the right thing to do: people prefer to interact in their own languages, so we should let them.

One very strong tool in our toolbox to help scale to these languages is multilingual training. What we do here is basically take many different languages and feed them into a single natural language processing model. I'm going to talk about three case studies of work that we've been doing in this area, specifically tailored towards the languages on the very low end of the resource spectrum: universal phone recognition, linguistically motivated models for cross-lingual sharing, and balanced training for multilingual models. In all
of the sections there are a few takeaways. All of them are based on multilingual training methods. In addition, to scale down to these languages with very few resources, we need to use intuitions from linguistics or advanced machine learning techniques, and I'll outline these for each of the sections.

So first, universal phone recognition. One thing to know about very low-resource languages is that speech is paramount, and the reason for this is that most languages in the world are purely spoken, and thus the technology we use will need to go through speech. On the other hand, even for the least-resourced languages we are often able to obtain speech. This is an example of a speech collection effort by Steven Bird, where he has created a simple app that allows you to go to speakers of languages (one example is Augustine, a speaker of Tembé, a language in Brazil) and have them speak stories or other content in their languages into the app. Then you can take this and have another speaker, for example Emilio, who is a Portuguese speaker, translate that into another language such as Portuguese. However, while collecting this data is possible, this can also result in data graveyards. These data graveyards are basically speech data that is locked up in speech and never transcribed. Linguists spend a lot of time transcribing this data, but unfortunately human effort for doing this takes a lot of time. This wastes precious time and leads to poor relations with speakers when a linguist goes in and then can't provide anything to them for a long time, and speed is of the essence in these situations.

So phonetic transcription is the first step in building resources for a language, and this is basically where we take in a speech waveform and write down the sounds of the speech. This is an example from English: "last time I used the steel button". These sounds can be expressed in a number of ways, one way being phonemes, which are language-dependent units of
sounds denoted by slashes. But at the same time we can also write down phones, which are language-independent sounds denoted by square brackets. To give an example of this, one phoneme might correspond to multiple phones: "time", "steel", and "button" in North American English are all written with t's for parts of their phonemes, but actually these are different sounds, and you can try this yourself to see how they differ. Transcription is usually done in phonemes because this is easier for linguists to think about, but this can also be a problem for cross-lingual transfer because phonemes are language dependent.

So to give an example of a universal phone recognition model, we use some techniques to handle the fact that phonemes differ across languages in multilingual ASR training. One way that we could do this, and it has been used in previous work, is a private phoneme model, where basically we predict the phonemes for each language separately. The problem with this is that there's too little sharing between the languages, so each language is trained basically independently. Another thing you could do is have a shared phoneme model, where basically you predict all of the phonemes at the same time, but unfortunately this has too much sharing, because the shared phonemes actually are language dependent and differ from language to language. What we propose instead is to use a little bit of our linguistic knowledge and say: first we would like to recognize universal phones, and then we have a simple transformation to convert these into language-specific phonemes.

We tested this model on 11 high-resource languages, and then we also took the trained model and applied it to two new languages that had never been seen in the training data, Inuktitut and Tusom, which are actually very low-resource endangered languages. We evaluate the model on phone error rate, so lower is better, and what we were able to find is that on the 11 languages we were able to do just about as well as any other
model, but on the very, very low-resource languages our method, in yellow, was able to do much better than all of the others. In addition, if you have some idea about what phones tend to appear in the language, you can further improve these results.

Another issue in multilingual models is lexical sharing. What I mean by this is that even in very similar languages there might be small spelling variations. This is an example of Belarusian and Russian, where each of the words is very similar in spelling, with only a few small variations. Unfortunately, for computers even these small variations can cause them to think they're completely different words. In addition, there are script differences: for example, Turkish and Uyghur are languages from the same language family, but Turkish is written in Latin script and Uyghur is written in Arabic script. There are also morphology or conjugation differences, where different languages use different suffixes to indicate grammatical features of the word.

We've come up with a few methods to resolve these issues. One is a general-purpose method for lexical sharing between languages, and it's particularly suited for cross-lingual transfer. The way it works is that we take words and decompose them into their character n-grams. This allows us to find words that have similar spellings and ensure they have similar embeddings in embedding space, which handles spelling similarity. We then have a language-specific transformation, which allows us to handle consistent variations between languages, so this transform is different across languages. Finally, we have a semantic embedding library where we predict a particular embedding for each word, and this allows us to capture latent concepts. We do this essentially to model the case where words have very different spellings but correspond to the same concept. In machine translation for low-resource languages, we found
that this was significantly better than other options, such as using character n-grams only, and also, perhaps more importantly, that it significantly improves over subword-based encoding methods such as those used in multilingual BERT, a widely used model here.

To take this a step further in how we can incorporate linguistic information into our models, we note that a skilled linguist (for example David, a linguist at CMU) can create a reasonable morphological analyzer and transliterator for a new language in short order. Basically, this can take a script that's not written in phonemes, like I talked about before, convert it into its pronunciation, and assign its morphological tags. We then take these linguistic analyses and represent our word with phoneme n-grams, the lemma of the word, and its morphological tags. We were able to find that this gave good results on named entity recognition and machine translation over languages that were in different scripts and had different morphological features.

Finally, I'd like to talk about balancing training for multilingual models. As I mentioned before, we have very large problems of data imbalance when we're training multilingual models. One solution to this data imbalance that has been used before is temperature sampling. Basically, in terms of data size, this downsamples the most frequent languages and upsamples the least frequent languages when training our models. This is also a method that's used very widely in multilingual training, such as in multilingual BERT or multilingual neural machine translation models. What we ask in this work is instead: can we learn the data sampling strategy directly in order to maximize our accuracy? To do so, we turn to a method that we are going to be presenting at ICML very soon, which is differentiable data selection. This is a meta-learning method that allows us to learn a weighting of training data to optimize a held-out development
loss. What we mean by this is that we essentially have a data scorer that tries to predict how frequently we should be sampling data, and this data scorer is learned to minimize the development loss on the data set that we care about. The main idea is that the scorer should upweight data that has a similar gradient to the development data, so we calculate a reward for this data scorer based on the cosine similarity between the gradients on the sampled data and on the development set that we would like to optimize accuracy on.

So how can we apply this to learning multilingual data usage? The existing approach, as I mentioned before, is temperature-based heuristic sampling. The way this works is basically that we take the size of the training data for each language, exponentiate it by a temperature value, and use this as the weight for sampling data from each particular language. To use differentiable data selection instead, we directly parameterize the data scorer over the standard data set sampling distribution. So basically, instead of sampling by the data size and a heuristic temperature, we directly learn the sampling probability itself. We then optimize the model over a multilingual development set to make sure that the model learns to be good at processing all of the languages in the development set.

We performed experiments on multilingual neural machine translation, and we display here gains over a single-language baseline. We have temperature sampling, the kind of state-of-the-art method here, and we also have proportional sampling, where we sample each language according to its overall frequency in the data. The bars here are basically two different data sets, and many-to-one and one-to-many translation. What we can see is that in many-to-one translation, where the target is always English, proportional sampling works better, and in one-to-many translation proportional sampling does not work well. So what we see is basically that there's no consistently
strong strategy with respect to this. On the other hand, MultiDDS, our proposed method, and another method that tries to stabilize training with some tricks do significantly better than these baseline methods.

So I've given a very brief overview of some of our work, and I'd like to talk about what we know and what's next. Basically, we are currently building a powerful toolbox for cross-lingual learning; this is a very active research area. As I mentioned, data is a bottleneck, but in another way human resources are a bottleneck as well. This is an example of a paper count at 2018 NLP conferences by the country of the person who was publishing the paper, and what I think is really important to see here is that, for example, Africa and South America, despite their linguistic diversity, are not well represented on this map. So I'm really excited by efforts such as Masakhane NLP, which is an African initiative to try to get people from Africa working on NLP for the languages they're interested in. So thank you very much, and I'll be happy to answer any questions.

Thanks, everyone, for sticking with us. As you may have noticed if you're paying attention to the chat, we're having a few difficulties with posting messages. Our speakers are here, and we'll get to some of these questions in the Q&A if we can't respond to them there; hopefully we'll also be able to get this fixed during this session. After Graham's great talk, our next speaker is Alex Ratner. Alex did his PhD in the computer science department at Stanford; he has now moved to UW and is an assistant professor there. He's focused on real-world problems applied to many spaces, but in particular on taking methods such as weak supervision, making them more formal, and scaling them out with systems like Snorkel, and he's going to talk to us about some of those challenges today. So with that, let's take it away with Alex.

Hey, how's it going? So I'm Alex Ratner, and I'm going to be
talking today about some practical notes, observations, and some of the techniques that I've been developing through the Snorkel project, at UW now and also previously (I'll be covering work from when I was doing my PhD at Stanford, hence the pastiche of logos there), all around one approach to handling the lack of labeled data that is often such a bottleneck to machine learning progress today, via programmatic approaches to weak supervision. I'm going to have a special emphasis, in this more casual, high-level chat and preface to the Q&A coming up, on notes from the field, from lots of practical deployments at Stanford, at UW, and out beyond that. So I'll start with that, and here I'm going to go a little bit deeper than perhaps usual into the motivation of the problem, because I think there are some interesting practical notes about how practitioners actually are using weak supervision, both via systems and techniques like the ones I've worked on and via a myriad of other ones.

So I'll start with the 40,000-foot-level motivation, and again I think this will be redundant for this crowd, but it's that machine learning development really has a new bottleneck today, and it centers around the data that these models learn from, the so-called training data. At a high level, for a standard, let's say, IID classification problem, you have three main ingredients. You have some labeled training data that's labeled according to the annotation schema that you want to train the model to output. You have some kind of model architecture, and obviously algorithms to train it. And then you have the hardware and the infrastructure that this rests on. It used to be that the model and the features, the structure of the model and the model architecture, and all the hardware and infrastructure were where teams spent their time and got stuck when deploying machine
learning, you know, five, ten plus years ago. One of the most remarkable things that's happened over the last five years or so is the increasing availability, accessibility, and power of these last two steps. I often use the phrase "commoditization", and I think that's meant as a stunning positive for what the field has accomplished and what open-source offerings have accomplished, in that if I want to get, say, a state-of-the-art solution to what often used to be a grand-challenge problem in machine learning, like classifying images, I can do this in several lines of Python to get the latest and greatest model (or a wrong example, given my Python code on the screen, but you get my point), and I can pick my favorite or second-favorite cloud provider, whatever it might be, spin this up, and get a really state-of-the-art solution. But of course this all relies on having the training data, and training data that's carefully labeled, curated, and managed according to the problem objectives.

So I'll give an example from a paper that we actually just published in Patterns, based on work with several teams at Stanford Medicine and Stanford Hospital. This is just one of many examples that highlights not just that training data is a bottleneck, but that it's really a very strident one, with orders of magnitude of difference. In this example, one of the several goals of the different data sets was to classify chest X-rays for triaging: a chest X-ray comes in; should it be read urgently, or can it sit in the queue and be read later by a human radiologist? Given a labeled training data set that had taken, in this case, about eight person-months to label, the modeling took a couple of days. Our collaborators downloaded some of the state-of-the-art CNN and other image classifier models, and the variance amongst those models was under a point in the metric we were optimizing for,
ROC AUC, on this binary classification task. Conversely, the tr
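
The character n-gram decomposition that Graham Neubig describes for lexical sharing can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the model from the talk: the talk's method learns embeddings over n-grams, whereas this sketch only measures n-gram overlap. The function names, the `<`/`>` boundary markers, the n-gram range, and the example word pair (Russian "молоко" and Belarusian "малако", both meaning "milk") are all choices made here for illustration.

```python
def char_ngrams(word, n_min=1, n_max=3):
    """Return the set of character n-grams of a word, with boundary markers.

    The '<' and '>' markers and the 1..3 range are assumptions for this sketch.
    """
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(word_a, word_b):
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(word_a), char_ngrams(word_b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical example words: a Russian/Belarusian cognate pair and an unrelated word.
related = ngram_overlap("молоко", "малако")    # "milk" in Russian and Belarusian
unrelated = ngram_overlap("молоко", "книга")   # "milk" vs. "book"
print(f"related pair: {related:.2f}, unrelated pair: {unrelated:.2f}")
```

Words with similar spellings share many n-grams, so a model that builds word representations from n-grams starts with similar representations for them, which is the intuition behind the spelling-similarity step in the talk.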

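The temperature-based sampling heuristic described in Graham Neubig's talk can be sketched in a few lines of Python. This is a minimal sketch, assuming the common convention that a language with n_i training examples is sampled with probability proportional to n_i^(1/T); the talk only says the data size is "exponentiated by a temperature value", so the exact exponent convention, the function name, and the data sizes below are assumptions.

```python
def temperature_sampling_probs(sizes, temperature):
    """Per-language sampling probabilities p_i proportional to n_i ** (1 / temperature).

    temperature = 1 gives proportional sampling; larger values flatten the
    distribution, upsampling low-resource languages.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {lang: n ** (1.0 / temperature) for lang, n in sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical training-set sizes (number of sentences per language).
sizes = {"eng": 1_000_000, "tur": 100_000, "bel": 1_000}

proportional = temperature_sampling_probs(sizes, temperature=1.0)
smoothed = temperature_sampling_probs(sizes, temperature=5.0)

for lang in sizes:
    print(f"{lang}: proportional={proportional[lang]:.4f}, T=5: {smoothed[lang]:.4f}")
```

With a temperature above 1, the lowest-resource language receives a larger share of samples than under proportional sampling. The differentiable data selection approach in the talk replaces this fixed heuristic with sampling probabilities that are learned directly.
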
Original Description

Modern machine learning applications have enjoyed a great boost utilizing neural network models, allowing them to achieve state-of-the-art results on a wide range of tasks. Such models, however, require large amounts of annotated data for training. In many real-world scenarios, such data is of limited availability, making it difficult to translate these gains into real-world impact. Collecting large amounts of annotated data is often difficult or even infeasible due to the time and expense of labelling data and the private and personal nature of some of these datasets. This session will discuss several approaches to address the labelled data scarcity. In particular, the session will discuss work on: (1) transfer learning techniques that can transfer knowledge between different domains or languages to reduce the need for annotated data; (2) weakly-supervised learning where distant or heuristic supervision is derived from the data itself or other available metadata; and (3) techniques which learn from user interactions or other reward signals directly with techniques such as reinforcement learning. The discussion will be grounded on real-world applications where we aspire to bring AI experiences quickly and efficiently to everyone in more tasks, markets, languages, and domains.

Session Lead: Ahmed Hassan Awadallah, Microsoft
Speaker: Ahmed Hassan Awadallah, Microsoft
Talk Title: Bringing AI Experiences to Everyone
Speaker: Marti Hearst, University of California, Berkeley
Talk Title: Summarization without the Summaries
Speaker: Graham Neubig, Carnegie Mellon University
Talk Title: Lessons from the Long Tail: Methods for NLP in the Next 1,000 Languages
Speaker: Alex Ratner, University of Washington
Talk Title: ML Development with Weak Supervision: Notes from the Field
Q&A panel with all 4 speakers

See more on-demand sessions from Microsoft Research's Frontiers in Machine Learning 2020 virtual event: https://www.microsoft.com/en-us/research/event/frontiers-in-mach


The video discusses the challenges of learning from limited labeled data in NLP and presents various techniques and tools to address these challenges, including weak supervision, semi-supervised learning, and cross-lingual transfer learning. The importance of mathematical foundations in ML and the application of fine-tuning techniques are also highlighted.

Key Takeaways
  1. Apply weak supervision to NLP models
  2. Use semi-supervised learning for limited labeled data
  3. Implement cross-lingual transfer learning for multilingual models
  4. Fine-tune pre-trained language models for specific tasks
  5. Combine supervised and unsupervised learning for improved results
💡 The increasing availability of hardware and infrastructure has commoditized the model, features, and structure of the model, but training data remains a significant bottleneck in ML development.
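
The first takeaway, weak supervision, can be made concrete with a small Python sketch of programmatic labeling. This is a simplified illustration under stated assumptions, not Snorkel's API or method: Snorkel combines labeling functions with a learned label model, whereas this sketch swaps in a plain majority vote. The email-priority task, the label names, and the keyword heuristics are all hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions: cheap heuristics that each vote on a label.
def lf_mentions_deadline(text):
    return "HIGH" if "deadline" in text.lower() else ABSTAIN


def lf_asks_for_reply(text):
    return "HIGH" if "please reply" in text.lower() else ABSTAIN


def lf_looks_like_newsletter(text):
    return "LOW" if "unsubscribe" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_deadline, lf_asks_for_reply, lf_looks_like_newsletter]


def majority_vote(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine labeling-function votes; abstain on no votes or on a tie."""
    votes = [label for label in (lf(text) for lf in labeling_functions)
             if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]


emails = [
    "Please reply before the deadline on Friday.",
    "Weekly digest. Click here to unsubscribe.",
    "Lunch tomorrow?",
]
for email in emails:
    print(majority_vote(email), "<-", email)
```

The resulting noisy labels would then be used to train an ordinary supervised model, which is the general pattern behind programmatic weak supervision.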
