Accelerating Multilingual RAG Systems

Microsoft Research · Beginner ·🧠 Large Language Models ·1y ago

Key Takeaways

The talk covers how to accelerate multilingual Retrieval-Augmented Generation (RAG) systems across three stages: retrieval, relevance assessment, and generation evaluation. It presents the MIRACL dataset for multilingual retrieval evaluation and how it was constructed, then discusses relevance assessment with NoMIRACL and answer-generation evaluation with MIRAGE-Bench, along with the models and tools compared at each stage.
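Retrieval quality on the retrieval dataset is reported in the talk as nDCG@10. As a rough illustration of that metric, here is a minimal sketch with binary relevance labels. It is not the official evaluation script; the function names and the `judged_relevances` input are hypothetical and chosen only for this example.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels (ranked order)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_relevances: labels of the retrieved passages, in ranked order.
    judged_relevances: labels of all judged passages for the query (assumed input).
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy example: two relevant passages exist; the system ranks them 1st and 3rd.
    print(round(ndcg_at_k([1, 0, 1], [1, 1, 0]), 4))
```

A score of 1.0 means every relevant passage is ranked ahead of every non-relevant one; queries with no judged relevant passage are scored 0.0 in this sketch, which is a simplifying choice.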

Full Transcript

Yeah, thanks a lot for inviting me; I'm excited to give this talk. So let me start. Today I'll mostly be talking about three of my works, where I'll be focusing on how we accelerate multilingual RAG systems, covering retrieval, relevance, and generation evaluation.

To start with, a bit about me. I did my undergraduate at BITS Pilani, Goa, and then for some time after graduation I was a software engineer at KNOLSKAPE. Back in 2021, sorry, 2019, I switched to being an NLP research assistant at UKP Lab. I'm currently doing my PhD at the University of Waterloo, where I just started my fourth year, and I also did two internships previously, one at Google Research and more recently at Databricks. You can find me on Twitter at this handle, @beirmug, and my email is also mentioned here.

Let's start the talk by looking at native speakers across the world. According to the survey, there were about 7.2 billion people on Earth, and at least 6.3 billion people were included in this survey, out of which 4.1 billion speak one of the 23 most spoken languages as a native tongue. If you see this chart on the right, which I have enlarged, different languages are shown and the native speaker count is given in millions. These include many Indic languages: you can find Hindi, Telugu, Tamil, Bengali, and Marathi as well.

Moving on, let's look at English. English is the small section here which I have shaded. If you focus on English, only 4.6% of speakers worldwide have English as their first language, so only 4.6% are L1 speakers, and the moment you include L1 plus L2 speakers you only come up to [unclear].1 percent. What I really wanted to show is that there is still a huge number of speakers
who are present around the world and who do not speak English.

Yeah, I'm getting some feedback from the conference room, I feel. Yeah, one second. No worries. Yes, that's great, awesome. Sorry about that.

But yeah, the motivation was to show that not all research should be focused on English; we should also focus on different languages, which cater to more speakers across the world.

Here are some of the motivations for why I wanted to work on multilingual RAG; personally, I find these to be important. The first is being more inclusive and providing better resources: what I want to do is provide resources to the community which would hopefully encourage them to drive more research in this direction. The second is more empowerment and participation: what I really like is that if you do something more local, people within our family or extended family can appreciate it more and use these systems in their native language, so this empowers them and also helps them participate in these applications. And third, I want to target a much wider audience: once you do multilingual RAG, you can access documents from multiple different languages, so ideally you have a much richer source of information and you can also target a much wider audience.

Today I'll keep coming back to this diagram, which I've called a kind of Swifties diagram, and which shows how a typical naive multilingual RAG pipeline is constructed. The question is something like "how does Taylor Swift's age affect a relationship", which a user will provide. You first give it to a retriever model, which has certain human-generated documents in a corpus, for example, in our case we are using Wikipedia in a certain language, and
you retrieve the top-k documents. Once you have retrieved your top-k documents for a given question, you use them along with the question in a prompt and provide it to the generator model. After the generator model gets the question and the retrieved documents, you have the RAG answer generation stage, in which the generator model answers using the context present in the top-k docs and the question as well.

The first part I'll be talking about is retrieval: this comes in this area, which covers multilingual retrieval and how we can evaluate different retrieval systems on it. The second part is relevance assessment, where, given a question and a list of top-k retrieved documents, we evaluate whether the generator model really understands whether these top-k documents, as a subset, will be able to answer the question or not. The third part will focus on generation: here I would be evaluating, given the generator model and its answers, how well the answers are formed, whether they cite appropriately, and whether they are fluent or not.

A little bit on why I wanted to work on these projects. This is a text snippet from a really nice blog post which Omar Khattab posted on September 4th; I definitely recommend reading it. What I wanted to focus on is point one, where Omar mentions that ideally you should invest in projects, not papers, and this has been my motivation in all the multilingual RAG work which I'll be presenting today. Informally, I would have named this talk something like "a multilingual RAG universe", where I find all three parts crucial for multilingual RAG, and all three together form the team which is
required to accelerate RAG across different languages.

Moving on now to the first part, where I'll be talking about MIRACL, which is a multilingual retrieval dataset covering 18 diverse languages. This was work done by Crystina Zhang and myself, and some of these slides have been adapted from Crystina's. Like I mentioned, in part one we wanted to focus on retrieval evaluation in RAG. Just to remind you from our diagram, this is the part where you have the question, you do a retrieval, and then you have your top-k retrieved documents, which are then evaluated using a metric. The objective is to see whether a RAG system can retrieve the most relevant source documents or not. The setting we follow in our work is monolingual, which means that both the queries and the documents are in the same language, for example Hindi or some other language. The input for the retriever is a user question, and you need to output a ranked list of passages.

Let's look at an overview and some statistics of the MIRACL dataset. It provides high-quality training and evaluation data for monolingual passage retrieval. It was constructed using large-scale human annotation: it took us about five to six months to recruit and manage all the annotators and construct the whole dataset. It has about 10K hours of work from 31 different annotators who are native speakers of different languages. It contains 18 different languages. What I plot here is the speaker count against the Wikipedia size across different language families. If you see, English is the only one which is on top,
where you have a huge amount of Wikipedia information available, but more or less, for other language families, you can find that there is a roughly linear increase, where more speakers means more information present in Wikipedia. Lastly, just to compare with other datasets, MIRACL contains about 726 thousand human-annotated relevance judgments across all languages, which is much higher than all the other datasets that were present back in 2022, when this work was introduced.

How MIRACL was constructed: we went through the literature and tried to follow what was already done previously. Our construction was very similar to how TyDi QA was constructed, in two phases. The first phase involved query generation and the second one was relevance assessment. Phase one, query generation, in very simple layman's words: you would get a Wikipedia dump in a particular language and convert it into a passage collection. Using that, we formed small snippets of 100 words, which contained the initial information about a Wikipedia article, and these were provided as prompts to the native speakers, who used them to generate queries. So this was phase one, where, given these prompts, the annotators generated questions. One important thing to note is that we wanted the annotators to generate queries which cannot be answered by the prompts themselves: related queries, but not exactly answerable by the prompt itself.

In phase two, we took these queries and ran a hybrid retrieval stage in which we incorporated multiple different models, such as BM25, mDPR, and mColBERT. We retrieved the top 10 candidate
passages and then provided them to our native speakers, who gave us back the relevance judgments for each query-passage pair: whether it's relevant or not relevant. So this is phase two of the MIRACL construction.

MIRACL was introduced back in 2022, and we had a task, a competition around it, at WSDM in 2023 or 2022, and there was a lot of participation from many teams. Ever since, we also find that MIRACL is being well adopted within the community: many multilingual retrieval models, such as Gecko, Embed v3, and the multilingual E5 models, have been using MIRACL either as part of the training dataset or for evaluation. I compiled a few current state-of-the-art models whose numbers I found in their papers. You can see there are different models here, such as the Embed multilingual v3 model by Cohere, BM25, Contriever, [unclear], and similarly the E5 models, and lastly BGE. This is a huge table, basically 16 different languages, all reporting nDCG@10.

Just to give you a TL;DR on what we currently find on MIRACL: BGE-M3 is the model which currently achieves the best performance in terms of nDCG@10. Overall, we find that hybrid search works well for multilingual retrieval: BGE-M3 actually involves dense, sparse, and lexical representations, so we see that this direction of hybrid search is what is best for multilingual retrieval. And unlike what has been seen in English, I still find that commercial APIs are currently not significantly better than the open-source alternatives, at least on MIRACL.

So we talked about retrieval; now let's dig into the second part, relevance assessment. Here I'll be talking about our work,
which is "Knowing When You Don't Know: A Multilingual Relevance Assessment Dataset for Robust RAG". This was recently accepted at EMNLP 2024 in the Findings track. As you remember, in part one we talked about retrieval, so in part two we talk about relevance assessment. Our objective is: given a set of top-k retrieved documents and a question, we want to evaluate whether the LLM can identify relevance in the retrieved context or not. In our case we considered oracle documents, that is, the top 10 passages which got annotated, and used those for our paper. Lastly, we framed the task as binary classification: we evaluate whether an LLM can identify the relevant passages or not.

NoMIRACL, more or less, is about multilinguality in context understanding. Our motivation is that there are often retrieval errors in RAG scenarios, which can give you bad search results, where bad search results means that all of the passages are non-relevant. In these cases, in a real-world scenario, you would either want the generator to output something like "I do not know the answer for the query", or you would want it to retrieve again, but ideally you should be able to capture these cases. If you see the example on the left, the query asked by a user was "what does the AC button on the calculator stand for". Of the top documents, the first one is actually about power electronics and talks about the AC voltage controller, and the second one is about the calculator but doesn't talk about the AC button itself. So in this case you find that a majority of these documents are really not relevant, or cannot be used to answer the question of what does
the AC button on the calculator stand for.

Our research question for the paper was: do LLMs understand relevance across languages? That is, does the model really understand whether your top-k passages contain relevant information to answer the question or not? To study this, we evaluate with a binary classification objective, for which we constructed two subsets: the non-relevant subset and the relevant subset. In the non-relevant subset all the passages are non-relevant, whereas in the relevant subset at least one passage is relevant to the input question.

Like I mentioned, we formed an experimental setup where we frame the relevance evaluation objective as a kind of binary decision tree. So "yes, answer is present" is wrong on the non-relevant subset, because there you really want the model to output "I don't know", whereas on the relevant subset you want it to output "yes, the answer is present", and if the model says "I don't know", that's incorrect. So in the non-relevant subset you expect the LLM to refrain from answering; in the relevant subset you would ideally want the LLM to recognize the relevant passage and provide a valid answer.

In order to measure this, we introduce two metrics. One is the hallucination rate, which I define as errors on the non-relevant subset: if a model says an answer is present on the non-relevant subset, this is considered a hallucination. On the other hand, what I consider to be the error rate is when a model says "I don't know" on the relevant subset, meaning that it's not able to figure out the relevant passages within the context.

I have a question: everything here is monolingual, right? If the question is in English, the relevant documents you'll be evaluating would be in English, right?

That's correct, yes. Our setup is monolingual. I think I'll
talk about that in the end. We started with MIRACL, and in general we had datasets which were present in monolingual form, so I extended that, but ideally I feel like in the future we can also extend this to cross-lingual and multilingual scenarios.

Okay, moving on. How we constructed NoMIRACL: it actually happened during the MIRACL construction itself. To remind you, in phase one of MIRACL we wanted human annotators to construct queries using the small 100-word text for inspiration. What we found was that for a majority of these generated queries, once we did retrieval, all the passages came out to be non-relevant. There were two or three reasons for this. The major reason we found is that the information the query asks for is often out of scope for Wikipedia; second, the query may contain spelling mistakes, which lead to incorrect retrieval. These queries and passages were wasted in MIRACL, because for a retrieval dataset you need relevant passages for a given query, but they were utilized in our case to form the non-relevant subset in NoMIRACL. So we were able to efficiently utilize, and not waste, all the effort which went into annotating these query and passage labels.

Just to remind you, the annotation phase was exactly the same as in MIRACL: in phase one the annotator would generate a query, then it would go through a retrieval system, and you would retrieve the top-k passages. The only difference is that whenever all the passages were labeled as non-relevant, we took that to form the non-relevant subset of NoMIRACL; correspondingly, the relevant subset of NoMIRACL is the MIRACL dataset itself. These are some stats; if you're interested you can look into the paper, but overall we had a huge amount: we have
almost 40 to 50K pairs in total for evaluation. We evaluated 14 different LLMs across five different families. This included both closed and open-source models, like GPT-3.5, GPT-4, GPT-4o, Mistral-7B, Mixtral-8x7B, the Orca models, the Aya model, and the Llama models.

I have another question. Just out of interest, why did you choose Orca? The whole setup for the Orca models is to train a model using synthetic data, so the data itself might have certain problems. Would the results from those models actually translate? Because the rest of the models have a mix of both synthetic data as well as data from the web. Orca is an interesting choice; any reason why you chose that one?

Yeah, so one reason was, I'm not sure exactly when Phi-3 came out, but the NoMIRACL paper was back in 2023, sometime around summer 2023, so I feel like the Phi-3 models were not available at that point in time. At that point Orca was present, and that's the one we chose. For the specific reason why we chose Orca, I currently don't remember. I feel like at that stage we wanted to evaluate different models, and I think at that point Orca was heavily in the news as well, so I just wanted to try it out. But when I move on to the third part of the talk, I do have Phi-3 and its results for the generation part.

Okay, so this is a high-level summary of all the results on NoMIRACL. Here on the x-axis I plot the accuracy on the non-relevant subset, so the closer to 100% the better, and on the y-axis I plot the accuracy on the relevant subset. So if you
actually follow the flag, this is ideally where you would want your models to perform, on this end of the graph: you want them to perform well on both the non-relevant and the relevant subsets. What we found was, first, that a lot of existing models like Llama-2, Orca-2, and Aya would do well on the relevant subset; however, on the non-relevant subset they showed over 80% hallucination rate, indicating that they were unable to identify non-relevant passages or to say that none of the passages can answer the question. Second, interestingly, Mistral, Mixtral, and Llama-3 8B suffer less from hallucination, but they have difficulty identifying the relevant passages: they show over a 40% error rate on the NoMIRACL relevant subset. Third, what we find currently is that GPT-4, GPT-4o, Llama-3 70B, and similarly other large open-source models provide the best trade-off between the relevant and the non-relevant subsets, at least on NoMIRACL.

We also did some prompt optimization, an ablation where we played around with different prompts. The one shown in black was our vanilla prompt. Since we included 10 different passages, our overall sequence length was really high, so we chose a listwise prompt without any exemplars; that is, our vanilla prompt was zero-shot. We played around by repeating the instruction, providing roles, or explaining the prompt. In terms of results, we found that the explanation prompt reduced the hallucination rate on the non-relevant subset on average, whereas the role and repeat prompts reduced the error rate on the relevant subset. So the TL;DR was that prompt optimization led to mixed improvements across different models on NoMIRACL.

I have another question. Usually when people talk about RAG these days, they also talk about the context length of the model, and open-source models now come with larger contexts as well. Do you think that could change or have an impact on these metrics? In the current setup I think you fixed the prompt to the maximum, right? If we had a much larger context, would these metrics have changed for the better or for the worse? I think [unclear] now does have a larger-context model, or Llama also; I don't know whether Llama has it or not. If you have run any recent experiments, what do you think about that?

Yeah, it's a good question. Personally, at least for MIRACL, all our passages were broken into small segments of Wikipedia, which are in general short, and often what I found in retrieval, at least, is that the majority of your answers, or the part which is relevant to the question, is present within the top k words. So I don't think our dataset can really evaluate long context. I do feel that even if you add longer contexts, the results should remain the same for NoMIRACL and also MIRACL, the reason being that right now we truncate each passage to about 380 tokens so that we have an input sequence length of 4K tokens: 380 multiplied by 10 is about 3.8K, and we kept about 100 to 200 as buffer for the instruction. So I do feel that in our case, even if you extend the context length, the results would be similar. But in general it really depends on what dataset you are evaluating. If you are dealing with law or legal corpora, where the PDF documents are huge and you need to reason across different pages in your PDF document, for example, if you have a
100-page-long legal document that you want to refer to and reason over, I feel like for those scenarios long context would definitely be useful. But for MIRACL, I don't see adding more context helping, because I personally think a majority of the relevant information lies in the first 400 words, if that helps.

One last question. With respect to the number of tokens that you fixed, is it constant across every language? Some languages might have a higher token requirement for the same amount of information than others: for example Hindi, or other morphologically rich languages, might require a higher token count to represent the same information that you want to retrieve. Did you face problems there, maybe some trade-offs with the error rate or hallucination rate going up and down simply because of the amount of information you cannot fit into the model?

Yeah, that is a good point, and I do find that, actually. I'm not sure how many of these models have a tokenizer with about a 100K vocabulary size, but in general, for lower-resource languages, like we found for Telugu, the number of tokens required to represent a sentence was much higher. But it didn't affect our scenario much, because the Wikipedia articles in Telugu were also much shorter than in English. In general, if you check a Wikipedia article, it is very well explained in English, but if you try to find the same article in Telugu or some other low-resource language, what I've seen is that it is usually less detailed. Right now I choose a fixed sequence length across all languages, so this can definitely penalize lower-resource languages, meaning that I'm
incorporating less information from each passage. But for some of these models, I don't remember which ones, there was an explicit limit that I could not extend beyond 4096 tokens, so I ended up choosing 4096 just to be fair across all the different models and their settings.

Okay, thanks for the questions. Let me move on to the third part, the generation part, where I'll be talking about MIRAGE-Bench, which is an automatic multilingual benchmark arena for RAG systems. This is recent work which appeared on arXiv, I think a few months ago. Like I mentioned, in parts one and two we talked about retrieval and relevance, so in part three we now focus on generation, where we look more specifically at the RAG answer: we want to evaluate the quality of the generated RAG answers of multiple generator systems, basically different multilingual LLMs. We again consider oracle documents, which were the top 10 passages that got annotated in MIRACL. Another important setting in our case is that we didn't have any gold truth for the answers, so we don't have any human-generated answers.

What I found is that evaluation in RAG benchmarks is typically done in two different ways: there is a community which focuses more on traditional or heuristic-based evaluation, and there is a community which uses LLM-as-a-judge to evaluate different RAG outputs. With heuristic-based features, people handcraft certain metrics or features and use those for evaluation. The pros are that you evaluate your RAG systems across multiple different features or dimensions, and these features are quick to run if you are using more traditional ones. Some cons are that comparison among models is difficult, because you have 10
different features with 10 different numbers across 15 different models, so you really don't know which model is good, and oftentimes you do require the gold truth to be able to compute the traditional metrics. On the other hand, when you use LLM-as-a-judge, the pros are that you don't really need a gold truth and you do not need to handcraft features. But some cons are that it can be really expensive to use highly performing LLMs such as GPT-4 or GPT-4o, and often there are issues with biases of the judge, depending on what kind of judge you use.

So our motivation in the paper was: why not try to combine the best of both worlds? We see that there is traditional evaluation and LLM-as-a-judge; let's try to combine both. How we combine them is that we train a learning-to-rank model as a surrogate judge. It replaces the expensive LLM-as-a-judge, in our case GPT-4o, and it trains on the computationally cheaper heuristic features, to give you an arena-style benchmark in the end.

When I look at my pipeline: in the first stage I generate RAG answers for the queries and given passages, so with all the baseline models I first generate RAG answers. Second, I compute all the different heuristic features given the answer; one example is citation quality, so I compute citation quality across all the different baselines, and similarly for all the other heuristic features. Third, I compute the LLM-as-a-judge score: given the answer from model A versus model B, we use a judge to say which model has the better RAG answer. Then we combine both: the heuristic features are used to train this learning-to-rank model, and the labels come from the
Bradley-Terry model logits, where I use the LLM-as-a-judge scores. With these we are able to train a surrogate judge which can output a leaderboard showing which model does well for multilingual RAG. The good part is that during inference you can get away without doing LLM-as-a-judge: the moment you have a trained learning-to-rank model, you only need to compute the heuristic features and use the model at inference to add newer models to your leaderboard, so it's quicker to run at inference.

As for the heuristic features which I chose for RAG evaluation, we looked into multiple different aspects which have been followed in the literature, both for NLP and for IR systems. We look into the language, whether the system outputs in the correct, expected language or not; the citation quality, whether it cites the relevant passages; support, or faithfulness, which means whether the sentences which carry citations are really supported by information present in the cited passage; the reranker score, which is computed between the cited passages and the query; answer overlap, which computes the overlap between the GPT-4 answers and the system response, since in our case we didn't have any gold truth for the answer and are using the GPT-4 answer as the reference; and lastly we also compute fluency as a metric.

On the other hand, for the LLM-as-a-judge features, we evaluated in a pairwise setup with GPT-4o. The computational cost for this is O(n²) if you do it exhaustively, because you compare each pairwise combination of the n models which you are evaluating. In the prompt, if you see, we ask it to consider different features such as correctness, helpfulness, completeness, accuracy, and so on and so
forth, and we provide the user question, the reference documents, and the answers of both models. We ask the judge to say whether A is better, give its explanation, and then give a final verdict: A is better, B is better, or C for a tie. So it's more or less similar to a chain-of-thought reasoning prompt.

We also tried pointwise and listwise evaluation internally, because both of them are much more efficient than the pairwise setup. The reason pointwise didn't work was that many of these baselines would get an identical score: for example, we evaluated about 18 models and 15 of them achieved a score of 4.0. Listwise was even more difficult, because it involves ranking all of the models in a single list, and that was harder for the judge to order. So what worked best in our case was the pairwise setup using GPT-4o as a judge.

As for how we designed and trained the surrogate judge, in the beginning we used a random forest for simplicity; it can also be trained within minutes on a CPU. We also had to do bootstrapping. The reason is that we had only a small number of pairs which we were able to judge using LLM-as-a-judge: about 50 to 100 queries or query-passage pairs, 502 50 queries, so this means almost 5,000 query-passage pairs, or, sorry, it depends upon the model pairwise combinations. But yeah, we use bootstrapping to generate more training data, and it also helps us better estimate the variance when we are doing the rank prediction. We also use the Bradley-Terry model, which was used in LMSYS to assign different scores based on pairwise comparisons. You can have a look at the Bradley-Terry model: it gives you the log odds of model i beating model j,
using this probability p_ij. If you're interested, you can look through it in our paper. What we do is run multiple tournaments: for example, we conduct 200 tournaments in which we make all the models fight each other, and for each tournament we bootstrap a certain set of queries. Using that bootstrapping we get the final ranks, which are then averaged over the number of tournaments we play.

Some experimental details: MIRAGE-Bench was constructed using the MIRACL dataset itself. We had queries and oracle-judged documents from MIRACL, which is really high-quality data that we invested a lot of time in constructing. MIRACL evaluated the retrieval part; just to remind you, in MIRAGE-Bench we are evaluating the answer part. We break the RAG answer into a reasoning and an answer, so you see there are two places, and this is what we use for evaluation in MIRAGE-Bench.

To answer the question from before, we did evaluate more recent models, because this work is only a few months old. We evaluated GPT-3.5, GPT-4, and GPT-4o, almost all Mistral variants, the Phi-3 instruct models, one of which I think is multilingual, and also the Command R, Command R+, and Aya models, Llama 3, Gemma 1.1, and the Qwen 2 models.

I have a question on the last one. Did you use Mixtral? It's a base model, right? So the instruction-following capability of that model would probably not exist.

Yeah, I think I used the instruct model for this one; I just forgot to add the "instruct" in front of it.

Got it, yeah, makes sense.

Sorry about that, that's a typo on my side. But just to remind other people, I should have mentioned this: all models are the instruct versions. I've used the instruct versions for all models, both for NoMIRACL as well as
MIRAGE-Bench.

So, the arena-based results. This shows the rank of the different models, arranged on average from top to bottom. Here a lower number means a better rank, which is why it comes on top. As an overall summary, what we find is that closed-source models like GPT-4 and GPT-4o, and Llama 3 and Mixtral, achieve the topmost ranks. Mostly we found the rankings to be stable, but there were some interesting observations: Phi-3 small performed really well across German, English, and Spanish, and Gemma 1.1 performed really well on Telugu, Swahili, and Thai. In orange I've highlighted these two parts, where they achieve a much higher rank in comparison to other models.

On the right-hand side I plot the rankings which we get from the surrogate judge. I will not go through this in detail, but overall we get a Kendall tau correlation of 0.909, which is a really high correlation and shows that the surrogate judge works well, at least for this case.

We had some interesting ablations. The first one was feature importance in the surrogate judge. Among the heuristics, we found that the answer overlaps were the most important features, and also the fluency. This is very interesting; maybe I can discuss later why I think this might be the case. But in general the surrogate judge, which is a random forest model so you can get the feature importances, gives a higher weight to these features. We also tried an ablation with different feature combinations. What we found works best is that if you remove the very low-correlation features, you get the best surrogate model, instead of training on all features. So
coming up with your handcrafted features, and using only those which you think are important, works well for training the surrogate judge.

Third, we did an exhaustive comparison. We had 19 models, so we did 19-choose-2 combinations for each query. What we wanted to see was whether we can use fewer queries, or fewer pairwise combinations, and how well that would correlate with the original gold ranking. In general, our findings were that the more queries the better, and the more pairwise comparisons you do, up to an exhaustive comparison, the better you are at estimating the rank performance of the model.

Lastly, we also tried some fine-tuning on the MIRAGE-Bench training dataset. I tried with three different models; I think I took the Mistral v0.2 Instruct 7B one and the Llama 3 8B. What I found works really well is the Mistral 7B synthetically fine-tuned: I did SFT on the GPT-4o distilled training data, and it gets position two, which does outperform Llama 3 on the six languages which I used for my ablation. This shows that the training data is also useful to improve the smaller models and to get better on the dataset itself.

So let's now summarize our findings and briefly talk about future work. Just to recap, we talked about three parts of the whole multilingual RAG pipeline: retrieval was the first part, the second part was relevance assessment, and the third part went into generation.

Some highlights and key takeaways. First is data reusability: we constructed a high-quality dataset in MIRACL and reused it for both NoMIRACL and MIRAGE-Bench. This is what I found was a crucial point in
our whole research. For retrieval, we found that hybrid search techniques currently dominate on multilingual retrieval, and the commercial APIs we tested were not significantly better than the open-source alternatives. For relevance assessment, we found that multilingual LLMs as a broader category, I think the older ones, were unable to identify non-relevant passages. Newer and more recent models have better reasoning capabilities, so this is improving, and closed and large open-source models performed best. For the generation part, what we saw is that you can look into using surrogate judges, which are much better in terms of cost and can be used to approximate LLM-as-a-judge rankings. We also found that on the benchmark, closed and large open-source models again perform the best, and fine-tuning can be useful to improve the performance of smaller open-source LLMs.

Conclusions and future work. What I'd like to say is that the whole point was to show you a longer project; this ran across at least two years or so. So you should invest in longer projects and not just papers, so we reutilize materials.

One direction for future work: since our current scope was only Wikipedia, we really want to extend this to different, more realistic domains. But what I've often found is that finding multilingual data with uniformity is hard: for example, news data is widely available for certain languages but not so much for others, so finding this uniformity in resources is often challenging.

Cross-linguality is what we really want to explore after this. All three papers delve into the monolingual setting, but ideally we want to extend to cross-lingual settings where, for example, given a crème
brûlée recipe which I ask for in Bengali, I would hopefully find the best information in some French articles. Ideally I would want that sort of cross-lingual RAG setup, where your query X can be in any language and your documents Y can be in any language.

Third, these findings help accelerate or improve the multilingual LLMs for RAG. I feel like these works have been a motivation to see how well existing models perform and how we can improve them.

Lastly, what I personally would hopefully do is continue investing, because a lot of this needs more effort towards benchmarking, maintaining code and leaderboards, and continually checking progress. For instance, once we evaluated Llama 3 for NoMIRACL, we found that it was much, much better than Llama 2, so the findings do change with each subsequent version of the models.

So yeah, this was overall my talk. Sorry, we have only very few minutes for questions, but thanks a lot for participating, and in case you have any questions, feel free to ask.

Yes, has a question.

Hey, hi, thanks for a great talk, this was really insightful. Maybe I missed this in your presentation, but did you see big differences between the performances in the different models, sorry, between performances in different languages, in all three settings that you evaluated? For example, we do expect that the low-resource languages will perform poorly and the higher-resource ones will perform well, so I just wanted to ask if that's consistent with what you saw.

Yeah, that is consistent. What I personally found is that the Gemma models in general work well for Indic languages; this is one of my personal observations. But in general I've seen consistency in terms of models performing
well across high-resource languages, whereas sometimes lagging in low-resource languages. But there are some caveats here. One caveat is that for a low-resource language such as Swahili, the Wikipedia corpus is only 50k documents, whereas English has about 15 million documents or something. So I sometimes wonder, let's say we had a big enough corpus in Swahili, whether the findings would change or not. But it could also be that the information present in lower-resource languages is very dense and not really sparse, because they only talk about a few things. So maybe this can be one of the reasons why certain models do well: for example, if someone looked at the whole Wikipedia corpus of Swahili, which is just 50k documents, a model can easily remember what is there, versus if someone had to do that for English, which contains much more information in general. So I'm not sure; memorization might be one thing we cannot evaluate at this point in time, but I feel it would be a vital one to check, to see whether it is because of advanced reasoning or just memorization on lower-resource languages.

Yeah, thank you.

Yeah, thanks a lot. Now, do we have any other questions? Already short on question. Cool, all right.
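
The tournament procedure described in the talk (pairwise judge verdicts, Bradley-Terry scoring, bootstrapped queries, ranks averaged over tournaments) can be sketched roughly as follows. This is a minimal illustration, not the implementation from the paper: the function names, the toy verdicts, the 0.5 smoothing pseudo-count, and the iterative fitting routine are all assumptions made for the example. It uses the standard Bradley-Terry form, in which model i beats model j with probability p_i / (p_i + p_j).

```python
import random


def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is how often model i beat model j (ties count 0.5 each way).
    Returns strengths p that sum to 1, using a simple fixed-point update.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


def bootstrap_mean_ranks(judgments, n_models, n_tournaments=200, seed=0):
    """Average each model's rank over bootstrapped tournaments.

    judgments maps a query id to a list of (i, j, score) verdicts, where
    score is 1.0 if model i won, 0.0 if model j won, and 0.5 for a tie.
    """
    rng = random.Random(seed)
    query_ids = sorted(judgments)
    rank_sums = [0.0] * n_models
    for _ in range(n_tournaments):
        # Hypothetical smoothing: a 0.5 pseudo-win per ordered pair keeps
        # every strength positive even if a model never wins.
        wins = [
            [0.0 if i == j else 0.5 for j in range(n_models)]
            for i in range(n_models)
        ]
        # Resample queries with replacement for this tournament.
        for q in rng.choices(query_ids, k=len(query_ids)):
            for i, j, score in judgments[q]:
                wins[i][j] += score
                wins[j][i] += 1.0 - score
        strengths = fit_bradley_terry(wins)
        order = sorted(range(n_models), key=lambda m: -strengths[m])
        for rank, m in enumerate(order, start=1):
            rank_sums[m] += rank
    return [s / n_tournaments for s in rank_sums]


# Toy verdicts (made up): model 0 beats 1 and 2, model 1 beats 2, on 20 queries.
toy = {q: [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0)] for q in range(20)}
mean_ranks = bootstrap_mean_ranks(toy, n_models=3, n_tournaments=50)
```

A lower mean rank means a stronger model, matching the convention used for the leaderboard in the talk. In the setup described above, the per-model Bradley-Terry scores would then serve as training labels for the surrogate judge.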

Original Description

As Retrieval-Augmented Generation (RAG) systems gain prominence for grounding large language models (LLMs) in external knowledge, constructing evaluation frameworks is critical in accelerating developments across multiple diverse languages. This talk introduces a comprehensive multilingual RAG evaluation pipeline comprising three key components: retrieval, relevance assessment, and generation. It covers three resources: MIRACL, a multilingual retrieval dataset with high-quality relevance judgments annotated by native speakers; NoMIRACL, a benchmark for assessing relevance in multilingual RAG, designed to measure LLM robustness against retrieval errors; and MIRAGE-Bench, an arena-based multilingual RAG evaluation framework integrating both heuristic metrics and surrogate judge models for multilingual generation evaluation. Together, these resources provide a foundation for advancing multilingual information access and enhancing the robustness of RAG systems. This talk highlights key findings from each section, challenges, and future work for multilingual RAG research. Speaker: Nandan Thakur, University of Waterloo, Canada

The video introduces a comprehensive multilingual RAG evaluation framework and presents the MIRACL dataset for multilingual retrieval evaluation. It covers the construction of the MIRACL dataset, retrieval evaluation in RAG, and the use of various models and tools for multilingual RAG systems. The video aims to accelerate the development of multilingual RAG systems and provide resources to non-English speaking communities.

Key Takeaways
  1. Construct a multilingual RAG pipeline
  2. Evaluate LLMs using the MIRACL dataset
  3. Optimize prompts for multilingual RAG systems
  4. Use hybrid search techniques for multilingual retrieval
  5. Develop multilingual LLMs for low resource languages
💡 Hybrid search techniques dominate multilingual retrieval, and commercial APIs are not significantly better than open-source alternatives.
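
One common way to build a hybrid search is to fuse a sparse (keyword) ranking with a dense (embedding) ranking using reciprocal rank fusion. The sketch below is only an illustration of that general idea and is not necessarily the fusion method used in the talk; the function name, the document ids, and the constant k=60 are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) in every list where it appears,
    and the scores are summed across lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up rankings from a sparse and a dense retriever.
sparse = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse, dense])
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still gets partial credit.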
