Multilingual and cross lingual embeddings - Nils Reimers

Cohere · Beginner ·📄 Research Papers Explained ·3y ago

Key Takeaways

Nils Reimers discusses multilingual and cross-lingual embeddings, highlighting the limitations of traditional search methods like BM25 and the benefits of using multilingual embeddings for handling multiple languages with a single pipeline. He also talks about the importance of preserving country-language specific properties and addressing language bias in models.
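The per-language lexical pipeline that the talk describes as painful to build can be sketched roughly as follows. This is a minimal illustration, not how Elasticsearch or any real engine works: the `PIPELINES` table, the suffix-stripping "stemmer", and the `MultiLanguageIndex` class are all hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical, tiny per-language resources (assumption for illustration).
# Real systems need a proper tokenizer, stop word list, and stemmer per language.
PIPELINES = {
    "en": {"stop_words": {"the", "a", "to", "how"}, "suffixes": ("s",)},
    "de": {"stop_words": {"der", "die", "das", "wie"}, "suffixes": ("en", "e")},
}


def analyze(text, lang):
    """Whitespace-tokenize, drop stop words, and strip one suffix, per language."""
    config = PIPELINES[lang]
    tokens = []
    for word in text.lower().split():
        if word in config["stop_words"]:
            continue
        for suffix in config["suffixes"]:
            # Crude stand-in for a stemmer: strip a suffix from longer words.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


class MultiLanguageIndex:
    """One inverted index per language, as lexical search requires."""

    def __init__(self):
        self.indices = {lang: defaultdict(set) for lang in PIPELINES}

    def add(self, doc_id, text, lang):
        for token in analyze(text, lang):
            self.indices[lang][token].add(doc_id)

    def search(self, query, langs):
        # The caller must already know (or guess) which language indices to hit.
        hits = set()
        for lang in langs:
            for token in analyze(query, lang):
                hits |= self.indices[lang].get(token, set())
        return hits


index = MultiLanguageIndex()
index.add("en-1", "how batteries die", "en")
index.add("de-1", "die steuern", "de")

print(analyze("the docs", "en"))           # ['doc']
print(index.search("die", ["de"]))         # set(): "die" is a German stop word
print(index.search("die", ["en", "de"]))   # {'en-1'}
```

The query "die" shows the ambiguity mentioned in the talk: without knowing the query language, the system has to query several indices, and each one analyzes the same string differently.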

Full Transcript

So the final topic, or second-to-last topic, is multilingual search. That's the most recent project I've been working on at Cohere: a multilingual embedding approach.

So far the dominant method in search is lexical search, like BM25. But if you have a multilingual data set, let's say you create some platform like Reddit or Twitter, where people post in all types of languages, and you want to provide a search function for all these languages, it's really, really ugly to build. First you need to do language identification on docs and queries, because every language needs a different tokenizer. In English you can just tokenize on whitespace and then index, but this does not work for Chinese, so for Chinese you need a different tokenizer than for English, and so on. The stop word lists must also be different: in English you have certain stop words, but these stop words are different for French, German, Chinese, Russian, Turkish, Arabic. What you also often do is stemming, where you reduce words to the stem, like the plural "docs" goes to "doc", and that's also language specific. So every document has a different pipeline: a different tokenizer, different stop words, a different stemmer.

For every language you need a different index, so you have an index for all the Arabic documents, or the Chinese documents, or the English documents. Then if I enter a search query, I first need to know what language this query is in, and this can also be challenging because there are words which are ambiguous. If you take the English word "die", in German the word is written the same way, as an article. So just from the word you don't know what the language is, and so you don't know which index to hit: do I have to search the English index or the German index? You might need to query multiple indices. And out-of-the-box systems like Elasticsearch only support a few languages, because it's really, really painful to create these tokenizers, stop words, and stemmers for every language.

With multilingual embeddings it's quite easy. You take the model, you take the text, pass it through a transformer network, and you get an embedding out of it. You don't need to do any language identification, you don't need to have any stemming, stop words, and so on, so everything can be the same pipeline.

However, there has been previous work on multilingual embeddings, and a big challenge was the lack of training data. What people did before is they used translated sentences from the neural machine translation community and trained models on this, or aligned English models on this. But here we see that the models only work well on the sentence level. Another line of research is that people use machine translation like Google Translate to translate query-answer pairs to other languages, but here the models don't learn language- and country-specific properties. People have been using MS MARCO, which has a lot of questions on how to do taxes in the US and how to file certain forms, and then translated this to German. So now the model knows in German how I could file my taxes in the US, but the model has absolutely no idea how to do taxes in Germany and how to file all the forms that are required in Germany to do my taxes. And as mentioned before, models are really bad in out-of-domain settings, so if I hit the model and ask some German-specific question, how do I file taxes or how do I get a tax refund, the model has absolutely no idea how to do it.

So what we did at Cohere, we said: okay, data is the key. We collected large quantities of training data, not from machine translation but actually written by people of that language. So for example Germans asking questions about how to file taxes in Germany, together with the answers written by Germans on how to do the taxes or how to file certain forms in Germany. Overall we collected up to 500 million non-English pairs, which we carefully curated, cleaned, extended, augmented, and so on, and which cover a lot of different topics. So we hope that the models are really broadly applicable and can find relevant information across all these languages, to minimize the gap that these models have on unknown words and unknown domains.

Of course we also put it to a test in different settings: clustering and search in English, search on non-English languages, and cross-lingual classification, where we have training data just in English and I want to do classification in other languages. The biggest improvement we saw was for search in non-English, where previous methods all worked on a sentence level, and here we see a really big boost of performance on the non-English languages.

One application which is absolutely amazing with these embedding approaches is cross-lingual search. You can type the query in any of the 100 languages. Here it's Arabic. I'm not speaking Arabic, my colleague Amer was so kind to provide this, but he said it's supposed to ask the question "What's the capital of the United States?" You search the English Wikipedia on this, and obviously Elasticsearch has no idea what this Arabic garbage is. But if you do semantic search, it has no problem matching this, because this text is mapped to a vector, the vector is really close to the English "What is the capital of the United States?", and then you get the perfect match to "Washington, D.C. is the capital of the United States".

Another aspect here is language bias if you create these models. Some models have a strong language bias, meaning they prefer certain language combinations. For example, if you take the LaBSE model, you see all the Russian points are in the left corner and the English points in the right corner. For other models it's more mixed together and there's no strong separation between languages.

So is language bias good or not? A side effect of language bias is that same-language results are ranked higher just because of the language. So if I search "What's the capital of the United States?" in Arabic, I preferably find Arabic search results in a multilingual corpus, even though maybe the English hit is a better answer to my question. So you could think that a model without language bias is nicer, that it finds you the perfect document, the perfect information, regardless of language.

But here the challenge comes with things that are specific to languages, where languages are also really tightly coupled with countries. For example, if I search in English for "wedding" in an image search system, I get some happy picture like this. That's the traditional Western picture of how a wedding should look: the bright white dress, and the man in the tuxedo, a black suit, white skin. If the model does not know about the language I'm searching in, and I search with the Hindi word for wedding (I'm not speaking Hindi, but Google Translate told me that's the Hindi word for wedding), and the vector space does not have any information about the language, so it just sees the content in terms of "wedding", it will retrieve the same result. And here it's doubtful whether people that search in Hindi for "wedding" would really be interested in getting a Western picture of a wedding. Presumably such a person would be more interested in weddings as they are typically celebrated in India.

Similarly, "who's the president": here we assume Joe Biden links to "president". But if you ask in French "who's the president", you're probably not so much interested in the US president; you're probably more interested in the French president. Or maybe, I don't know, there are a lot more French-speaking countries, maybe you're interested in the president of the respective country. So there's the question: how can you still preserve these country- and language-specific properties if you don't have a language bias in the model?
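The cross-lingual search example from the talk can be sketched as a nearest-neighbour lookup over embeddings. The vectors below are hand-written stand-ins, not output from Cohere's or any other real model; `TOY_EMBEDDINGS`, `embed`, and `search` are hypothetical names for this illustration. In practice `embed` would call a multilingual embedding model.

```python
import math

# Assumption: hand-written 3-dimensional vectors standing in for a
# multilingual embedding model. Texts with the same meaning get nearby
# vectors regardless of language, which is what such a model aims for.
TOY_EMBEDDINGS = {
    "ما هي عاصمة الولايات المتحدة؟": [0.88, 0.12, 0.05],
    "Washington, D.C. is the capital of the United States.": [0.85, 0.20, 0.00],
    "Paris is the capital of France.": [0.30, 0.90, 0.10],
    "How do I file a tax return?": [0.00, 0.10, 0.95],
}


def embed(text):
    """Look up the toy vector; a real system would run the model here."""
    return TOY_EMBEDDINGS[text]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, documents, top_k=1):
    """Rank documents by similarity to the query. No language ID is needed."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


ARABIC_QUERY = "ما هي عاصمة الولايات المتحدة؟"  # "What is the capital of the United States?"
ENGLISH_DOCS = [
    "Paris is the capital of France.",
    "How do I file a tax return?",
    "Washington, D.C. is the capital of the United States.",
]

print(search(ARABIC_QUERY, ENGLISH_DOCS))
```

With these toy vectors the Arabic query retrieves the English sentence about Washington, D.C. There is one pipeline and one index for every language; the quality of the result depends entirely on the embedding model.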

Original Description

Sentence Transformers and Embedding Evaluation - Talking Language AI Ep#3
Full episode: https://youtu.be/apuDeylm1uE

About The Speaker: Nils is the creator of Sentence-BERT and has authored several well-known research papers, including Sentence-BERT and the popular Sentence Transformers library. He’s also worked as a Research Scientist at HuggingFace, (co-)founded several web companies, and worked as an AI consultant in the area of investment banking, media, and IoT.

In our conversation, Nils gives us an introduction to the Sentence-BERT package and the large language models provided in it. He also shares some lessons from his experience in open-source development of such a popular package. Finally, Nils touches on his research collaborations on how to evaluate embeddings through works like MTEB: Massive Text Embedding Benchmark and BEIR. To go deeper into these tools, and other concepts around embeddings, watch the video and join the conversation on Discord. Stay tuned for more episodes in our Talking Language AI series!

Join the Cohere Discord: https://discord.gg/co-mmunity
Discussion thread for this episode (feel free to ask questions): https://discord.com/channels/954421988141711382/1052547510910062624
Watch more episodes of Talking Language AI: https://www.youtube.com/playlist?list=PLLalUvky4CLJ9ZgtZguDJ7dAYuI1bfaYW

Resources:
Bonjour. مرحبا. Guten tag. Hola. Cohere's Multilingual Text Understanding Model is Now Available: https://txt.cohere.ai/multilingual/
SBERT: https://www.sbert.net/
SBERT Paper: https://arxiv.org/abs/1908.10084
MTEB: Massive Text Embedding Benchmark: https://arxiv.org/abs/2210.07316
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models: https://openreview.net/forum?id=wCu6T5xFjeJ
SetFit - Efficient Few-shot Learning with Sentence Transformers: https://github.com/huggingface/setfit

This video discusses the challenges of multilingual search and the benefits of using multilingual embeddings. Nils Reimers highlights the importance of preserving country-language specific properties and addressing language bias in models. He also describes how Cohere's multilingual model was trained on human-written non-English question-answer pairs and tested on clustering, search, and cross-lingual classification.

Key Takeaways
  1. Collect human-written non-English question-answer pairs instead of relying only on machine-translated data
  2. Use cross-lingual search to match a query in one language to relevant documents in another
  3. Expect the largest gains on non-English search, alongside clustering and cross-lingual classification
  4. Use a single embedding pipeline for all languages, with no language identification, stop words, or stemming
  5. Check models for language bias, which ranks same-language results higher just because of the language
💡 Preserving country-language specific properties matters for multilingual models; how to preserve them in a model without language bias is left as an open question.
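
The ranking effect of language bias can be illustrated with a toy model. This is a sketch under strong assumptions: the `embed` function below simply appends a language component scaled by a made-up `language_weight` to a content vector. It is not how LaBSE or any real model represents language; it only shows why a same-language document can outrank a better answer in another language.

```python
import math

LANGUAGES = ("ar", "en")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(content, lang, language_weight):
    """Hypothetical embedding: content vector plus a scaled language one-hot.

    language_weight = 0.0 models an embedding space with no language bias;
    larger values pull same-language texts closer together.
    """
    language_part = [language_weight if lang == code else 0.0 for code in LANGUAGES]
    return list(content) + language_part


# An Arabic query, an English document that answers it exactly, and an
# Arabic document that is only loosely related (toy content vectors).
QUERY = ([1.0, 0.0], "ar")
DOCUMENTS = {
    "en-exact-answer": ([1.0, 0.0], "en"),
    "ar-loosely-related": ([0.6, 0.8], "ar"),
}


def top_hit(language_weight):
    """Return the id of the best-ranked document for the given bias level."""
    query_vec = embed(*QUERY, language_weight)
    return max(
        DOCUMENTS,
        key=lambda doc_id: cosine(query_vec, embed(*DOCUMENTS[doc_id], language_weight)),
    )


print(top_hit(0.0))  # en-exact-answer
print(top_hit(1.0))  # ar-loosely-related
```

Without the language component the English exact answer wins on content alone. With it, the loosely related Arabic document ranks first, which is the side effect described in the talk. The same component is also what would let a system return language- or country-specific results, so removing it entirely is a trade-off.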
