Pedro Ortiz Suarez & Laurie Burchell – Expanding Linguistic and Cultural Coverage in Common Crawl

Cohere · Intermediate ·🧠 Large Language Models ·1w ago
The Common Crawl Foundation (CCF) provides the largest open corpus of web data, enabling a wide range of scientific and technical applications including large language model (LLM) development. However, our current data processing pipeline faces challenges when processing multilingual data, decreasing language representation and impacting downstream model performance. In this talk, we will discuss CCF’s initiatives to improve multilingual coverage and language identification of our web corpus. These efforts include soliciting crowd-sourced web seeds for under-served languages, running the First Workshop for Multilingual Data Quality Signals at COLM 2025, and creating CommonLID, a community-driven, human-annotated language identification benchmark for the web domain. Throughout, we emphasise the collaborative nature of our efforts, working in partnership with members of the NLP community to improve content available in their languages.

Pedro Ortiz Suarez is a Principal Research Scientist with the Common Crawl Foundation. Pedro is a mathematician, computer scientist and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université. Pedro's research has mainly focused on how data quality impacts ML models' performance and how to improve these models through data-driven approaches.

Laurie Burchell is a Senior Research Engineer with the Common Crawl Foundation. They hold a PhD in Natural Language Processing from the University of Edinburgh, focusing on fast, high-coverage language identification and its part in creating reliable multilingual corpora. Laurie is especially interested in using data-driven approaches to make language technologies as multilingual as possible.

This session is brought to you by the Cohere Labs Open Science Community - a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. We'd like to extend a special thank you to Kato Ste
