Chat2Find Publishes 255M+ Token Sri Lankan Trilingual AI Corpus on Hugging Face and LankaData

📰 Medium · LLM

Explore the Chat2Find Corpus, a trilingual AI dataset of more than 255M tokens for Sri Lankan languages, now available on Hugging Face and LankaData.

Intermediate · Published 12 Apr 2026

Action Steps

Access the Chat2Find Corpus on Hugging Face
Explore the dataset's metadata and documentation on LankaData
Apply the corpus to fine-tune LLMs for Sri Lankan languages
Use the dataset to train and evaluate NLP models
Compare the performance of models trained on this corpus with models trained on other corpora
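
As a starting point for the steps above, here is a minimal sketch of streaming the corpus from the Hugging Face Hub and checking how its text is split across languages. Several details are assumptions and not taken from the announcement: the repository ID is a placeholder (the article does not state it), the column names `text` and `language` are hypothetical, and the language codes in the sample are illustrative only. Token counts here are plain whitespace splits, so they will not match the published 255M+ figure, which depends on the tokenizer used.

```python
from collections import Counter
from typing import Iterable, Mapping

# Placeholder only: the article does not give the Hugging Face repository ID.
REPO_ID = "<org>/<chat2find-corpus>"


def language_breakdown(
    records: Iterable[Mapping],
    lang_field: str = "language",  # assumed column name
    text_field: str = "text",      # assumed column name
) -> Counter:
    """Tally whitespace-separated tokens per language label."""
    counts: Counter = Counter()
    for rec in records:
        lang = rec.get(lang_field, "unknown")
        counts[lang] += len(str(rec.get(text_field, "")).split())
    return counts


def stream_corpus(repo_id: str = REPO_ID, split: str = "train"):
    """Lazily stream records from the Hub (needs `pip install datasets`)."""
    # Imported here so language_breakdown stays usable without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for real corpus rows.
    sample = [
        {"language": "si", "text": "one two three"},
        {"language": "ta", "text": "one two"},
        {"language": "en", "text": "hello there"},
        {"language": "en", "text": "hi"},
    ]
    print(language_breakdown(sample))
```

Once the real repository ID and column names are known from the dataset card, `language_breakdown(stream_corpus())` would give a rough per-language balance before committing to a fine-tuning run.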

Who Needs to Know This

NLP engineers and researchers can use this dataset to improve language models for Sri Lankan languages. Data scientists can apply it to other NLP tasks in those languages.

Key Insight

💡 The Chat2Find Corpus is a large-scale trilingual conversational dataset for Sri Lankan languages, giving model builders more training and evaluation data for those languages.