Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala

📰 ArXiv cs.AI

arXiv:2601.14958v3

Abstract: The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinh

Published 11 May 2026
