Language corpora for the Dutch medical domain

📰 ArXiv cs.AI

arXiv:2604.25374v1 Announce Type: cross Abstract: Background: Dutch medical corpora are scarce, which limits NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents and is freely available on Hugging Face. Conclusion: This work establishes the first larg

Published 29 Apr 2026
