DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona

📰 ArXiv cs.AI

DALDALL is a persona-based data augmentation framework that uses LLMs to generate lexically and semantically diverse data for legal information retrieval (IR)

Advanced · Published 25 Mar 2026

Action Steps

Leverage LLMs to generate persona-based synthetic data
Apply domain-specific strategies to prioritize quality over quantity
Use the generated data to augment existing legal datasets
Evaluate the performance of legal IR models using the augmented dataset
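
The first three steps above can be sketched roughly as follows. This is a minimal illustration, not the DALDALL implementation: the persona list, the prompt wording, the `generate` callable (a stand-in for whatever LLM client you use), the record fields `query` and `doc_id`, and the Jaccard-overlap filter used as a simple quality gate are all assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical personas; the summary does not list the ones DALDALL uses.
PERSONAS: Dict[str, str] = {
    "layperson": "a member of the public with no legal training",
    "paralegal": "a paralegal preparing case research",
    "judge": "a judge who uses precise statutory language",
}


def build_prompt(persona_desc: str, query: str) -> str:
    """Build an (assumed) instruction asking the LLM to rephrase a query in persona."""
    return (
        f"You are {persona_desc}. Rewrite the following legal search query "
        f"in your own words, keeping its legal meaning:\n{query}"
    )


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def augment(
    dataset: List[dict],
    generate: Callable[[str], str],
    max_overlap: float = 0.8,
) -> List[dict]:
    """Return the original records plus persona-based rewrites of each query.

    A rewrite is kept only if it is non-empty and not a near-duplicate
    (Jaccard overlap above max_overlap) of the original query or of a rewrite
    already kept for the same record. This is a crude stand-in for
    "quality over quantity", not the paper's filtering strategy.
    """
    augmented = list(dataset)
    for record in dataset:
        kept = [record["query"]]
        for name, desc in PERSONAS.items():
            candidate = generate(build_prompt(desc, record["query"])).strip()
            if not candidate:
                continue
            if any(jaccard(candidate, k) > max_overlap for k in kept):
                continue
            kept.append(candidate)
            augmented.append(
                {"query": candidate, "doc_id": record["doc_id"], "persona": name}
            )
    return augmented
```

To try it, pass any function that maps a prompt string to a completion string as `generate`. Each synthetic query inherits the `doc_id` of the query it was derived from, so the augmented set can be used directly as extra query-document pairs when training or evaluating a retriever.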

Who Needs to Know This

NLP researchers and legal domain experts can use this framework to improve the quality and diversity of their datasets; ML engineers can apply it to build more accurate legal IR models

Key Insight

💡 Domain-specific data augmentation strategies can improve the quality and diversity of legal datasets