Abstract: Large language models (LLMs) show promise in legal question answering (QA), yet Thai legal QA systems face challenges due to limited data and complex legal structures. We introduce ThaiLegal, a novel benchmark featuring two datasets: (1) the ThaiLegal-CCL Dataset, covering Thai financial laws, and (2) the ThaiLegal-Tax Dataset, containing Thailand's official tax rulings. Our benchmark also provides specialized evaluation metrics suited to Thai legal QA. We evaluate retrieval-augmented generation (RAG) and long-context LLM (LCLM) approaches along three key dimensions: (1) the benefits of domain-specific techniques such as hierarchy-aware chunking and cross-referencing, (2) the comparative performance of RAG components (e.g., retrievers and LLMs), and (3) the potential of long-context LLMs to replace traditional RAG systems. Our results reveal that domain-specific components yield only slight improvements over naive methods, while existing retrieval models still struggle with complex legal queries and long-context LLMs show limitations in consistent legal reasoning. Our study highlights current limitations in Thai legal NLP and lays a foundation for future research in this emerging domain.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, legal NLP, datasets for low resource languages, retrieval-augmented generation, domain adaptation, logical reasoning
Contribution Types: Data resources, Data analysis
Languages Studied: Thai
Submission Number: 3272