Abstract: Citation-sensitive legal question answering in low-resource settings, such as Thai law, poses unique challenges for large language models (LLMs). We investigate how to align LLMs for citation-sensitive legal question answering in Thai using Group Relative Policy Optimization (GRPO). Focusing on affordable alignment, we compare semantic similarity-based reward proxies against large LLM judge models. Experiments on the NitiBench benchmark show that the semantic reward achieves competitive in-domain performance, with up to a 90% improvement in Citation F1 over instruction tuning at 2.5× lower compute cost than judge-based supervision. Ablation studies further reveal the importance of answer-level reward components, and correlation analysis supports the partial validity of semantic signals as reward proxies. These results offer actionable insights into affordable and robust alignment of legal LLMs.
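For concreteness, the snippet below is a minimal sketch of what a semantic similarity-based reward proxy of this kind could look like: a cosine-similarity score between the generated and reference answers combined with a citation-overlap F1 term. The encoder checkpoint, the equal weighting, and the helper names are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of a semantic reward proxy for GRPO-style training;
# not the authors' code. Encoder choice and 0.5/0.5 weighting are assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def citation_f1(predicted: set[str], gold: set[str]) -> float:
    # Overlap F1 between cited and gold law sections (identifiers as strings).
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def semantic_reward(answer: str, reference: str,
                    cited: set[str], gold_cited: set[str]) -> float:
    # Answer-level component: cosine similarity of sentence embeddings.
    emb = encoder.encode([answer, reference], convert_to_tensor=True)
    answer_sim = util.cos_sim(emb[0], emb[1]).item()
    # Scalar reward: semantic similarity plus citation overlap.
    return 0.5 * answer_sim + 0.5 * citation_f1(cited, gold_cited)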
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: legal NLP, question answering, LLM efficiency, NLP in resource-constrained settings, fine-tuning, reinforcement learning, retrieval-augmented generation, domain adaptation, interdisciplinary recontextualization of NLP, semantic textual similarity
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: Thai
Submission Number: 5230