Abstract: Citation-sensitive legal question answering in low-resource settings, such as Thai law, poses unique challenges for large language models (LLMs). We investigate how to align LLMs for citation-sensitive legal question answering in Thai using Group Relative Policy Optimization (GRPO). Focusing on affordable alignment, we compare semantic similarity-based reward proxies against large LLM judge models. Experiments on the NitiBench benchmark show that the semantic reward achieves competitive in-domain performance, with up to a 90% improvement in Citation F1 over instruction tuning at 2.5× lower compute cost than judge-based supervision. Ablation studies further reveal the importance of answer-level reward components, and correlation analysis supports the partial validity of semantic signals as reward proxies. These results offer actionable insights into affordable and robust alignment of legal LLMs.
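For concreteness, the snippet below is a minimal sketch of what a semantic similarity-based reward proxy of this kind could look like: a cosine-similarity score between the generated and reference answers combined with a citation-overlap F1 term. The encoder checkpoint, the equal weighting, and the helper names are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of a semantic reward proxy for GRPO-style training;
# not the authors' code. Encoder choice and 0.5/0.5 weighting are assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def citation_f1(predicted: set[str], gold: set[str]) -> float:
    # Overlap F1 between cited and gold law sections (identifiers as strings).
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def semantic_reward(answer: str, reference: str,
                    cited: set[str], gold_cited: set[str]) -> float:
    # Answer-level component: cosine similarity of sentence embeddings.
    emb = encoder.encode([answer, reference], convert_to_tensor=True)
    answer_sim = util.cos_sim(emb[0], emb[1]).item()
    # Scalar reward: semantic similarity plus citation overlap.
    return 0.5 * answer_sim + 0.5 * citation_f1(cited, gold_cited)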
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: legal NLP, question answering, LLM efficiency, NLP in resource-constrained settings, fine-tuning, reinforcement learning, retrieval-augmented generation, domain adaptation, interdisciplinary recontextualization of NLP, semantic textual similarity
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: Thai
Submission Number: 5230