Towards Better Legal Reasoning LLMs: Signal Balancing and Reward Scheduling

ACL ARR 2026 January Submission7452 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: reasoning traces, dual objective learning, legal NLP, judicial decision modeling
Abstract: In the legal domain, large language models (LLMs) can improve reliability and interpretability by generating explicit reasoning traces. However, training LLMs to produce high-quality reasoning traces remains challenging. Existing supervised fine-tuning methods struggle when reasoning traces are significantly longer than final answers, as the learning signal for the answer becomes diluted. Meanwhile, reinforcement learning methods such as Group Relative Policy Optimization (GRPO) face their own drawbacks, including costly reward design, performance plateaus, and reward hacking. To address these challenges, we propose a two-stage training framework. In Stage I, JurisCoT-SFT employs a length-normalized dual objective to balance learning signals between reasoning traces and final answers. In Stage II, Lifecycle-Aware Backtrackable Policy Optimization dynamically activates and deactivates auxiliary reward signals based on their impact on primary performance metrics, enabling efficient reward utilization without manual intervention. Trained on 2.7 million real-world judicial decision triplets and evaluated on a professionally annotated benchmark of 3,462 cases, our method, fine-tuned on Qwen3-8B, achieves state-of-the-art performance on average across evaluation metrics, outperforming both specialized legal LLMs and larger general-purpose models.
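The Stage I idea — normalizing each segment's loss by its own length so a long reasoning trace cannot dilute the answer's learning signal — can be sketched as follows. This is a minimal illustrative sketch under stated assumptions: the function name, the per-token loss inputs, and the `alpha` weighting are hypothetical, not the paper's exact JurisCoT-SFT formulation.

```python
def length_normalized_dual_loss(trace_token_losses, answer_token_losses, alpha=0.5):
    """Hypothetical sketch of a length-normalized dual objective.

    Each segment's summed token loss is divided by that segment's
    length, so a 100-token reasoning trace and a 5-token answer
    contribute on an equal footing instead of the trace dominating.
    """
    # Mean loss per segment (guard against empty segments)
    trace_loss = sum(trace_token_losses) / max(len(trace_token_losses), 1)
    answer_loss = sum(answer_token_losses) / max(len(answer_token_losses), 1)
    # alpha balances the two normalized signals (an assumed hyperparameter)
    return alpha * trace_loss + (1 - alpha) * answer_loss
```

For example, with a 100-token trace at mean loss 2.0 and a 5-token answer at mean loss 4.0, a plain summed loss would be overwhelmingly trace-driven (200 vs. 20), while the normalized objective weights the two signals equally.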
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: legal NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: Chinese
Submission Number: 7452