Keywords: reasoning traces, dual objective learning, legal NLP, judicial decision modeling
Abstract: In the legal domain, large language models (LLMs) can improve reliability and interpretability by generating explicit reasoning traces. However, training LLMs to produce high-quality reasoning traces remains challenging. Existing supervised fine-tuning methods struggle when reasoning traces are significantly longer than final answers, as the learning signal for the answer becomes diluted. Meanwhile, reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have their own drawbacks, including costly reward design, performance plateaus, and reward hacking. To address these challenges, we propose a two-stage training framework. In Stage I, JurisCoT-SFT employs a length-normalized dual objective to balance learning signals between reasoning traces and final answers. In Stage II, Lifecycle-Aware Backtrackable Policy Optimization dynamically activates and deactivates auxiliary reward signals based on their impact on primary performance metrics, enabling efficient reward utilization without manual intervention. Trained on 2.7 million real-world judicial decision triplets and evaluated on a professionally annotated benchmark of 3,462 cases, our method, fine-tuned on Qwen3-8B, achieves state-of-the-art performance averaged across evaluation metrics, outperforming both specialized legal LLMs and larger general-purpose models.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: legal NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: Chinese
Submission Number: 7452