FlipTS: Failure-to-Success Mathematical Self-Refinement via Trajectory-Aware Sampling in End-to-End Online RL

ACL ARR 2026 January Submission 3277 Authors

04 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Language Model (LLM), Mathematical Reasoning, Self-Refinement, End-to-End Reinforcement Learning (RL), Trajectory-Aware Sampling
Abstract: Self-Refinement has emerged as a promising paradigm for improving Large Language Model (LLM) performance by iteratively revising responses at inference time. Existing training methods typically treat refinement as a decoupled task and rely on pre-generated responses, thereby detaching training from the model's actual reasoning trajectories and limiting performance. In this paper, we propose FlipTS (Failure-to-Success Learning for self-refinement with Potential-guided Trajectory-Aware Sampling), an end-to-end reinforcement learning (RL) framework that optimizes the entire self-refinement loop online. We identify a fundamental bottleneck in this end-to-end setting: the sparsity of informative Failure-to-Success (F2S) refinement behaviors, i.e., turns in which refinement converts an incorrect response into a correct one. FlipTS addresses this by using a non-stationary Bayesian model to estimate data potential and applying trajectory-aware sampling to enrich the training process with valuable refinement signals. Experiments on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across challenging mathematical benchmarks demonstrate that FlipTS consistently outperforms both offline and online RL baselines. Notably, it exhibits robust generalization to scientific reasoning and safety tasks without domain-specific tuning. Our code will be open-sourced.
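To make the abstract's sampling idea concrete, here is a minimal sketch of how a non-stationary Bayesian estimate of per-prompt F2S potential could drive trajectory-aware sampling. This is not the authors' implementation: the discounted Beta-Bernoulli model, the names `FlipTracker`, `gamma`, and `sample_batch`, and the Thompson-style selection rule are all illustrative assumptions consistent with the abstract's description.

```python
# Hypothetical sketch of potential-guided, trajectory-aware sampling.
# Each prompt's probability of producing a Failure-to-Success (F2S)
# refinement is tracked with a *discounted* Beta-Bernoulli posterior,
# so stale evidence decays as the policy drifts during online RL.
import random

class FlipTracker:
    """Non-stationary Beta-Bernoulli estimate of a prompt's F2S potential."""

    def __init__(self, gamma: float = 0.95, alpha0: float = 1.0, beta0: float = 1.0):
        self.gamma = gamma                        # <1 forgets old evidence
        self.alpha0, self.beta0 = alpha0, beta0   # uniform Beta(1, 1) prior
        self.alpha, self.beta = 0.0, 0.0          # discounted success/failure counts

    def update(self, flipped: bool) -> None:
        # Exponentially decay past counts, then add the new observation.
        # flipped=True means refinement turned a failure into a success.
        self.alpha = self.gamma * self.alpha + float(flipped)
        self.beta = self.gamma * self.beta + float(not flipped)

    def sample_potential(self) -> float:
        # Thompson-style draw from the current posterior.
        return random.betavariate(self.alpha0 + self.alpha, self.beta0 + self.beta)

def sample_batch(trackers: dict, batch_size: int) -> list:
    """Pick the prompts whose posterior draws suggest the highest F2S potential."""
    draws = {pid: t.sample_potential() for pid, t in trackers.items()}
    return sorted(draws, key=draws.get, reverse=True)[:batch_size]

# Toy usage: prompt "p1" flips often in recent trajectories, so it tends
# to be sampled ahead of prompts with no observed F2S behavior.
trackers = {pid: FlipTracker() for pid in ("p0", "p1", "p2")}
for flipped in (True, True, False, True):
    trackers["p1"].update(flipped)
trackers["p0"].update(False)
print(sample_batch(trackers, batch_size=2))
```

The discounting step is what makes the estimate non-stationary: under a fixed posterior, early rollouts would dominate forever, whereas here the effective sample size is bounded by 1/(1 - gamma), keeping the potential estimate responsive to the current policy.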
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling, NLP Applications
Languages Studied: English
Submission Number: 3277