Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Published: 18 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · License: CC BY 4.0
Keywords: Zero-Label Learning, Self-Confidence, Reinforcement Learning (RL)
TL;DR: RLSC is a sample-efficient post-training method that boosts LLM performance using only the model’s own confidence, without any labels.
Abstract: Large Language Models (LLMs) have demonstrated strong performance on reasoning tasks, but post-training optimization remains essential for aligning their behavior with specific task objectives. Existing reinforcement learning (RL) approaches often rely on costly human annotations or external reward models, limiting their scalability in real-world applications. To address this, we propose Reinforcement Learning via Self-Confidence (RLSC), a method that uses the model's own confidence in its outputs as the reward signal, without requiring human labels, preference models, or manually crafted reward functions. RLSC is also highly sample-efficient: it needs only 1 to 8 samples per problem and typically converges within 15 to 30 training steps. Under the Pass@1 evaluation metric, Qwen-Math-7B achieves significant improvements across several mathematical benchmarks: AIME2024 +6.7%, AMC23 +33.1%, Math500 +32.3%, and Minerva +29.8%, an average gain of 23.68%. Notably, the effectiveness of RLSC is not limited to the Qwen series; it also yields substantial gains on other mainstream models, including Olmo-7B, DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Llama-8B, Gemma-4B, and LLaMA-8B. In summary, RLSC offers a simple, efficient, and scalable post-training method for pretrained language models, enabling significant performance gains with few training steps.
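The abstract does not define how "self-confidence" is computed, so the following is a minimal, hypothetical sketch of the general idea: score each of the 1-8 sampled answers by a confidence measure derived from the model's own token probabilities (here, mean per-token probability, a stand-in assumption, not the paper's definition) and use it as the reward in a REINFORCE-style update with a mean-reward baseline.

```python
import math


def self_confidence_reward(token_logprobs):
    """Confidence score for one sampled answer.

    Hypothetical stand-in: mean per-token probability. The paper's
    exact confidence definition is not given in the abstract.
    """
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def rlsc_style_loss(sampled_logprobs):
    """REINFORCE-style loss over the sampled answers for one problem.

    sampled_logprobs: list (length 1-8) of per-token log-prob lists,
    one per sampled answer. Subtracting the mean reward as a baseline
    means answers the model is more confident in are reinforced
    relative to the others, without any external labels.
    """
    rewards = [self_confidence_reward(lps) for lps in sampled_logprobs]
    baseline = sum(rewards) / len(rewards)
    loss = 0.0
    for lps, r in zip(sampled_logprobs, rewards):
        # Policy-gradient surrogate: -(advantage) * log p(answer)
        loss += -(r - baseline) * sum(lps)
    return loss / len(sampled_logprobs)
```

In a real setup the per-token log-probs would come from the policy model itself and the loss would be minimized with an optimizer over the model's parameters; this toy version only illustrates the label-free reward shape.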
Supplementary Material: zip
Primary Area: generative models
Submission Number: 13102