Keywords: Self-Correction, Process Supervision, AI Safety, Trustworthy AI, Uncertainty Quantification, Hallucinations, Generative Models, Large Language Models, Interpretability
TL;DR: We introduce Reflexion, a framework that trains language models to find and fix their own mistakes by learning a 'generate-critique-refine' process, significantly improving reliability and making smaller models perform like much larger ones.
Abstract: Large Language Models (LLMs) have achieved widespread adoption, yet their reliability is fundamentally undermined by their tendency to generate plausible but incorrect content, a phenomenon known as hallucination. This unreliability is a critical barrier to their safe deployment in high-stakes domains, and current mitigation strategies, such as external tool use or Reinforcement Learning from Human Feedback (RLHF), are largely reactive, treating the model as a black box and failing to correct the flawed reasoning processes that lead to these errors. This paper investigates a new paradigm: can we endow LLMs with an internalized skill of self-correction? To address this, we introduce Reflexion, a framework that trains a single, unified model to explicitly follow a generate → critique → refine reasoning trace. To enable this process-based supervision, we developed ReTrace, a novel dataset of 200,000 structured self-correction examples bootstrapped from a teacher model. Furthermore, we propose an efficient inference mechanism, Uncertainty-Triggered Deliberation (UTD), which dynamically engages this deliberative process only when the model is uncertain. Our experiments show that a Reflexion-trained 8B model significantly outperforms its baseline counterparts, achieving a 15.2% absolute improvement on the TruthfulQA benchmark and a 9.8% improvement on GSM8K. Notably, our 8B model surpasses the performance of a standard 70B model on factuality, demonstrating the substantial parameter efficiency of our approach. Our findings establish that supervising the reasoning process, not just the final outcome, is a more direct and effective path towards building reliable AI. Reflexion represents a critical step away from reactive, black-box fixes and towards creating more transparent, trustworthy, and self-correcting AI systems.
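The abstract describes UTD only at a high level, so the following is a minimal sketch of how an uncertainty-triggered generate → critique → refine loop could be wired up, not the paper's implementation: the `Draft` interface, the mean negative log-probability trigger, and the `threshold` value of 1.0 nats per token are all illustrative assumptions.

```python
# Minimal sketch of Uncertainty-Triggered Deliberation (UTD): run the full
# generate -> critique -> refine trace only when the draft answer looks uncertain.
# The model interface, the confidence measure, and the threshold are assumptions
# made for illustration, not details taken from the paper.
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Draft:
    text: str
    token_logprobs: List[float]  # log-probability of each generated token


def mean_uncertainty(draft: Draft) -> float:
    """Average negative log-probability per token; higher means less confident."""
    if not draft.token_logprobs:
        return float("inf")
    return -sum(draft.token_logprobs) / len(draft.token_logprobs)


def utd_answer(
    prompt: str,
    generate: Callable[[str], Draft],           # single forward pass
    critique: Callable[[str, str], str],         # (prompt, draft_text) -> critique
    refine: Callable[[str, str, str], Draft],    # (prompt, draft_text, critique) -> revised draft
    threshold: float = 1.0,                      # assumed trigger, in nats per token
) -> Tuple[str, bool]:
    """Return the final answer and whether deliberation was triggered."""
    draft = generate(prompt)
    if mean_uncertainty(draft) <= threshold:
        return draft.text, False  # confident: skip deliberation and save compute
    feedback = critique(prompt, draft.text)
    revised = refine(prompt, draft.text, feedback)
    return revised.text, True


# Toy usage with stand-in callables; a real system would wrap an LLM here.
if __name__ == "__main__":
    gen = lambda p: Draft("Paris is the capital of France.", [math.log(0.9)] * 7)
    crit = lambda p, d: "No factual issues found."
    ref = lambda p, d, c: Draft(d, [math.log(0.95)] * 7)
    answer, deliberated = utd_answer("What is the capital of France?", gen, crit, ref)
    print(answer, "| deliberated:", deliberated)
```

The point of the gating step is the compute saving claimed in the abstract: confident drafts are returned directly, and the extra critique and refine passes are paid for only on uncertain inputs.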
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24089