Transparent and Robust RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

ICLR 2026 Conference Submission 12754 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Retrieval-Augmented Generation, Reinforcement Learning, Adaptive Rewards
Abstract: Retrieval-Augmented Generation (RAG) delivers substantial value in knowledge-intensive applications. Many recent works use reinforcement learning (RL) to elicit strong reasoning in *RAG generators*. However, two key challenges remain unresolved: **(1) Transparency**: most prior methods do not explicitly indicate which references are actually used in the reasoning that leads to the final answer, limiting interpretability and visibility; **(2) Stability**: the KL divergence estimator used in existing RL-based approaches can cause gradient spikes and thus unstable training. To address these challenges, we propose the **A**daptive-**R**ewarded **E**vidence **N**avigation **A**gent (**ARENA**), a transparent and robust RAG generator framework trained via RL with purpose-designed rewards. Built on our proposed structured protocol, KL divergence stabilization, and adaptive reward calculation modules, **ARENA** enables the RAG generator to identify key evidence, perform structured reasoning, and produce answers with interpretable decision traces. Applied to Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, extensive experiments against various baselines show that our models achieve highly transparent outputs with 10–30% accuracy improvements across three multi-hop QA datasets, performing comparably to advanced closed-source LLMs (e.g., OpenAI-o1, DeepSeek-R1). Further analyses show that ARENA generalizes well to unseen datasets and tasks. Our models and code will be publicly released.
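To make the two training-side mechanisms named in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' released code): a clipped per-token KL estimate of the kind commonly used in GRPO-style RL, where clamping the log-ratio bounds the exponential term that can otherwise produce gradient spikes, and a simple weighted combination of outcome, format, and evidence-citation signals standing in for an adaptive reward. The function names `kl_penalty` and `adaptive_reward`, the clip value, and the weights are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of a stabilized KL estimate and a composite reward.
# Nothing here is taken from the ARENA implementation; it only illustrates
# the kind of mechanisms the abstract describes.
import torch


def kl_penalty(logp_new: torch.Tensor, logp_ref: torch.Tensor,
               clip: float = 10.0) -> torch.Tensor:
    """Per-token KL estimate k3 = exp(d) - d - 1 with d = log p_ref - log p_new.

    Clamping d bounds the exponential term, which can otherwise blow up
    when the policy drifts far from the reference model.
    """
    d = (logp_ref - logp_new).clamp(min=-clip, max=clip)
    return torch.exp(d) - d - 1.0


def adaptive_reward(answer_correct: bool, format_ok: bool, cites_evidence: bool,
                    w_answer: float = 1.0, w_format: float = 0.2,
                    w_cite: float = 0.3) -> float:
    """Combine correctness, structured-format, and evidence-citation signals
    into one scalar reward. The weights are placeholders, not the paper's values."""
    return (w_answer * float(answer_correct)
            + w_format * float(format_ok)
            + w_cite * float(cites_evidence))


if __name__ == "__main__":
    # Toy per-token log-probabilities for the current policy and the reference.
    logp_new = torch.tensor([-1.2, -0.8, -2.5])
    logp_ref = torch.tensor([-1.0, -1.1, -2.0])
    print("per-token KL:", kl_penalty(logp_new, logp_ref))
    print("reward:", adaptive_reward(answer_correct=True, format_ok=True,
                                     cites_evidence=False))
```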
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12754