Dynamic Trust Region Adaptation for Human-in-the-Loop Reinforcement Learning in Code Refinement
Keywords: Code Refinement
Abstract: We propose a dynamic trust region adaptation framework for Human-in-the-Loop Reinforcement Learning (HITL-RL) in code refinement, addressing the challenge of incorporating feedback from human annotators of varying skill into policy updates. Conventional methods treat all feedback uniformly, which can degrade convergence because feedback quality varies. The proposed framework introduces a Bayesian Feedback Confidence Estimator, which quantifies the reliability of each piece of human feedback as a dynamically updated confidence score, and an Adaptive Trust Region Controller, which modulates policy updates according to this score. High-confidence feedback enlarges the trust region to encourage exploration, while low-confidence feedback shrinks it to prevent overfitting to unreliable signals. Furthermore, the framework includes a confidence-weighted reward shaping mechanism and a gated policy network that selectively favor reliable feedback during training. Implemented with transformer architectures, including a Codex-style policy network and a DeBERTa-v3 feedback encoder, the framework adapts to feedback uncertainty in a closed loop.
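The abstract does not specify the exact update rules, so the following Python sketch illustrates one plausible reading of the mechanism, assuming a Beta-Bernoulli reliability posterior for the Feedback Confidence Estimator and a PPO-style clip radius as the trust region. All names (`FeedbackConfidenceEstimator`, `adaptive_clip_range`, `shaped_reward`) and the agreement-with-tests reliability signal are hypothetical illustrations, not the authors' implementation.

```python
class FeedbackConfidenceEstimator:
    """Illustrative Beta-Bernoulli posterior over a feedback source's reliability."""

    def __init__(self, prior_a: float = 1.0, prior_b: float = 1.0):
        self.a, self.b = prior_a, prior_b

    def update(self, feedback_agreed_with_tests: bool) -> None:
        # Treat agreement with unit-test outcomes as a noisy reliability signal (assumption).
        if feedback_agreed_with_tests:
            self.a += 1.0
        else:
            self.b += 1.0

    @property
    def confidence(self) -> float:
        # Posterior mean reliability in [0, 1].
        return self.a / (self.a + self.b)


def adaptive_clip_range(confidence: float,
                        eps_min: float = 0.05,
                        eps_max: float = 0.3) -> float:
    """Map feedback confidence to a PPO-style trust-region (clip) radius:
    high confidence widens the allowed policy update, low confidence narrows it."""
    return eps_min + confidence * (eps_max - eps_min)


def shaped_reward(env_reward: float, human_reward: float, confidence: float) -> float:
    """Confidence-weighted reward shaping: weight human feedback by its estimated reliability."""
    return (1.0 - confidence) * env_reward + confidence * human_reward


# Example: after three feedback events, derive the clip radius for the next policy update.
estimator = FeedbackConfidenceEstimator()
for agreed in (True, True, False):
    estimator.update(agreed)
eps = adaptive_clip_range(estimator.confidence)
```

In this reading, the confidence score feeds both the trust-region radius and the reward mixture, so unreliable feedback simultaneously receives smaller policy steps and lower reward weight.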
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 25440