IntDiff: Mitigating Reward Hacking via Intrinsic Rewards for Diffusion Model Fine-Tuning

ICLR 2026 Conference Submission159 Authors

01 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Reinforcement Learning; AIGC; Diffusion Model
Abstract: Diffusion models have made substantial progress in text-to-image generation, but their ability to optimize predefined objectives remains limited. Reinforcement learning (RL) fine-tuning can improve performance, yet it introduces two critical problems: exploration-exploitation imbalance and reward hacking. To address both, we propose a systematic framework, IntDiff, which designs a denoising-based intrinsic reward paradigm to guide exploration. A filtering mechanism dynamically monitors changes in text-image alignment, penalizes exploratory behaviors that cause degradation, and selectively discards inefficient samples, improving exploration effectiveness while saving computation. Furthermore, we propose an adaptive mechanism that encourages exploration in the early stages and shifts to exploitation in the late stages to stabilize image structure and improve the predefined reward, while dynamically filtering training steps. A Reflective Diffusion Optimization method is introduced to improve sample efficiency and training effectiveness under sample reduction and step truncation. Overall, this work aims to significantly improve generation quality and alignment with target objectives while reducing computational cost and mitigating reward hacking. Experimental results show that the proposed method improves text-image alignment and generation diversity and achieves superior performance across various reward metrics.
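To make the reward-shaping idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of how a predefined extrinsic reward might be combined with a denoising-based intrinsic reward under an adaptive schedule that favors exploration early and exploitation late, alongside a simple alignment-drop filter. All function names, the cosine schedule, and the tolerance-based filtering rule are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: adaptive mixing of extrinsic and intrinsic rewards,
# plus an alignment-drop sample filter. Names and schedule shape are assumptions.
import math


def intrinsic_weight(step: int, total_steps: int, w_max: float = 1.0) -> float:
    """Cosine-decayed intrinsic-reward coefficient: high early (exploration),
    low late (exploitation). The schedule form is an assumption."""
    progress = min(step / max(total_steps, 1), 1.0)
    return w_max * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_reward(extrinsic: float, intrinsic: float,
                    step: int, total_steps: int) -> float:
    """Weighted sum of the predefined reward and the intrinsic exploration reward."""
    w = intrinsic_weight(step, total_steps)
    return extrinsic + w * intrinsic


def keep_sample(alignment_before: float, alignment_after: float,
                tolerance: float = 0.02) -> bool:
    """Filter rule: discard a sample whose text-image alignment score
    (e.g. a CLIP-style similarity) degrades beyond a tolerance."""
    return (alignment_after - alignment_before) >= -tolerance


if __name__ == "__main__":
    # Toy usage: early steps weight the intrinsic signal heavily,
    # late steps rely almost entirely on the predefined reward.
    for step in (0, 500, 1000):
        r = combined_reward(extrinsic=0.8, intrinsic=0.3,
                            step=step, total_steps=1000)
        print(f"step {step}: combined reward = {r:.3f}")
    print("keep sample:", keep_sample(0.31, 0.30))
```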
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 159