Keywords: Differentiable Simulation, Aerial Robots, Reinforcement Learning
Abstract: Quadrotor control policies can be trained to high performance by backpropagating the exact gradients of differentiable rewards through time (BPTT) to optimize policy parameters. However, designing a fully differentiable reward architecture is often challenging for high-level real-world tasks, as opposed to low-level control in simulation. Partially differentiable rewards lead to biased gradient propagation that severely degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines learned 0-step returns with analytical cumulative rewards, effectively reducing the bias by leveraging value gradients from the learned Q-value function. In addition, it adopts entropy regularization and state-initialization mechanisms to improve training efficiency. We evaluate ABPT on four representative quadrotor flight tasks, in both simulation and the real world. Experimental results demonstrate that ABPT converges significantly faster and achieves higher final rewards than existing representative learning algorithms, particularly in tasks involving partially differentiable rewards.
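As a rough illustration of the combination described in the abstract, the sketch below mixes an analytical cumulative reward obtained by differentiating through a simulated rollout (BPTT) with a learned 0-step return from a Q-value critic. All names (`sim_rollout`, `q_net`, `policy`) and the mixing weight `alpha` are hypothetical placeholders, not the paper's actual interface or formulation.

```python
def abpt_style_policy_loss(sim_rollout, q_net, policy, s0, horizon, gamma=0.99, alpha=0.5):
    """Minimal sketch (assumptions labeled): combine an analytical cumulative
    reward from a differentiable rollout with a learned 0-step return Q(s0, pi(s0))."""
    # Analytical cumulative reward: rewards stay on the autodiff graph of the
    # differentiable simulator, so gradients flow back through time (BPTT).
    s, analytic_return = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        s, r = sim_rollout(s, a)  # differentiable step; r may be only partially differentiable
        analytic_return = analytic_return + (gamma ** t) * r

    # Learned 0-step return: value gradients come from the critic, not the simulator.
    learned_return = q_net(s0, policy(s0))

    # Mix the two return estimates; alpha is a hypothetical weight for illustration only.
    objective = alpha * analytic_return + (1.0 - alpha) * learned_return
    return -objective.mean()  # minimize the negative return
```

In this reading, the critic term supplies unbiased-on-average value gradients where the reward itself is non-differentiable, while the rollout term retains the low-variance analytical gradients that make BPTT sample-efficient.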
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 11311