Keywords: Differentiable Simulation, Aerial Robots, Reinforcement Learning
Abstract: Quadrotor control policies can be trained to high performance by backpropagating the exact gradients of differentiable rewards through time (BPTT) to optimize policy parameters. However, designing a fully differentiable reward architecture is often challenging for high-level real-world tasks, as opposed to low-level control in simulation. Partially differentiable rewards lead to biased gradient propagation that severely degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines learned 0-step returns with analytical cumulative rewards, effectively reducing the bias by leveraging value gradients from the learned Q-value function. In addition, it adopts entropy regularization and state-initialization mechanisms to improve training efficiency. We evaluate ABPT on four representative quadrotor flight tasks, in both simulation and the real world. Experimental results demonstrate that ABPT converges significantly faster and achieves higher final rewards than existing representative learning algorithms, particularly in tasks involving partially differentiable rewards.
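As a rough illustration of the combination described in the abstract, the sketch below mixes an analytical cumulative reward obtained by differentiating through a simulated rollout (BPTT) with a learned 0-step return from a Q-value critic. All names (`sim_rollout`, `q_net`, `policy`) and the mixing weight `alpha` are hypothetical placeholders, not the paper's actual interface or formulation.

```python
def abpt_style_policy_loss(sim_rollout, q_net, policy, s0, horizon, gamma=0.99, alpha=0.5):
    """Minimal sketch (assumptions labeled): combine an analytical cumulative
    reward from a differentiable rollout with a learned 0-step return Q(s0, pi(s0))."""
    # Analytical cumulative reward: rewards stay on the autodiff graph of the
    # differentiable simulator, so gradients flow back through time (BPTT).
    s, analytic_return = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        s, r = sim_rollout(s, a)  # differentiable step; r may be only partially differentiable
        analytic_return = analytic_return + (gamma ** t) * r

    # Learned 0-step return: value gradients come from the critic, not the simulator.
    learned_return = q_net(s0, policy(s0))

    # Mix the two return estimates; alpha is a hypothetical weight for illustration only.
    objective = alpha * analytic_return + (1.0 - alpha) * learned_return
    return -objective.mean()  # minimize the negative return
```

In this reading, the critic term supplies unbiased-on-average value gradients where the reward itself is non-differentiable, while the rollout term retains the low-variance analytical gradients that make BPTT sample-efficient.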
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 11311