Back Propagation through Auctions: First-Order Policy Gradient for Auto-Bidding

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, Auto-Bidding, Auctions, Online Advertising
TL;DR: For auto-bidding, first-order policy gradients can be obtained almost for free from historical logs—without requiring differentiable simulators or learned environment models.
Abstract: In online advertising, auto-bidding agents compete in high-frequency auctions by setting a bidding parameter for each time interval that scales estimated impression values into actual bids. While prior work has framed this sequential decision problem as a reinforcement learning (RL) task, we identify that standard RL methods overlook key structural properties of the auto-bidding environment: agents receive fine-grained, impression-level feedback, and the objective is nearly differentiable due to the high density of impressions within each interval. We leverage this structure to propose First-Order policy gradient for auto-Bidding (FOB), a method that directly computes policy gradients by smoothing historical auction data and backpropagating through the sequential auctions. FOB builds on Myerson's lemma, a cornerstone of auction theory, to explicitly derive gradients. We validate FOB on AuctionNet, a public auto-bidding environment, where it consistently outperforms standard RL baselines and domain-specific auto-bidding methods, achieving superior performance with greater stability and faster convergence.
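The abstract's core idea — that smoothing historical auction outcomes makes the objective differentiable in the bidding parameter — can be illustrated with a minimal sketch. The code below is a simplified assumption of the setup, not the paper's actual method: it replaces the hard win indicator 1[α·vᵢ > pᵢ] with a sigmoid of temperature `tau`, then differentiates the smoothed total value won with respect to the scalar bidding parameter `alpha`. The names `smoothed_objective` and `first_order_gradient` are hypothetical.

```python
import numpy as np

def smoothed_objective(alpha, values, prices, tau=0.1):
    """Smoothed total value won when bidding alpha * value per impression.

    The hard win indicator 1[alpha * v_i > p_i] is replaced by a sigmoid,
    making the objective differentiable in alpha. Illustrative sketch only;
    FOB's actual smoothing and Myerson-lemma-based gradient may differ.
    """
    bids = alpha * values
    win_prob = 1.0 / (1.0 + np.exp(-(bids - prices) / tau))
    return np.sum(values * win_prob)

def first_order_gradient(alpha, values, prices, tau=0.1):
    """Analytic derivative of the smoothed objective w.r.t. alpha."""
    bids = alpha * values
    s = 1.0 / (1.0 + np.exp(-(bids - prices) / tau))
    # d/d(alpha) sum_i v_i * s((alpha*v_i - p_i)/tau)
    #   = sum_i v_i * s * (1 - s) * v_i / tau
    return np.sum(values * s * (1.0 - s) * values / tau)
```

Because the gradient is available in closed form from logged `(values, prices)` pairs, no differentiable simulator or learned environment model is needed — which is the point the TL;DR makes about obtaining first-order gradients "almost for free" from historical logs.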
Primary Area: reinforcement learning
Submission Number: 7476