Keywords: Reinforcement Learning, Auto-Bidding, Auctions, Online Advertising
TL;DR: For auto-bidding, first-order policy gradients can be obtained almost for free from historical logs—without requiring differentiable simulators or learned environment models.
Abstract: In online advertising, auto-bidding agents compete in high-frequency auctions by setting, for each time interval, a bidding parameter that scales estimated impression values into actual bids.
While prior work has framed this sequential decision problem as a reinforcement learning (RL) task, we identify that standard RL methods overlook key structural properties of the auto-bidding environment: agents receive fine-grained, impression-level feedback, and the objective is nearly differentiable due to the high density of impressions within each interval.
We leverage this structure to propose First-Order policy gradient for auto-Bidding (FOB), a method that directly computes policy gradients by smoothing historical auction data and backpropagating through the sequential auctions.
FOB leverages Myerson's lemma, a cornerstone of auction theory, to explicitly derive gradients.
We validate FOB on AuctionNet, a public auto-bidding environment, where it consistently outperforms standard RL baselines and domain-specific auto-bidding methods, achieving superior performance with greater stability and faster convergence.
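The smoothing-and-backpropagation idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact construction: it assumes a second-price auction, logged competing bids, and a sigmoid relaxation of the hard win indicator; the function name, the temperature `tau`, and the surrogate objective are all illustrative assumptions.

```python
import numpy as np

def smoothed_objective_and_grad(b, values, competing_bids, tau=0.1):
    """Illustrative first-order surrogate for logged auction outcomes.

    Replaces the hard win indicator 1{b * v_i > d_i} with a sigmoid so the
    per-interval objective becomes differentiable in the scalar bidding
    parameter b. Assumes a second-price auction, where the winner pays the
    highest competing bid d_i; all names here are hypothetical.
    """
    margins = b * values - competing_bids              # scaled bid minus top competitor
    win_prob = 1.0 / (1.0 + np.exp(-margins / tau))    # smoothed win indicator
    surplus = values - competing_bids                  # value minus second-price payment
    obj = np.sum(win_prob * surplus)                   # smoothed expected surplus
    # Analytic gradient: d sigmoid(m/tau)/db = sigmoid * (1 - sigmoid) * v / tau
    grad = np.sum(win_prob * (1.0 - win_prob) * (values / tau) * surplus)
    return obj, grad
```

As `tau` shrinks, the sigmoid approaches the hard allocation rule and the surrogate gradient concentrates on impressions whose outcome the bidding parameter can actually flip, which is the intuition behind smoothing dense impression-level logs rather than learning an environment model.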
Primary Area: reinforcement learning
Submission Number: 7476