Keywords: Causal Inference, Off-policy Evaluation, Experimentation, Nonstationarity, Reinforcement Learning
TL;DR: We present a simple estimator that achieves low-bias, low-variance estimation of the global average treatment effect in Bernoulli randomized experiments run in dynamic, nonstationary environments.
Abstract: Randomized experiments (A/B tests) are widely used to evaluate interventions in dynamic systems such as recommendation platforms, marketplaces, and digital health. In these settings, interventions affect both current and future system states, so estimating the global average treatment effect (GATE) requires accounting for temporal dynamics. Existing estimators, including difference-in-means (DM), off-policy evaluation methods, and difference-in-Q's (DQ), perform poorly in nonstationary environments due to high bias and variance. We address this challenge with the truncated policy gradient (TPG) estimator, which replaces instantaneous outcomes with truncated outcome trajectories. Theoretically, TPG corresponds to a truncated policy gradient that approximates the GATE to first order, yielding provable bias and variance improvements in nonstationary Markovian settings. We validate our theory through a ride-sharing simulation calibrated to New York City taxi data. The results show that a well-calibrated TPG estimator achieves low bias and variance in practical nonstationary settings.
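To make the core idea concrete, here is a minimal illustrative sketch in Python of a difference-in-means estimator applied to truncated outcome trajectories, which is the mechanism the abstract describes. The function name, arguments, and the plain unweighted averaging are assumptions for illustration only; the actual TPG estimator's weighting and correction terms follow the paper, not this sketch.

```python
import numpy as np

def truncated_trajectory_dm(outcomes, treatments, k):
    """Difference in means over truncated outcome trajectories (illustrative sketch).

    outcomes:   per-step outcomes Y_t observed in one Bernoulli randomized experiment
    treatments: per-step treatment indicators A_t in {0, 1} from the Bernoulli design
    k:          truncation horizon, i.e. how many future steps each trajectory keeps
    """
    outcomes = np.asarray(outcomes, dtype=float)
    treatments = np.asarray(treatments, dtype=int)
    T = len(outcomes)

    # Replace the instantaneous outcome at time t with the truncated trajectory
    # sum Y_t + Y_{t+1} + ... + Y_{t+k-1}, dropping steps whose window is incomplete.
    trunc = np.array([outcomes[t:t + k].sum() for t in range(T - k + 1)])
    a = treatments[:T - k + 1]

    # Compare average truncated trajectories between treated and control steps.
    return trunc[a == 1].mean() - trunc[a == 0].mean()
```

With k = 1 this reduces to the standard difference-in-means estimator on instantaneous outcomes; larger k captures more of the downstream effect of each treatment decision at the cost of higher variance, which is the bias-variance trade-off the paper analyzes.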
Submission Number: 66