When Does Stein Beat Antithetic Sampling? Distribution Complexity in Discrete Gradient Estimation

Published: 03 Mar 2026, Last Modified: 07 Apr 2026, ICLR 2026 DeLTa Workshop Poster, CC BY 4.0
Keywords: discrete gradient estimation, variance reduction, Stein operators, antithetic sampling, control variates, variational autoencoders, distribution complexity, CAGE algorithm
TL;DR: We prove when Stein operators outperform antithetic sampling for discrete gradients and propose CAGE, a complexity-aware meta-algorithm that achieves an 82-nat improvement on complex distributions.
Abstract: Training models with discrete latent variables requires gradient estimators that handle non-differentiable sampling. While antithetic methods (DisARM, ARMS) dominate benchmarks on simple datasets, they fail catastrophically on complex distributions, a phenomenon that has so far gone unexplained. We prove the **Complexity-Variance Theorem**: the variance of antithetic estimators scales as $\Omega(\log K)$ in the number of data classes $K$, while Stein-based estimators achieve $O(1)$ variance, independent of complexity. This result predicts a **crossover threshold** $K^* \approx 200$ above which Stein methods begin to dominate. Based on this insight, we propose **CAGE** (**C**omplexity-**A**ware **G**radient **E**stimation), a meta-algorithm that automatically selects the estimator: antithetic methods for $K < K^*$ and Stein-Adjoint for $K > K^*$. We validate our theory across **581+ experiments** on five datasets. On simple distributions (MNIST, $K=10$), CAGE matches the state-of-the-art ARMS-LOO ($-201.3$ nats). On complex distributions (Omniglot, $K=1623$), CAGE achieves an **82.1-nat improvement** over the best antithetic baseline. The predicted threshold is consistent with our empirical observations: the Stein advantage emerges at 500 classes and grows monotonically. Our work turns gradient estimator selection from empirical trial and error into a principled, complexity-aware decision.
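The selection rule described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `select_estimator`, the string labels, and the handling of the boundary case $K = K^*$ (which the abstract leaves unspecified) are assumptions; only the threshold $K^* \approx 200$ and the two estimator families come from the text.

```python
# Hypothetical sketch of the CAGE selection rule as stated in the abstract.
# Names and return labels are illustrative assumptions, not the paper's API.

K_STAR = 200  # approximate crossover threshold K* reported in the abstract


def select_estimator(num_classes: int, k_star: int = K_STAR) -> str:
    """Pick an estimator family from the number of data classes K."""
    if num_classes <= 0:
        raise ValueError("num_classes must be a positive integer")
    # Abstract: antithetic methods for K < K*, Stein-Adjoint for K > K*.
    # Assumption: K == K* is routed to Stein-Adjoint (not specified in the source).
    if num_classes < k_star:
        return "antithetic"
    return "stein_adjoint"


if __name__ == "__main__":
    # Class counts taken from the abstract: MNIST (K=10), Omniglot (K=1623).
    for name, k in [("MNIST", 10), ("Omniglot", 1623)]:
        print(name, k, select_estimator(k))
```

Under this rule, MNIST ($K=10$) falls on the antithetic side of the threshold and Omniglot ($K=1623$) on the Stein-Adjoint side, matching the two regimes the abstract reports.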
Submission Number: 30