Keywords: Fast sampling
TL;DR: FlashSampling performs fast and memory-efficient grouped sampling by using the Gumbel-Max trick to avoid the softmax computation.
Abstract: Sampling operations over discrete spaces are widely used in fields such as language modeling, reinforcement learning, VAEs, GANs, and neural architecture search. Current sampling methods compute a softmax over all categories, incurring significant computational and memory costs, particularly when the number of categories is large. This paper presents FlashSampling, a novel sampling approach designed to reduce this computational and communication overhead by circumventing the softmax computation. Our method is mathematically equivalent to conventional sampling strategies while being significantly faster and more memory efficient. It partitions the categories into distinct groups, samples each group independently, and then leverages the Gumbel-Max trick to eliminate the need for softmax computation. We substantiate the correctness and efficacy of our method through both mathematical proofs and empirical validation. Extensive experimental results show marked improvements in speed and memory utilization, with FlashSampling attaining up to 384% faster sampling times and 1822% reduced memory consumption.
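A minimal sketch of the grouped Gumbel-Max sampling described in the abstract, assuming a simple single-process NumPy setting. The function name `gumbel_max_grouped_sample` and the contiguous grouping scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gumbel_max_grouped_sample(logits: np.ndarray, num_groups: int, rng=None) -> int:
    """Draw one sample from Categorical(softmax(logits)) without computing softmax.

    Gumbel-Max trick: argmax_i (logits_i + g_i), with g_i ~ Gumbel(0, 1) i.i.d.,
    is distributed as Categorical(softmax(logits)). Because argmax decomposes,
    the categories can be split into groups, each group can take its local
    argmax independently, and a final argmax over the group winners yields a
    sample from the same distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    groups = np.array_split(np.arange(logits.shape[0]), num_groups)

    best_idx, best_val = -1, -np.inf
    for idx in groups:  # each group could be processed on a separate device
        perturbed = logits[idx] + rng.gumbel(size=idx.shape[0])
        j = int(np.argmax(perturbed))
        if perturbed[j] > best_val:
            best_val, best_idx = float(perturbed[j]), int(idx[j])
    return best_idx

# Usage: over many draws, the empirical distribution should match
# sampling from softmax(logits) directly.
logits = np.random.default_rng(0).normal(size=1000)
sample = gumbel_max_grouped_sample(logits, num_groups=8)
```

No normalization constant is ever computed, so no pass over all categories' exponentials is needed; only per-group maxima and one final comparison are kept.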
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11425