Keywords: top-k, k-hot, subset, relaxed, differentiable, sampling
TL;DR: We propose a framework for differentiable top-k by generalizing from one-hot to k-hot.
Abstract: The one-hot representation, the argmax operator, and its differentiable relaxation, softmax, are ubiquitous in machine learning. These building blocks lie at the heart of everything from the cross-entropy loss and the attention mechanism to differentiable sampling. Their $k$-hot counterparts, however, are not as universal. In this paper, we consolidate the literature on differentiable top-$k$, showing how the $k$-capped simplex connects relaxed top-$k$ operators and $\pi$ps (probability-proportional-to-size) sampling to form an intuitive generalization of one-hot sampling. In addition, we propose sigmoid top-$k$, a scalable relaxation of the top-$k$ operator that is fully differentiable and defined for continuous $k$. We validate our approach empirically and demonstrate its computational efficiency.
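For intuition only, the sketch below shows one common way a sigmoid-based top-$k$ relaxation can be realized: pick a threshold by bisection so that the entrywise sigmoids sum to approximately $k$, yielding a soft mask on the $k$-capped simplex that hardens to a $k$-hot vector as the temperature goes to zero. This is an illustrative assumption about the general family of relaxations the abstract refers to, not the paper's actual definition of sigmoid top-$k$; the function name and parameters are hypothetical.

```python
import numpy as np

def relaxed_topk(scores, k, temperature=0.1, iters=50):
    """Illustrative sigmoid-based top-k relaxation (hypothetical sketch).

    Bisects for a threshold t so that sigmoid((scores - t) / temperature)
    sums to roughly k. The output lies in [0, 1]^n with total mass ~k
    (the k-capped simplex) and approaches a hard k-hot indicator of the
    top-k entries as temperature -> 0.
    """
    lo = scores.min() - 10 * temperature  # threshold low enough that all entries ~1
    hi = scores.max() + 10 * temperature  # threshold high enough that all entries ~0
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        y = 1.0 / (1.0 + np.exp(-(scores - t) / temperature))
        if y.sum() > k:
            lo = t  # mask too heavy: raise the threshold
        else:
            hi = t  # mask too light: lower the threshold
    t = 0.5 * (lo + hi)
    return 1.0 / (1.0 + np.exp(-(scores - t) / temperature))

scores = np.array([2.1, -0.3, 0.7, 1.5, 0.2])
print(relaxed_topk(scores, k=2))  # soft mask with mass ~2, concentrated on the two largest scores
```

Because every operation above is smooth in the scores, gradients flow through the soft mask, in the same spirit as softmax serving as the differentiable surrogate for argmax in the one-hot ($k=1$) case.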
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 24859