Keywords: top-k, k-hot, subset, relaxed, differentiable, sampling
TL;DR: We propose a framework for differentiable top-k by generalizing from one-hot to k-hot.
Abstract: The one-hot representation, the argmax operator, and its differentiable relaxation, softmax, are ubiquitous in machine learning. These building blocks lie at the heart of everything from the cross-entropy loss and the attention mechanism to differentiable sampling. Their $k$-hot counterparts, however, are not as universal. In this paper, we consolidate the literature on differentiable top-$k$, showing how the $k$-capped simplex connects relaxed top-$k$ operators and $\pi$ps (probability-proportional-to-size) sampling to form an intuitive generalization of one-hot sampling. In addition, we propose sigmoid top-$k$, a scalable relaxation of the top-$k$ operator that is fully differentiable and defined for continuous $k$. We validate our approach empirically and demonstrate its computational efficiency.
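For intuition only, the sketch below shows one common way a sigmoid-based top-$k$ relaxation can be realized: pick a threshold by bisection so that the entrywise sigmoids sum to approximately $k$, yielding a soft mask on the $k$-capped simplex that hardens to a $k$-hot vector as the temperature goes to zero. This is an illustrative assumption about the general family of relaxations the abstract refers to, not the paper's actual definition of sigmoid top-$k$; the function name and parameters are hypothetical.

```python
import numpy as np

def relaxed_topk(scores, k, temperature=0.1, iters=50):
    """Illustrative sigmoid-based top-k relaxation (hypothetical sketch).

    Bisects for a threshold t so that sigmoid((scores - t) / temperature)
    sums to roughly k. The output lies in [0, 1]^n with total mass ~k
    (the k-capped simplex) and approaches a hard k-hot indicator of the
    top-k entries as temperature -> 0.
    """
    lo = scores.min() - 10 * temperature  # threshold low enough that all entries ~1
    hi = scores.max() + 10 * temperature  # threshold high enough that all entries ~0
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        y = 1.0 / (1.0 + np.exp(-(scores - t) / temperature))
        if y.sum() > k:
            lo = t  # mask too heavy: raise the threshold
        else:
            hi = t  # mask too light: lower the threshold
    t = 0.5 * (lo + hi)
    return 1.0 / (1.0 + np.exp(-(scores - t) / temperature))

scores = np.array([2.1, -0.3, 0.7, 1.5, 0.2])
print(relaxed_topk(scores, k=2))  # soft mask with mass ~2, concentrated on the two largest scores
```

Because every operation above is smooth in the scores, gradients flow through the soft mask, in the same spirit as softmax serving as the differentiable surrogate for argmax in the one-hot ($k=1$) case.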
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 24859