Learning When to Be Sparse: Adaptive Activations via Two-Parameter Entropy

Published: 02 Mar 2026, Last Modified: 25 Apr 2026, Venue: Sci4DL 2026, Readers: Everyone, License: CC BY 4.0
Keywords: Sharma-Mittal entropy, activation function, softmax, entmax, sparsemax
TL;DR: We introduce SharMiX, a learnable two-parameter activation function based on Sharma–Mittal entropy that generalizes softmax and its sparse alternatives and automatically adapts its sparsity to the data.
Abstract: The softmax operator, while foundational to modern machine learning, arises from Shannon entropy regularization, an assumption rooted in classical statistical mechanics that breaks down for systems with long-range correlations, power-law tails, or fractal structure. Such non-extensive regimes are common in practice: real-world datasets often exhibit Zipfian class frequencies, under which classical entropy misallocates probability mass. Sparse alternatives such as α-entmax address this issue via Tsallis entropy, but they rigidly tie sparsity to a single parameter. We introduce SharMiX, a two-parameter activation based on Sharma–Mittal entropy that unifies the Shannon, Rényi, and Tsallis families. We derive closed-form, Lipschitz-continuous Jacobians of the activation outputs with respect to both the input logits and the entropy parameters (q, r), enabling end-to-end learning via implicit differentiation. This allows SharMiX to adapt dynamically to the statistical properties of the data, becoming sparse for heavy-tailed, non-extensive distributions and dense for balanced, extensive ones. Experiments on text classification, CIFAR-100, and ImageNet-1k demonstrate that SharMiX automatically navigates the accuracy-sparsity trade-off, adapting to the underlying class-frequency distribution.
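Illustration: the claim that Sharma–Mittal entropy unifies the Shannon, Rényi, and Tsallis families can be checked numerically. The sketch below assumes the standard textbook form H_{q,r}(p) = [(Σ_i p_i^q)^((1−r)/(1−q)) − 1] / (1 − r); the submission's own parameterization may differ, and all function names here are hypothetical, chosen only for this example. It covers the entropy only, not the SharMiX activation itself.

```python
import math

def sharma_mittal(p, q, r):
    """Sharma-Mittal entropy (textbook form; assumes q > 0, q != 1, r != 1)."""
    s = sum(pi ** q for pi in p if pi > 0)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

def shannon(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi(p, q):
    """Renyi entropy of order q (q != 1)."""
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)

def tsallis(p, q):
    """Tsallis entropy of index q (q != 1)."""
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)

# Example distribution (arbitrary, for illustration only).
p = [0.5, 0.3, 0.2]
eps = 1e-6

# r = q recovers Tsallis exactly.
gap_tsallis = abs(sharma_mittal(p, 2.0, 2.0) - tsallis(p, 2.0))
# r -> 1 recovers Renyi (approached numerically).
gap_renyi = abs(sharma_mittal(p, 2.0, 1.0 + eps) - renyi(p, 2.0))
# q -> 1 and r -> 1 together recover Shannon (approached numerically).
gap_shannon = abs(sharma_mittal(p, 1.0 + eps, 1.0 + eps) - shannon(p))

print(gap_tsallis, gap_renyi, gap_shannon)
```

All three gaps should be close to zero, which is the sense in which the two parameters (q, r) interpolate between the one-parameter families named in the abstract.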
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 119