TL;DR: We propose to generate loss functions from $f$-divergences and evaluate them on language modeling tasks.
Abstract: The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling.
It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator.
In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures.
We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones.
By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator that we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence.
On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including pre-training, post-training (SFT), and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to the Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.
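To make the $f$-softargmax concrete, below is a minimal NumPy sketch of a bisection solver, under the assumption that the operator is the map $\theta \mapsto \mathrm{argmax}_{p \in \Delta} \langle \theta, p \rangle - \sum_i q_i f(p_i / q_i)$ for a reference measure $q$, by analogy with the softargmax and the KL divergence. The function name, the bracketing heuristic, and the closed-form inverse derivatives are illustrative assumptions, not the paper's implementation (which is parallelizable, e.g., across a batch of logit vectors).

```python
import numpy as np


def f_softargmax_bisection(theta, fprime_inv, q=None, n_iter=50):
    """Sketch: solve argmax_{p in simplex} <theta, p> - sum_i q_i f(p_i / q_i).

    Uses the stationarity condition p_i = q_i * max(0, (f')^{-1}(theta_i - tau))
    and bisects on the scalar dual variable tau until sum_i p_i = 1.
    """
    theta = np.asarray(theta, dtype=float)
    q = np.ones_like(theta) if q is None else np.asarray(q, dtype=float)

    def primal(tau):
        # Candidate solution for a given dual variable tau.
        return q * np.maximum(fprime_inv(theta - tau), 0.0)

    # sum_i p_i(tau) is non-increasing in tau; bracket the root where it equals 1.
    lo, hi = theta.min() - 1.0, theta.max() + 1.0
    while primal(lo).sum() < 1.0:
        lo -= (hi - lo)
    while primal(hi).sum() > 1.0:
        hi += (hi - lo)

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if primal(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid

    p = primal(0.5 * (lo + hi))
    return p / p.sum()  # small renormalization guards against residual bisection error


# alpha-divergence generator, alpha = 1.5: (f')^{-1}(u) = max(0, 1 + (alpha - 1) u)^(1 / (alpha - 1))
alpha = 1.5
fprime_inv_alpha = lambda u: np.maximum(1.0 + (alpha - 1.0) * u, 0.0) ** (1.0 / (alpha - 1.0))
print(f_softargmax_bisection(np.array([2.0, 0.5, -1.0]), fprime_inv_alpha))

# KL generator f(t) = t log t: (f')^{-1}(u) = exp(u - 1), recovering the usual softargmax.
print(f_softargmax_bisection(np.array([2.0, 0.5, -1.0]), lambda u: np.exp(u - 1.0)))
```

In this sketch, the KL generator with a uniform reference measure reproduces the standard softargmax probabilities, while generators whose inverse derivative hits zero (such as the $\alpha$-divergence for $\alpha > 1$) can produce sparse outputs.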
Lay Summary: We propose to build new cost objectives for deep learning by modifying their theoretical blueprint. We test these new losses on real problems and observe that our approach can lead to improvements on some language modeling tasks.
Primary Area: Optimization
Keywords: loss functions, f-divergences, entropies, Fenchel conjugates
Submission Number: 10795