Keywords: Large Language Models, Supervised Fine-Tuning, Output Diversity
Abstract: Large Language Models (LLMs) typically rely on Supervised Fine-Tuning (SFT) with the Cross-Entropy (CE) loss to specialize in downstream tasks. However, CE pushes the model's output distribution toward one-hot targets and ignores plausible alternative continuations, thereby limiting output diversity, a key drawback for generative applications that rely on sampling-based exploration.
In this paper, we propose ``Training with Sparsemax+, Testing with Softmax (TS$^2$)''. Intuitively, sparsemax and its tailored loss mask the gradients of probabilities outside the support set, so out-of-support logits are never pushed down during training and excessive probability mass leaks onto irrelevant tail classes once the model is evaluated with softmax. To address this issue, we introduce an improved variant, Sparsemax+, for training, which augments the sparsemax loss with a suppression term that penalizes out-of-support probabilities. At test time, we decode with softmax, yielding calibrated, non-degenerate probabilities in which plausible near-ties survive.
We fine-tune Llama-3.1-8B and Qwen-2.5-7B with TS$^2$ and observe consistent improvements in accuracy and output diversity across chat, code, and open-domain benchmarks. Together, these results demonstrate that TS$^2$ is a practical, drop-in recipe for fine-tuning LLMs into models that are both more accurate and more creative.
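To make the training objective concrete, below is a minimal PyTorch sketch of a Sparsemax+-style loss. The sparsemax projection and the sparsemax loss follow Martins & Astudillo (2016); the abstract does not specify the exact form of the suppression term, so the penalty on out-of-support softmax mass and its weight `lam` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sparsemax(logits: torch.Tensor) -> torch.Tensor:
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection of the
    logits onto the probability simplex, yielding a sparse distribution."""
    z, _ = torch.sort(logits, dim=-1, descending=True)
    k = torch.arange(1, logits.size(-1) + 1,
                     device=logits.device, dtype=logits.dtype)
    z_cumsum = z.cumsum(-1)
    in_support = 1 + k * z > z_cumsum               # sorted entries kept in the support
    k_z = in_support.sum(-1, keepdim=True)          # support size |S(z)|
    tau = (z_cumsum.gather(-1, k_z - 1) - 1) / k_z  # threshold tau(z)
    return torch.clamp(logits - tau, min=0.0)

def sparsemax_plus_loss(logits: torch.Tensor, targets: torch.Tensor,
                        lam: float = 0.1) -> torch.Tensor:
    """Hypothetical Sparsemax+ objective: the sparsemax loss plus a
    suppression penalty on softmax mass outside the sparsemax support
    (`lam` is an assumed hyperparameter, not from the paper)."""
    p = sparsemax(logits)                           # sparse training distribution
    z_gold = logits.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Sparsemax loss, -z_y + 1/2 * sum_{j in S(z)} (z_j^2 - tau^2) + 1/2,
    # rewritten through p_j = max(z_j - tau, 0):
    sparsemax_nll = -z_gold + 0.5 * (p * (2 * logits - p)).sum(-1) + 0.5
    # Suppression term: softmax probability mass on out-of-support classes,
    # whose gradients the plain sparsemax loss would otherwise mask.
    leaked_mass = (F.softmax(logits, dim=-1) * (p == 0).to(logits.dtype)).sum(-1)
    return (sparsemax_nll + lam * leaked_mass).mean()

# Toy usage: a batch of 4 next-token predictions over a 32k vocabulary.
logits = torch.randn(4, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (4,))
loss = sparsemax_plus_loss(logits, targets)
loss.backward()
```

At test time nothing in the decoding stack changes: one samples from `F.softmax(logits, dim=-1)` as usual, which is what would make the recipe drop-in.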
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18411