FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

ICLR 2026 Conference Submission 22190 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Classification, Efficient Inference Methods, Embedding Approaches, Hardware and Systems, Natural Language Processing, Representation Learning
TL;DR: We introduce FlashHead, an inference-efficient replacement for the dense classification head that is both training-free and hardware-friendly.
Abstract: Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval, viewing the output head as a retrieval problem rather than a dense computation. FlashHead introduces four key innovations: (1) equal-sized clustering of embeddings, (2) multi-probe retrieval for the model head, (3) a novel inference-time sampling mechanism, and (4) selective quantization, enabling effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead establishes a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.
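The retrieval framing in the abstract can be illustrated with a small sketch: vocabulary embeddings are grouped into clusters, the hidden state probes its few nearest clusters, and logits are computed only over the retrieved candidate tokens before sampling. The snippet below is a minimal illustration under assumptions, not the paper's method: function names such as cluster_head and retrieval_logits are hypothetical, a toy k-means-style clustering stands in for the paper's equal-sized clustering, and selective quantization is omitted.

```python
import numpy as np

def cluster_head(W, n_clusters, seed=0):
    """Toy clustering of vocabulary embeddings W (V x d) via a few Lloyd
    iterations; the paper's equal-sized clustering is more involved."""
    rng = np.random.default_rng(seed)
    centroids = W[rng.choice(len(W), n_clusters, replace=False)].copy()
    for _ in range(10):
        assign = np.argmax(W @ centroids.T, axis=1)       # nearest centroid by dot product
        for k in range(n_clusters):
            members = W[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids, assign

def retrieval_logits(h, W, centroids, assign, n_probe=4):
    """Multi-probe retrieval head: score only tokens whose cluster centroid
    is among the n_probe centroids closest to the hidden state h."""
    probe = np.argsort(-(centroids @ h))[:n_probe]         # top-n_probe clusters
    cand = np.flatnonzero(np.isin(assign, probe))          # candidate token ids
    logits = W[cand] @ h                                   # partial logits over candidates only
    return cand, logits

# Usage: sample a token from the retrieved candidate set instead of the full vocabulary.
d, V = 64, 10_000
rng = np.random.default_rng(1)
W = rng.standard_normal((V, d)).astype(np.float32)         # stand-in vocabulary embeddings
h = rng.standard_normal(d).astype(np.float32)              # stand-in hidden state
centroids, assign = cluster_head(W, n_clusters=64)
cand, logits = retrieval_logits(h, W, centroids, assign)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
token = cand[rng.choice(len(cand), p=probs)]
print(token)
```

The point of the sketch is the cost model: the dense head scores all V tokens per step, while the retrieval view scores only the candidates from the probed clusters, which is where the claimed speedup would come from.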
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22190