Keywords: Large Language Models (LLMs), Softmax, Error Correcting Output Codes (ECOC), Pretraining, Fine-tuning, Compression, Scalability
TL;DR: We replace the costly Softmax layer in LLMs with an ECOC-based framework, showing that efficient code designs can cut parameters by 50–67% while retaining most accuracy.
Abstract: Large Language Models (LLMs) have simplified natural language processing tasks by leveraging their ability to learn from massive volumes of data and generalize across a wide range of applications. However, as LLMs continue to scale in size and complexity, optimizing their computational efficiency has become a critical challenge. One of the major contributors to this complexity is the decision layer (referred to as the \textit{Softmax layer}), consisting of a fully connected network followed by a Softmax activation function, which scales linearly with vocabulary size and thus incurs high computational cost. In this work, we first propose a framework based on Error Correcting Output Coding (ECOC) that allows several decision-boundary formulation techniques, including Softmax, to be encoded and plugged in as the decision layer of an LLM. Using this framework, we analyze the minimal design strategy, which defines the simplest decision boundary with optimal computational efficiency, and propose and explore extensions to this strategy to study the accuracy-complexity trade-off against the Softmax-based strategy in both fine-tuning and pretraining settings. We show that it is possible to maintain 90\% of the Softmax accuracy when pretraining is an option, and to retain 83.38\% of the F1 score during fine-tuning, while using only 50\% of the decision-layer parameters in both cases. Further gains arise from extending the codeword length with random bits and from increasing the intermediate hidden dimensions in MTL-ECOC. Overall, this work establishes the viability of substantially reducing the computational and architectural complexity of the output layer, while formalizing the integration of the ECOC framework within LLMs.
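To make the parameter-count argument concrete, below is a minimal sketch (not the authors' implementation) of an ECOC-style decision layer in PyTorch. It assumes the simplest design, where each vocabulary item is assigned a binary codeword of length L = ceil(log2 V); the class name ECOCOutputHead and the binary-index codebook are illustrative assumptions, and the paper's extended designs (extra random bits, MTL-ECOC) differ.

```python
import math
from typing import Optional

import torch
import torch.nn as nn


class ECOCOutputHead(nn.Module):
    """Illustrative ECOC decision layer (hypothetical, not the paper's exact design).

    Predicts an L-bit codeword per token instead of a full V-way Softmax,
    so the output projection has hidden_dim x L parameters rather than
    hidden_dim x V.
    """

    def __init__(self, hidden_dim: int, vocab_size: int,
                 code_length: Optional[int] = None) -> None:
        super().__init__()
        # Minimal design: L = ceil(log2 V) bits; longer codes (e.g. with extra
        # random bits) add redundancy and hence error-correcting capability.
        self.code_length = code_length or math.ceil(math.log2(vocab_size))
        self.proj = nn.Linear(hidden_dim, self.code_length)
        # Codebook: one L-bit codeword per vocabulary item. Here it is simply
        # the binary encoding of the token index (an assumption for brevity).
        codes = torch.tensor(
            [[(i >> b) & 1 for b in range(self.code_length)]
             for i in range(vocab_size)],
            dtype=torch.float32,
        )
        self.register_buffer("codebook", codes)  # shape (V, L)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (..., hidden_dim) -> per-bit logits of shape (..., L)
        bit_logits = self.proj(hidden)
        # Score every codeword by correlating the predicted bits with the
        # {-1, +1}-mapped codebook; result has shape (..., V).
        signed_codes = self.codebook * 2.0 - 1.0
        return bit_logits @ signed_codes.t()
```

For example, with a hidden size of 4096 and a 50k-token vocabulary, the minimal code length is 16 bits, so the projection holds roughly 65K weights instead of the ~205M of a full Softmax layer; the 50\% reduction reported above corresponds to the paper's own, larger code designs rather than this minimal bound.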
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9552