LLM2Token: Distilling Large Language Models into Task-Specific Tokenizer

ICLR 2026 Conference Submission 18840 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Tokenizer, Distillation, Large language models (LLMs)
TL;DR: We present LLM to Tokenizer, a new distillation method that preserves the prior knowledge of LLMs in the form of a tokenizer
Abstract: We present LLM to Tokenizer (L2T), a new distillation method that preserves the prior knowledge of LLMs in the form of a tokenizer, diverging from neural-network-based methods. This simple, intuitive method achieves strong performance even when using only one-hot encoding and a single-layer logistic regression. Built on the generated tokenizer and without any pre-training, L2T surpasses GPT-2 with just 0.01\% of its parameters. On event recognition tasks, the L2T method achieves an F1 score of 0.408 while using only 0.1\% of the parameters of previous models. We also observe that stronger foundation models lead to better tokenizer performance, and that overly long tokenizers can hurt performance, since the capacity of a single-layer logistic regression is limited. This demonstrates a zero-shot capability of LLMs: through training on internet-scale corpora, they can recognize words that are important for specific tasks. We release all models, code, and the dataset to promote further exploration.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18840
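
As a rough illustration of the pipeline the abstract describes, the sketch below shows how an LLM-distilled, task-specific vocabulary could feed a one-hot encoding and a single-layer logistic regression. This is not the authors' released code: the vocabulary, example texts, and labels are hypothetical stand-ins, and the exact form of the distilled tokenizer is assumed to be a word list.

```python
# Minimal sketch, assuming the LLM-distilled "tokenizer" can be treated as a
# task-specific vocabulary and documents are encoded as binary one-hot
# (multi-hot) vectors over it, with a single-layer logistic regression on top.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical vocabulary an LLM might propose for an event recognition task.
llm_vocabulary = ["earthquake", "protest", "election", "flood", "concert"]

# One-hot encoding restricted to the distilled vocabulary.
vectorizer = CountVectorizer(vocabulary=llm_vocabulary, binary=True)

# Toy labeled examples (1 = event mentioned, 0 = no event); not from the paper.
texts = [
    "A magnitude 6.1 earthquake struck the coast overnight.",
    "The recipe calls for two cups of flour and a pinch of salt.",
    "Thousands joined the protest outside the parliament building.",
    "She spent the afternoon reading quietly in the garden.",
]
labels = [1, 0, 1, 0]

# The only trained component: a single-layer logistic regression.
X = vectorizer.transform(texts)
clf = LogisticRegression().fit(X, labels)

print(clf.predict(vectorizer.transform(["A flood warning was issued today."])))
```

Under this reading, all task knowledge lives in the vocabulary itself, which is why the classifier can stay tiny relative to a pre-trained language model.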