PinTok: Tokenizers Deserve Dedicated Pinned CPU-Compute and Memory

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Faster Tokenizer, Machine Learning System, CPUs, Network Offload
TL;DR: We present a fast, efficient, and fungible architecture for tokenizers that removes redundant system overheads, yielding lower end-to-end latency for models and applications that depend on tokenization.
Abstract: Tokenization is the first point of contact between large language models (LLMs) and text data, yet it is rarely treated as a component of LLMs worth accelerating. During inference, tokenizers typically rely on simple dictionary lookups and run on CPUs as standard processes. This approach, however, introduces significant overhead from scheduling delays, core selection, data copying, and other system-level costs. These inefficiencies become problematic in latency-sensitive applications such as embedding, small language models, and agentic AI. In this paper, we present the Pinned Tokenizer (PinTok), a novel tokenizer architecture that reduces redundant hardware, operating system, and networking overhead through three key innovations: core and memory pinning, avoidance of scheduling and context switches, and avoidance of duplicate network packet copying and processing. Our implementation of PinTok can serve as a drop-in replacement for existing tokenizer deployments, delivering latency reductions of up to 95\% (average), 97\% (P50), 94\% (P90), and 87\% (P99), along with throughput improvements of up to 2,084\%.
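The abstract's first innovation, core pinning, can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the general idea on Linux using Python's standard `os.sched_setaffinity`, with a hypothetical helper name `pin_to_core`:

```python
import os

def pin_to_core(core_id: int) -> set:
    """Pin the calling process to a single CPU core (Linux only).

    Restricting the scheduler to one core avoids migrations,
    which is the kind of context-switch and cache-locality cost
    that a pinned tokenizer seeks to eliminate.
    """
    os.sched_setaffinity(0, {core_id})   # 0 = the calling process
    return os.sched_getaffinity(0)       # report the resulting affinity mask

affinity = pin_to_core(0)
print(affinity)
```

A real deployment would also pin memory (e.g. via `mlock`) and dedicate the core at boot time (e.g. kernel `isolcpus`), but the affinity call above captures the core-pinning step in its simplest form.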
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 14257