SPEAK: Spiking Neurons as an Entropy-Aware Tokenizer for Large Language Models

ACL ARR 2026 January Submission 396 Authors

22 Dec 2025 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: tokenization, gradient-based tokenizer, spiking neuron, generation entropy
Abstract: Tokenizers play a critical role in large language model research. Despite recent advances, existing tokenizers neither explicitly leverage historical tokenization results when making subsequent token decisions nor selectively utilize such history based on contextual relevance. We propose SPEAK, a tokenizer that integrates spiking neurons to explicitly leverage historical tokenization results. Furthermore, we introduce an entropy-aware reset mechanism that selectively leverages history based on contextual relevance, as determined by token-level entropy. High-entropy tokens are treated as contextual boundaries, whereas low-entropy tokens between consecutive boundaries exhibit strong contextual relevance. Accordingly, we induce a hard reset at high-entropy tokens to discard irrelevant historical tokenization results, and a soft reset at low-entropy tokens to preserve and leverage relevant history. Experiments on two language models and five datasets spanning 16 languages demonstrate superior cross-lingual adaptability with competitive performance and efficiency. Our code is publicly available at https://anonymous.4open.science/r/ACL-ARR-2026.
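To make the entropy-aware reset mechanism concrete, the sketch below illustrates the control flow the abstract describes: a spiking-neuron-like state accumulates historical tokenization evidence, is hard-reset at high-entropy tokens (contextual boundaries) and soft-reset at low-entropy tokens. This is a minimal illustration, not the paper's implementation: the scalar `state`, the entropy threshold `tau`, and the decay factor `alpha` are hypothetical stand-ins; the actual model presumably uses learned, vector-valued neuron states.

```python
import math


def token_entropy(probs):
    """Shannon entropy of a next-token distribution.

    Hypothetical helper: the paper computes token-level entropy from the
    language model's output distribution; exact details are not shown here.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_aware_reset_sketch(token_entropies, tau=2.0, alpha=0.9):
    """Minimal sketch of SPEAK's entropy-aware reset over a spiking state.

    `state` stands in for accumulated historical tokenization results
    (a membrane-potential-like scalar). `tau` (boundary threshold) and
    `alpha` (soft-reset decay) are assumed hyperparameters, not values
    from the paper.
    """
    state = 0.0
    states = []
    for h in token_entropies:
        if h > tau:
            # High-entropy token: contextual boundary -> hard reset,
            # discarding irrelevant history.
            state = 0.0
        else:
            # Low-entropy token: strong contextual relevance -> soft
            # reset, decaying but preserving relevant history.
            state = alpha * state
        # Integrate the current token's signal (entropy used here as a
        # stand-in input for illustration only).
        state += h
        states.append(state)
    return states


# Example: a low-entropy run interrupted by one high-entropy boundary;
# the state is wiped at the third token and rebuilt afterwards.
print(entropy_aware_reset_sketch([0.5, 0.4, 3.1, 0.6, 0.5]))
```

Under these assumptions, history propagates only within a low-entropy span: the hard reset at a boundary token guarantees that tokenization decisions after the boundary are independent of what came before it.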
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: subword representations
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: en, fi, he, ar, bg, de, el, es, fr, hi, ru, sw, tr, ur, vi, pt
Submission Number: 396