Keywords: Language Models, Transformers, Sequential Computation Distillation, Token Merging, Surrogate Embeddings, Context Reduction, KV Cache Reduction, Inference Efficiency
TL;DR: We introduce a lightweight compression method that dynamically replaces spans of input tokens with single surrogate embeddings, reducing sequential computation in Transformers while preserving performance.
Abstract: Transformer language models process input sequences token by token, incurring significant computation even when adjacent tokens are semantically redundant or compressible. We introduce a method for distilling sequential computation by replacing spans of input tokens with collapsed representations, computed on the fly by a shared, lightweight merge module. This module generates a single surrogate embedding from static token embeddings that captures the functional role of multiple tokens. Because it does not rely on model internals or context, pre-trained models can operate on compressed inputs without architectural changes or re-training. We apply this approach during inference to compress both prompts and intermediate decoding steps, using a rollback mechanism to substitute stored multi-token KV cache entries with their single-step surrogates. Experiments with GPT-2 XL, LLaMA 3.1 8B, LLaMA 3.2 1.5B, and DeepScaleR across language modeling and downstream tasks (question answering, summarization, math reasoning) show up to 40% reduction in effective sequence length, with minimal accuracy degradation. These results indicate that sequential token computation in Transformers can be effectively approximated by condensed surrogate representations that preserve functional input behavior without updating the model.
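Illustrative sketch (not the authors' implementation): the input-level compression step described in the abstract can be pictured as replacing each selected span of static token embeddings with one surrogate vector. The abstract does not specify the merge module's architecture or how spans are chosen, so the code below makes two loud assumptions: mean pooling stands in for the learned merge module, and the spans are supplied by hand. The names merge_span and compress are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def merge_span(embeddings: Sequence[Vector]) -> Vector:
    """Collapse a span of static embeddings into one surrogate vector.

    ASSUMPTION: mean pooling is a placeholder for the learned,
    shared merge module; the real module is not described here.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def compress(embeddings: Sequence[Vector],
             spans: Sequence[Tuple[int, int]]) -> List[Vector]:
    """Replace each half-open span [start, end) with a single surrogate.

    ASSUMPTION: spans are sorted, non-overlapping, and given by the
    caller; how spans are selected is outside this sketch.
    """
    out: List[Vector] = []
    pos = 0
    for start, end in spans:
        out.extend(embeddings[pos:start])          # untouched tokens
        out.append(merge_span(embeddings[start:end]))  # one surrogate
        pos = end
    out.extend(embeddings[pos:])                   # trailing tokens
    return out


# Toy 2-dimensional "static embeddings" for six tokens (made-up values).
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0],
          [2.0, 2.0], [4.0, 0.0], [1.0, 1.0]]
compressed = compress(tokens, [(0, 2), (3, 5)])

# Fraction of sequence positions removed in this toy example.
reduction = 1 - len(compressed) / len(tokens)
```

In this toy input, six positions become four: two surrogates plus two untouched tokens. The compressed list would then be fed to an unmodified pre-trained model in place of the original embeddings; the KV-cache rollback used during decoding is not covered by this sketch.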
Primary Area: generative models
Submission Number: 23268