Gearbx: Entropy-Routed Dynamic Quantization for LLM Inference

Published: 28 May 2026, Last Modified: 28 May 2026. OpenReview Archive Direct Upload. License: CC BY 4.0
Abstract: Not all tokens are equally hard to generate. Common tokens such as "the" or "of" are produced from sharply peaked output distributions, while mathematical reasoning, rare vocabulary, and creative transitions produce broader distributions. Static quantization ignores this variance and applies one bit-width to every token. Gearbx treats quantization precision as a per-token decision. At each decoding step it reads the Shannon entropy of the output-logit distribution and routes the next forward pass through one of three precision tiers: 4-bit packed weights for confident tokens, 8-bit weights for moderate-uncertainty tokens, and fp16 weights for difficult tokens. When shifting down, the system physically replaces nn.Linear modules with packed QuantizedLinear buffers and offloads the original fp16 weights to CPU, producing real device-memory reduction. Because autoregressive decoding is memory-bandwidth-bound, fewer bytes per weight can translate directly into faster token generation when fused packed-weight kernels are used. Three design choices are new in combination: output-logit entropy as the sole routing signal, physical module replacement with memory offloading at inference time, and gear-oscillation suppression through rolling-window averaging, hysteresis, and minimum gear duration. The current implementation targets Apple Silicon via MPS and MLX, plus CPU execution on ARM NEON, focusing on consumer hardware where memory capacity and bandwidth are binding constraints.
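The routing loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `logit_entropy` and `GearRouter`, the entropy thresholds (`low`, `high`), the hysteresis `margin`, the window size, the minimum hold length, and the choice to start in fp16 are all hypothetical values picked for illustration. The sketch only shows how rolling-window averaging, hysteresis, and a minimum gear duration might combine to suppress oscillation between the three tiers.

```python
import math
from collections import deque

# Precision tiers ordered from lowest to highest precision (labels are illustrative).
GEARS = ("int4", "int8", "fp16")


def logit_entropy(logits):
    """Shannon entropy, in nats, of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class GearRouter:
    """Hypothetical entropy router with smoothing, hysteresis, and a minimum hold."""

    def __init__(self, low=1.0, high=3.0, margin=0.25, window=4, min_hold=3):
        self.bounds = (low, high)      # assumed boundaries: int4|int8 and int8|fp16
        self.margin = margin           # assumed hysteresis band around each boundary
        self.history = deque(maxlen=window)
        self.min_hold = min_hold       # steps a gear must be held before it may change
        self.gear = len(GEARS) - 1     # assumption: start in the highest-precision tier
        self.held = 0

    def _target(self, h):
        g = self.gear
        # Shift up only when the smoothed entropy clears the boundary plus the margin.
        while g < len(GEARS) - 1 and h > self.bounds[g] + self.margin:
            g += 1
        # Shift down only when it falls below the boundary minus the margin.
        while g > 0 and h < self.bounds[g - 1] - self.margin:
            g -= 1
        return g

    def step(self, logits):
        """Record one decoding step and return the tier for the next forward pass."""
        self.history.append(logit_entropy(logits))
        smoothed = sum(self.history) / len(self.history)
        self.held += 1
        if self.held >= self.min_hold:
            target = self._target(smoothed)
            if target != self.gear:
                self.gear = target
                self.held = 0
        return GEARS[self.gear]


if __name__ == "__main__":
    router = GearRouter()
    peaked = [50.0] + [0.0] * 63   # confident token: near-zero entropy
    flat = [0.0] * 64              # uncertain token: entropy of ln(64) nats
    for logits in [peaked] * 3 + [flat] * 6:
        print(router.step(logits))
```

In this sketch a single high-entropy token does not trigger an immediate upshift, because the window average has to cross the boundary plus the margin and the current gear must already have been held for `min_hold` steps. In a real system the returned tier would select which weight representation the next forward pass uses; that part (module replacement and CPU offloading) is deliberately left out here.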