NyoomFloat12: Lossless 12-bit Weight Compression for Post-Training Inference

Published: 03 Mar 2026, Last Modified: 03 Mar 2026. License: CC BY 4.0
Keywords: efficiency, inference acceleration, lossless compression, LLM inference
Abstract: Scaling reinforcement learning post-training requires generating rollouts efficiently, but GPU memory constrains both the speed and scale of this generation. We present NyoomFloat12 (NF12), a lossless 12-bit fixed-length format for BF16 weights that addresses two GPU memory constraints simultaneously. By analyzing per-bit entropy in trained weight exponents, we find that the upper exponent nibble is nearly constant ($\texttt{0111}$); dropping it yields a fixed-length format with a 13-operation, SIMT-friendly decode. In the memory-bound regime, fusing decode into GEMV achieves up to 1.28$\times$ speedup, approaching the 1.33$\times$ theoretical maximum. In the memory-constrained regime, the 25\% weight footprint reduction frees capacity for larger batch sizes: on Qwen3-32B with SGLang, buffered NF12 completes 64 rollouts 1.5$\times$ faster by doubling the maximum batch size. NF12 is lossless: escape groups are zeroed during bulk decode and overwritten with the stored originals, producing bitwise-exact BF16 output.
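The nibble-dropping idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: a BF16 word is laid out as [sign:1 | exponent:8 | mantissa:7], so the upper exponent nibble occupies bits 14..11; words whose upper nibble is not $\texttt{0111}$ are treated as escapes and kept verbatim. Function and constant names here are hypothetical.

```python
# Hypothetical sketch of the NF12 encode/decode idea (not the paper's code).
# BF16 bit layout assumed: bit 15 = sign, bits 14..7 = exponent, bits 6..0 = mantissa.

UPPER_NIBBLE = 0b0111  # the near-constant upper exponent nibble reported in the paper


def nf12_encode(bf16_bits: int):
    """Return (12-bit code, None) for in-range words, or (None, original) for escapes."""
    upper = (bf16_bits >> 11) & 0xF          # bits 14..11: upper exponent nibble
    if upper != UPPER_NIBBLE:
        return None, bf16_bits               # escape: store the full 16 bits elsewhere
    sign = (bf16_bits >> 15) & 0x1
    low = bf16_bits & 0x7FF                  # lower exponent nibble + 7-bit mantissa
    return (sign << 11) | low, None


def nf12_decode(code: int) -> int:
    """Reinsert the constant upper exponent nibble to recover the exact BF16 bits."""
    sign = (code >> 11) & 0x1
    low = code & 0x7FF
    return (sign << 15) | (UPPER_NIBBLE << 11) | low
```

For example, 1.0 in BF16 is `0x3F80` (exponent `0x7F`, upper nibble `0111`), so it round-trips through the 12-bit path bit-exactly, while `0x7F80` (+inf, upper nibble `1111`) falls back to the escape path, mirroring the abstract's zero-then-overwrite handling of escape groups.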
Submission Number: 92