FractalLLM: Lossless Self-Speculative Decoding with Layer Embedded Self-Compression

ACL ARR 2025 May Submission 2510 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Autoregressive decoding in large language models (LLMs) requires a full forward pass for each generated token, which significantly increases inference latency. To address this limitation, we propose FractalLLM, a lossless self-speculative decoding method that embeds a compressed model within selected decoder layers of the original model. Specifically, the injected compressed layers generate multiple draft tokens in parallel; these draft tokens are then verified through a single forward pass of the original model, ensuring that the final outputs exactly match those produced by the original model. Experimental results across diverse benchmarks, including GSM8K, XSum, CNN/DailyMail, and HumanEval, demonstrate that our method achieves substantial inference speed-ups (up to 2.47×) compared to standard autoregressive decoding, without requiring any additional training.
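For readers unfamiliar with the draft-and-verify loop the abstract describes, the sketch below illustrates lossless speculative decoding in its simplest, greedy form. The callables `draft_model` and `target_model`, the helper `speculative_generate`, and the draft window size `k` are illustrative assumptions, not the paper's API; in FractalLLM the role of the draft model is played by compressed layers embedded inside the original decoder rather than a separate network.

```python
# Minimal sketch of lossless (greedy) speculative decoding.
# Assumptions: `draft_model` and `target_model` are callables that take a
# non-empty list of token ids and return a (seq_len, vocab) tensor of
# next-token logits, where row j scores the token at position j + 1.
# These names and signatures are illustrative, not the authors' implementation.
import torch

@torch.no_grad()
def speculative_generate(target_model, draft_model, prefix, max_new_tokens=128, k=4):
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        # 1) Draft k candidate tokens cheaply with the draft (compressed) model.
        draft = []
        for _ in range(k):
            logits = draft_model(tokens + draft)
            draft.append(int(logits[-1].argmax()))

        # 2) Verify all k drafts with a single forward pass of the original model.
        logits = target_model(tokens + draft)
        n_accept = 0
        for i, tok in enumerate(draft):
            # Compare each draft token with the target model's greedy choice
            # at the same position.
            if int(logits[len(tokens) + i - 1].argmax()) != tok:
                break
            n_accept += 1

        # 3) Keep the accepted draft prefix, then append the target model's own
        #    next token: a correction on mismatch, or a free "bonus" token when
        #    every draft was accepted. The output therefore matches plain greedy
        #    decoding of the target model exactly.
        tokens += draft[:n_accept]
        tokens.append(int(logits[len(tokens) - 1].argmax()))
    return tokens[: len(prefix) + max_new_tokens]
```

Because every accepted token is re-checked against the original model's own greedy choice, and the first mismatch is replaced by that choice, the generated sequence is identical to standard autoregressive decoding; the speed-up comes from verifying up to k drafts in one forward pass of the original model.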
Paper Type: Short
Research Area: Generation
Research Area Keywords: inference methods, text-to-text generation, model architectures, efficient models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Keywords: Speculative Decoding, Natural Language Generation, Model Compression, Efficient Model
Submission Number: 2510