Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

ACL ARR 2026 January Submission10382 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Collaborative Decoding
Abstract: Collaborative decoding between large and small language models (LLMs/SLMs) is a key strategy to overcome LLM limitations in training and inference efficiency. While methods like speculative decoding and proxy tuning exist, a unifying understanding of these approaches is needed. Inspired by dual-process theory, we introduce FS-GEN, a framework defining LLMs as ``System 2'' (deliberate) and SLMs as ``System 1'' (intuitive). FS-GEN provides a unified lens to analyze collaborative decoding, revealing that minimal System 2 intervention ($<20\%$ of tokens in the generated completions) is often sufficient. We uncover a parameter-ratio scaling law governing this interaction and demonstrate that the effectiveness of collaboration hinges on the uncertainty of System 1's next-token predictions. This uncertainty-centric view offers novel insights into optimizing collaborative decoding and developing more efficient and reliable language generation systems.
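The abstract's central claim, that collaboration should be gated on the uncertainty of the small model's next-token predictions, can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the entropy gate, the threshold value, and the toy distributions below are illustrative assumptions, standing in for real SLM/LLM logits.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def collaborative_step(slm_probs, slm_token, llm_token, threshold=1.0):
    """Uncertainty-gated decoding step (illustrative, not the paper's method):
    defer to the large model ("System 2") only when the small model's
    ("System 1") next-token distribution is high-entropy. The threshold
    is a hypothetical hyperparameter."""
    if entropy(slm_probs) > threshold:
        return llm_token, True   # System 2 intervenes on this token
    return slm_token, False      # System 1 decodes alone

# Toy distributions: a confident vs. an uncertain SLM prediction.
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy -> no intervention
uncertain = [0.30, 0.28, 0.22, 0.20]   # high entropy -> LLM intervenes
```

Under such a gate, System 2 is invoked only on the minority of high-uncertainty tokens, which is consistent with the abstract's observation that intervening on fewer than 20% of generated tokens often suffices.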
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: inference methods;interactive and collaborative generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 10382