Keywords: computational efficiency, accelerating large model inference, speculative decoding
Abstract: Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning tasks.
Tree-search-based reasoning methods address this by encouraging the exploration of intermediate steps, surpassing the capabilities of chain-of-thought prompting.
However, these methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths.
This paper introduces SEED, a novel and efficient inference framework that concurrently improves runtime speed and GPU memory management.
Built on scheduled speculative execution, SEED efficiently handles the multiple iterations of thought generation and state evaluation, leveraging a rounds-scheduled strategy to manage draft-model dispatching.
Extensive experimental evaluations on three reasoning datasets demonstrate the superior speedup achieved by SEED.
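To make the rounds-scheduled idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes a single draft model that is dispatched to one reasoning branch per round, with hypothetical placeholder functions (`draft_propose`, `target_verify`) standing in for the actual draft and target LLM calls.

```python
# Minimal sketch of rounds-scheduled speculative execution (assumed design,
# not the SEED codebase). A single draft model is rotated across reasoning
# branches in rounds: each round it proposes speculative tokens for one
# branch, which the target model then verifies.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Branch:
    """One thought path in the reasoning tree (hypothetical structure)."""
    prompt: str
    tokens: list = field(default_factory=list)
    done: bool = False

def draft_propose(branch, k):
    # Placeholder for the small draft model proposing k speculative tokens.
    return [f"<draft_{len(branch.tokens) + i}>" for i in range(k)]

def target_verify(branch, draft_tokens):
    # Placeholder for the large target model verifying draft tokens;
    # accepting all but the last mimics partial acceptance.
    accepted = draft_tokens[:-1]
    branch.tokens.extend(accepted)
    if len(branch.tokens) >= 8:   # toy stopping criterion
        branch.done = True
    return accepted

def rounds_scheduled_decode(branches, k=4):
    """Rotate the single draft model over active branches in rounds."""
    queue = deque(branches)
    while queue:
        branch = queue.popleft()           # dispatch the draft model to this branch
        drafts = draft_propose(branch, k)  # speculative proposal
        target_verify(branch, drafts)      # verification by the target model
        if not branch.done:
            queue.append(branch)           # re-schedule the branch for a later round
    return branches

if __name__ == "__main__":
    paths = [Branch(prompt=f"thought path {i}") for i in range(3)]
    for b in rounds_scheduled_decode(paths):
        print(b.prompt, "->", len(b.tokens), "tokens accepted")
```

In this sketch, the round-robin queue is what keeps only one draft-model instance resident while still serving every branch of the reasoning tree, which is the intuition behind improving GPU memory management alongside runtime speed.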
Submission Number: 37