InferSpec: Adaptive Inference-Time Compute with Ensemble Verifier-Guided Speculative Decoding for Efficient Reasoning
Keywords: Inference-time Compute, Speculative Decoding, Ensemble Verifiers, Multi-Step Reasoning
TL;DR: InferSpec improves speculative decoding by adaptively verifying reasoning steps using internal measures of confidence and grounding, achieving higher accuracy on reasoning tasks while maintaining efficiency without external verifiers.
Abstract: Large language models (LLMs) are effective at multi-step reasoning but suffer from high inference costs, making efficient deployment challenging. Speculative decoding (SD) reduces latency by letting a lightweight draft model propose tokens that a stronger target model verifies, but its token-centric nature allows subtle flaws in intermediate reasoning steps to propagate, ultimately producing incorrect final outputs.
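To make the token-centric limitation concrete, here is a minimal sketch of the standard SD acceptance test (the usual min(1, p_target/p_draft) rule, not InferSpec's mechanism): each drafted token is accepted or rejected purely on its own probabilities, with no check that the surrounding reasoning step is sound.

```python
import random

def sd_accept_token(p_target, p_draft):
    """Standard speculative-decoding token test: accept a drafted token
    with probability min(1, p_target / p_draft). The decision is purely
    token-local -- it never inspects the reasoning step as a whole,
    which is the gap InferSpec's step-level verification targets."""
    return random.random() < min(1.0, p_target / p_draft)
```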
Existing approaches, such as reward-guided SD, rely on external pre-trained reward models, which increase latency and limit generalizability. To overcome these limitations, we propose InferSpec, a mathematically grounded, verification-aware framework for adaptive inference-time compute allocation.
At each step, InferSpec samples multiple draft candidates and applies a self-consistency selector to choose a representative one. It then evaluates the selected step using two model-internal criteria: (i) Attention-Based Grounding Verification (ABGV), which computes grounding scores from attention rollout matrices to ensure attribution to inputs or prior steps, and (ii) Log-Probability-Based Verification (LPBV), which bounds token-level confidence.
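The two model-internal signals can be sketched as follows, assuming ABGV uses the standard attention-rollout recursion (head-averaged attention with residual connections, multiplied across layers) and LPBV bounds confidence via the minimum token log-probability; function names, the residual weighting, and the min-based bound are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attention_rollout(attn_layers):
    """Compose per-layer attention maps (averaged over heads, mixed with
    the identity for residual connections) into an end-to-end attribution
    matrix -- a standard attention-rollout sketch, assumed for ABGV."""
    rollout = np.eye(attn_layers[0].shape[-1])
    for attn in attn_layers:
        a = attn.mean(axis=0)                      # average attention heads
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])    # account for residual path
        a = a / a.sum(axis=-1, keepdims=True)      # keep rows stochastic
        rollout = a @ rollout
    return rollout

def grounding_score(rollout, step_tokens, context_tokens):
    """ABGV-style signal: average attribution mass that the step's tokens
    place on the input or prior-step tokens."""
    return float(rollout[np.ix_(step_tokens, context_tokens)].sum(axis=-1).mean())

def confidence_score(token_logprobs):
    """LPBV-style signal: a lower bound on token-level confidence via the
    weakest token's probability (one plausible instantiation)."""
    return float(np.exp(min(token_logprobs)))
```

Because the rollout rows stay stochastic, the grounding score lies in [0, 1] and is directly comparable to the probability-valued confidence score.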
These signals form a weighted ensemble score with formal guarantees that only grounded, high-confidence steps are accepted; uncertain steps escalate to the target model, allocating compute selectively.
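The accept-or-escalate decision described above can be sketched as a thresholded weighted sum of the two signals; the weight and threshold values below are illustrative placeholders, not the paper's calibrated parameters.

```python
def ensemble_accept(grounding, confidence, w=0.5, tau=0.7):
    """Weighted ensemble of the grounding and confidence signals
    (w and tau are hypothetical values for illustration)."""
    score = w * grounding + (1.0 - w) * confidence
    return score >= tau

def verify_step(grounding, confidence):
    """Route a draft step: accept it when the ensemble score clears the
    threshold, otherwise escalate the step to the target model."""
    return "accept_draft" if ensemble_accept(grounding, confidence) else "escalate_to_target"
```

This routing is what allocates compute selectively: only uncertain or poorly grounded steps pay the cost of target-model verification.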
Experiments on MATH500, GSM8K, Gaokao-2023-En, and OlympiadBench show that InferSpec improves accuracy by up to 3.6% while reducing latency by 11%, consistently outperforming both standard SD and reward-guided SD.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16181