<div align="center">

# SpecExit: Accelerating Large Reasoning Model via Speculative Exit

</div>

**SpecExit** is a novel framework that accelerates large reasoning models (LRMs) by integrating an efficient early-exit mechanism with speculative decoding. It addresses the "overthinking" problem in LRMs, where models produce unnecessarily long outputs, leading to high inference latency.

Our method trains a lightweight draft model to predict not only future tokens but also an early-exit signal directly from its hidden states. This unique design eliminates the probing overhead found in other early-exit methods, achieving significant speedups without compromising task accuracy.

This repository contains the official implementation and resources for the paper: **SpecExit: Accelerating Large Reasoning Model via Speculative Exit**.

---

## 🚀 Abstract

Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment.  
To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems.  
Inspired by the use of hidden states in speculative decoding, we propose **SpecExit**, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead.  
SpecExit reduces generation length by up to 66% and delivers up to a 2.5× end-to-end speedup over the speculative decoding baseline, without compromising accuracy.  
Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning.


## ✨ Key Highlights

- **Signals Extracted for Early Exit**: We derive early-exit signals from hidden features and integrate them into speculative decoding, enabling reliable early exit for efficient reasoning. 
- **General and Practical Framework**: We implement SpecExit, a reasoning-aware early-exit framework, in both PyTorch and vLLM, making it easy to deploy across diverse inference environments. 
- **Substantial End-to-End Performance Gains**: SpecExit reduces reasoning length by up to 66% and achieves up to a 2.5× end-to-end speedup over speculative decoding while maintaining accuracy.

<p align="center">
  <img src="figs/method_overall.png" width="80%" />
  <br>
  <b>Figure 1:</b> An overview of the SpecExit framework.
</p>
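
The core idea, that the draft model's hidden state feeds both a token predictor and an early-exit predictor, can be illustrated with a toy sketch. It is not the repository's implementation: the class name `ToyDraftHead`, the single scalar exit probability, and the layer sizes are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import random


def linear(weights, bias, x):
    """Compute W @ x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


class ToyDraftHead:
    """Hypothetical two-output head: token logits plus an exit probability."""

    def __init__(self, hidden_size, vocab_size, seed=0):
        rng = random.Random(seed)

        def init():
            return rng.uniform(-0.1, 0.1)

        # Projection from the hidden state to vocabulary logits.
        self.token_w = [[init() for _ in range(hidden_size)]
                        for _ in range(vocab_size)]
        self.token_b = [0.0] * vocab_size
        # Single-row projection from the same hidden state to an exit score.
        self.exit_w = [[init() for _ in range(hidden_size)]]
        self.exit_b = [0.0]

    def forward(self, hidden):
        token_logits = linear(self.token_w, self.token_b, hidden)
        exit_prob = sigmoid(linear(self.exit_w, self.exit_b, hidden)[0])
        return token_logits, exit_prob


head = ToyDraftHead(hidden_size=8, vocab_size=16)
hidden_state = [0.5] * 8  # stand-in for a real draft-model hidden state
token_logits, exit_prob = head.forward(hidden_state)
```

Because both outputs come from the same hidden state in one forward pass, no separate probing step is needed in this sketch to obtain the exit signal.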


## 🛠️ Installation

1.  Download and unpack the repository:
    ```bash
    wget -O SpecExit.zip https://anonymous.4open.science/api/repo/SpecExit-B802/zip
    unzip SpecExit.zip -d SpecExit
    cd SpecExit 
    ```

2.  Install the required dependencies:
    ```bash
    pip install -r requirements.txt
    ```

## ⚡ Quick Start

### 1. Training

The training process involves fine-tuning a draft model with an auxiliary prediction head for the early-exit signal. Use the provided DeepSpeed script to start training.

-   **Configure**: Set your model paths and training data paths in `configs/ds_config.json` and `scripts/train_eagle3_qwen3_4B_cpr.sh`.
-   **Run Training**:
    ```bash
    bash scripts/train_eagle3_qwen3_4B_cpr.sh
    ```

This script will handle the distributed training and save the checkpoints to the specified directory.
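
As a rough illustration of training a draft model with an auxiliary early-exit head, the objective can be thought of as a token-prediction loss plus a weighted exit-signal loss. The sketch below is an assumption, not the loss used by the training script: the function names, the binary exit label, and the `exit_weight` parameter are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one token prediction, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def joint_loss(token_logits, target_token, exit_prob, exit_label,
               exit_weight=0.5):
    """Token loss plus a weighted auxiliary exit loss (weight is hypothetical)."""
    token_loss = cross_entropy(token_logits, target_token)
    exit_loss = binary_cross_entropy(exit_prob, exit_label)
    return token_loss + exit_weight * exit_loss


# Uniform logits over 4 tokens and an uninformative exit probability of 0.5.
loss = joint_loss([0.0, 0.0, 0.0, 0.0], target_token=1,
                  exit_prob=0.5, exit_label=1)
```

With uniform logits over four tokens and an exit probability of 0.5, the loss equals `log(4) + 0.5 * log(2)`, which is a convenient sanity check for the two terms.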

### 2. Inference and Evaluation

To run inference with the trained `SpecExit` draft model and evaluate it on benchmarks such as GSM8K and GPQA, use the `gen_ea_answer.py` script, launched through `scripts/gen_ea_answer.sh`.

-   **Configure**: Modify `scripts/gen_ea_answer.sh` to set the paths for your base model (`--base-model-path`) and the trained `SpecExit` draft model (`--ea-model-path`). You can also adjust parameters like `--exit-threshold`.
-   **Run Inference**:
    ```bash
    bash scripts/gen_ea_answer.sh
    ```

The script will generate answers for the specified benchmark and save the results in the `outputs/` directory.
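
To give an intuition for what a threshold such as `--exit-threshold` might control, the sketch below stops a reasoning trace once a predicted exit probability crosses the threshold. It is a simplified assumption rather than the logic in `gen_ea_answer.py`: the function name, the `>=` comparison, the precomputed `(token, exit_prob)` pairs, and the `</think>` end-of-reasoning marker are all hypothetical.

```python
def generate_with_early_exit(steps, exit_threshold, end_of_thinking="</think>"):
    """Consume (token, exit_prob) pairs and stop reasoning early.

    `steps` stands in for a decoding loop; each item is the next accepted
    token together with the exit probability predicted at that step.
    """
    output = []
    for token, exit_prob in steps:
        output.append(token)
        if token == end_of_thinking:
            break  # the model ended its reasoning on its own
        if exit_prob >= exit_threshold:
            output.append(end_of_thinking)  # force the end of reasoning
            break
    return output


trace = [("a", 0.1), ("b", 0.4), ("c", 0.9), ("d", 0.95)]
early = generate_with_early_exit(trace, exit_threshold=0.8)
full = generate_with_early_exit(trace, exit_threshold=1.1)
```

In this sketch a lower threshold exits sooner and shortens the output, while a threshold above 1.0 disables early exit entirely.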

## 📈 Results

Across multiple reasoning benchmarks, SpecExit shortens generation by up to 66% and delivers up to a 2.5× end-to-end speedup over the speculative decoding baseline without compromising accuracy, giving a better trade-off between accuracy and latency.


<p align="center">
  <img src="figs/main_result.png" width="80%" />
  <br>
  <b>Figure 2:</b> Main experimental results.
</p>

## 📄 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
