# Thinker: Learning to Think Fast and Slow

<p align="center">
  <img src="figure/logo.png" alt="Thinker-Task Logo" width="600"/>
</p>

This repository contains the code and resources for the paper **"Thinker: Learning to Think Fast and Slow"**.

Our work introduces the **Thinker task**, a novel four-stage Reinforcement Learning (RL) approach for question-answering (QA) designed to enhance the reasoning capabilities of Large Language Models (LLMs) by explicitly training distinct cognitive abilities: intuition (Fast Thinking), evaluation (Verification), refinement (Slow Thinking), and integration (Summarization).

<p align="center">
  <img src="figure/teaser.png" alt="Thinker-Task Training Performance Teaser" width="800"/>
  <br/>
  <em>Figure: Evaluation accuracy for the Qwen2.5-1.5B (left) and DeepSeek-R1-Distill-Qwen-1.5B (right) models.</em>
</p>


## Evaluation Results

Performance comparison across mathematical reasoning benchmarks. All scores are Pass@1 accuracy (%), averaged over 16 samples per problem. The top score in each benchmark column, within each model group, is in bold.
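
To make the metric concrete, the sketch below shows one way to compute Pass@1 averaged over several samples: take the fraction of correct samples for each problem, average over problems, and express the result as a percentage. The function name and input layout are illustrative assumptions, not the repository's evaluation code.

```python
from statistics import mean


def pass_at_1(correct: list[list[bool]]) -> float:
    """Mean per-problem accuracy, in percent.

    correct[i][j] is True if the j-th sampled answer to problem i is right.
    (Hypothetical input layout, chosen for illustration.)
    """
    if not correct:
        raise ValueError("need at least one problem")
    # Fraction of correct samples for each problem.
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * mean(per_problem)


# Two problems with 4 samples each (the table above uses 16 per problem).
results = [
    [True, True, False, True],    # 3 of 4 correct
    [False, False, False, True],  # 1 of 4 correct
]
print(pass_at_1(results))  # 50.0
```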

| **Method** | **MATH 500** | **AIME 2024** | **AIME 2025** | **GPQA Diamond** | **OlympiadBench** | **AMC 23** | **Minerva Math** | **College Math** | **Avg.** |
| :------------------------------------- | :----------: | :-----------: | :-----------: | :--------------: | :----------------: | :--------: | :--------------: | :--------------: | :------: |
| **_Qwen2.5-1.5B (Q1.5B)_** |              |               |               |                  |                    |            |                  |                  |          |
| Pretrained                             |     9.05     |     0.00      |     0.00      |       4.55       |        3.09        |    4.06    |       2.30       |       7.40       |   3.81   |
| Baseline                               |    57.98     |     3.33      |   **3.33** |      21.46       |       24.54        |   34.38    |      17.78       |      36.21       |  24.88   |
| Thinker                                |  **64.25** |   **6.25** |     2.50      |      23.74       |     **28.11** | **40.62** |      19.03       |     **38.33** | **27.85**|
| Thinker-Fast                           |    61.60     |   **6.25** |     2.50      |    **26.39** |       24.78        |   35.94    |      18.66       |      37.85       |  26.75   |
| ORZ                                    |    58.00     |     3.50      |     1.00      |      16.80       |         -          |     -      |        -         |        -         |    -     |
| SimpleRL                               |    59.00     |     4.20      |       -       |        -         |       21.00        |   35.00    |    **20.20** |        -         |    -     |
| **_DeepSeek-R1-Distill-Qwen-1.5B (R1.5B)_** |              |               |               |                  |                    |            |                  |                  |          |
| Pretrained                             |    76.21     |    17.50      |    17.92      |      13.76       |       37.46        |   55.94    |      24.82       |      38.85       |  35.31   |
| Baseline                               |    86.24     |    35.42      |    23.75      |      25.69       |       49.22        |   72.81    |      32.08       |      42.02       |  45.90   |
| Thinker                                |  **87.02** |   **35.62** |   **27.71** |    **36.08** |     **54.21** | **81.72** |    **33.23** |     **42.77** | **49.80**|
| Thinker-Fast                           |    77.21     |    11.46      |    11.46      |      30.08       |       40.39        |   59.22    |      29.23       |      41.37       |  37.55   |

---



## Installation

This project requires Python >=3.10.

### 1. System Prerequisites

Make sure the required system libraries are installed. On Debian-based systems (such as Ubuntu), install them with:

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg libsm6 libxext6
```


### 2. Project Setup

We recommend using a virtual environment (e.g., Python's `venv` or Conda).

Once you have cloned the repository and navigated into the main project directory (where `pyproject.toml` is located), activate your chosen virtual environment. Then, install the project and its dependencies:

```bash
pip install -e .
```

This command installs the `thinker_task` project in editable mode and pulls all required Python packages with their specific versions as defined in `pyproject.toml`.

### Required Package Versions

-   **Python:**  `>=3.10`
-   **Python Packages:** All specific versions for packages like `torch`, `deepspeed`, etc., are listed in the `pyproject.toml` file.
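
As a quick sanity check after installation, the following sketch reports whether the running interpreter meets the Python requirement and which versions of `torch` and `deepspeed` are installed. The helper names are hypothetical and the script is not part of the repository; `pyproject.toml` remains the authority on pinned versions.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

MIN_PYTHON = (3, 10)  # requirement stated in this README


def python_ok(version_info=sys.version_info) -> bool:
    """True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str) -> Optional[str]:
    """Installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("Python OK:", python_ok())
for name in ("torch", "deepspeed"):
    print(name, installed_version(name) or "not installed")
```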

### Download Base Model

We recommend downloading the base models R1.5B ([DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)) and Q1.5B ([Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B)) into the directory `large_data/base`, using the following commands:

```bash
python -c "from huggingface_hub import snapshot_download; print(snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', local_dir='large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'))"
python -c "from huggingface_hub import snapshot_download; print(snapshot_download('Qwen/Qwen2.5-Math-1.5B', local_dir='large_data/base/Qwen/Qwen2.5-Math-1.5B'))"
python script/add_token.py large_data/base/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```

The last command adds two special tokens, `<|im_start|>` and `<|im_end|>`, to R1.5B. Both tokens are already present in Q1.5B and mark the start and end of a prompt.
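
Before starting training, it can help to confirm that the downloads landed where the commands above put them. The sketch below checks each model directory for a `config.json` file. The helper name is hypothetical, and treating `config.json` as the marker of a complete download is an assumption for illustration.

```python
from pathlib import Path

BASE_DIR = Path("large_data/base")
# Sub-directories taken from the `local_dir` arguments in the commands above.
MODEL_DIRS = [
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "Qwen/Qwen2.5-Math-1.5B",
]


def missing_models(base_dir: Path = BASE_DIR, model_dirs=MODEL_DIRS) -> list[str]:
    """Return the model sub-directories that lack a config.json file."""
    return [m for m in model_dirs if not (base_dir / m / "config.json").is_file()]


missing = missing_models()
print("All base models found." if not missing else f"Missing: {missing}")
```

Run it from the main project directory so that the relative path `large_data/base` resolves correctly.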

### Start Thinker Agent Training

Single-node training of the R1.5B Thinker agent (replace `r1_5b` with `q1_5b` for the Q1.5B model):
```bash
python -m playground.thinker_r1_5b
```
Multi-node training:

First, on the master node, run:
```bash
ray start --head
```

Then, on each of the other nodes, run:
```bash
ray start --address='<master-node-ip>:<master-node-port>'
```

Finally, on the master node, run (adjust `NUM_NODE` as needed; both 2 and 4 should work fine):
```bash
NUM_NODE=4 python -m playground.thinker_r1_5b
```
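
Before launching a multi-node run, you may want to confirm that every node has joined the cluster. The sketch below is one way to do that with Ray's Python API (`ray.init`, `ray.nodes`, `ray.shutdown`). The helper names are hypothetical and the script is not part of this repository.

```python
def count_alive_nodes(nodes: list[dict]) -> int:
    """Count entries marked alive in the list returned by ray.nodes()."""
    return sum(1 for node in nodes if node.get("Alive"))


def cluster_ready(expected_nodes: int) -> bool:
    """Connect to the running cluster and compare its size to NUM_NODE."""
    import ray  # imported lazily so count_alive_nodes works without Ray

    ray.init(address="auto")  # attach to the cluster started with `ray start`
    try:
        return count_alive_nodes(ray.nodes()) >= expected_nodes
    finally:
        ray.shutdown()
```

For example, calling `cluster_ready(4)` on the master node returns `True` only once at least four nodes report as alive.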

## Data

The training data are sourced from [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data).

## Acknowledgements

- Our training framework is built on [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [vLLM](https://github.com/vllm-project/vllm), [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), and [Ray](https://github.com/ray-project/ray).
- Our models are based on [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) and [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).

