# Self-Verifying Reflection Helps Transformers with CoT Reasoning

This is the source code for our paper accepted by NeurIPS 2025: **Self-Verifying Reflection Helps Transformers with CoT Reasoning**.

- **Main Contributor:** Zhongwei Yu (zyu950@connect.hkust-gz.edu.cn).
- **Other Contributors:** Wannian Xia (xiawannian2020@ia.ac.cn), Xue Yan (yanxue2021@ia.ac.cn).

Paper Links:
- [OpenReview](https://openreview.net/forum?id=4MvqmXnCEr) (not camera-ready)
- ArXiv (Coming Soon)
- NeurIPS 2025 Proceedings (Coming Soon)

## Requirements

This project requires Python >= 3.12. All dependencies are listed in `requirements.txt`. To quickly install all required packages, run:

```
pip install -r requirements.txt
```

**Note**:

- Install `litgpt` at exactly the version pinned in the requirements (0.4.12).
- We recommend installing PyTorch manually beforehand; otherwise, `pip` may automatically install a CPU-only version.

## Experiment Commands

`run.py` serves as the main entry point for various experiments. You can use any supported command `CMD` with an argument list `ARGS` as follows:

```
python run.py CMD ARGS
```

To list all available commands, including those for training and evaluation, use:

```
python run.py help
```

You can use `--help` or `-h` to view help for a specific command. For example, `python run.py pretrain -h` displays the arguments and help information for the `pretrain` command.

## Configure Experiments

Most configurations (e.g., model architectures, hyperparameters, testing arguments) are defined in the `configs.Config` class. Each `Config` instance supports running the entire pipeline and is identified by a keyword name.

We provide `Config` instances for Sudoku and Mult in `configs/sudoku.py` and `configs/mult.py`, respectively. You can modify the existing values in these files to customize your experiments. The supported configuration names are:

- mult-1m (*using reduced thought states*)
- mult-1m-direct (*direct reasoning without intermediate thought*)
- mult-1m-concat (*using complete thought states that concatenate all historical steps*)
- mult-4m
- mult-4m-direct
- mult-4m-concat
- mult-16m
- mult-16m-direct
- mult-16m-concat
- sudoku-1m
- sudoku-1m-direct
- sudoku-1m-concat
- sudoku-4m
- sudoku-4m-direct
- sudoku-4m-concat
- sudoku-16m
- sudoku-16m-direct
- sudoku-16m-concat

To create your own configurations, instantiate a new object using `Config(name, ...)` and import the file containing the instantiation. If you only want to make minor changes to an existing configuration, you can use `old.derive(new_name)` to deep-copy an old configuration (except for its name) and then modify its attributes.
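
The derive-then-modify pattern described above can be sketched as follows. This is a minimal illustration, not the repository's code: the `Config` class below is a hypothetical stand-in with only two fields (the real `configs.Config` has many more and may register instances differently), the default values are assumptions, and the name `sudoku-4m-binary` is made up for the example.

```python
import copy
from dataclasses import dataclass


# Hypothetical stand-in for `configs.Config`; field defaults are assumptions.
@dataclass
class Config:
    name: str
    reflection_label: str = "detailed"
    reflection_frequency: float = 1.0

    def derive(self, new_name: str) -> "Config":
        """Deep-copy this configuration and give the copy a new name."""
        new = copy.deepcopy(self)
        new.name = new_name
        return new


base = Config("sudoku-4m")
# "sudoku-4m-binary" is a hypothetical configuration name.
binary = base.derive("sudoku-4m-binary")
binary.reflection_label = "binary"

# The original configuration is unaffected by changes to the derived copy.
print(base.name, base.reflection_label)
print(binary.name, binary.reflection_label)
```

Because the copy is deep, changing attributes on the derived configuration leaves the original untouched, so both can be used in the same session.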

## Experiment Pipelines

> **Important Note**: Because the models are very small, we have not yet debugged the multi-GPU / multi-node setting (although the interface is reserved). You are _very likely to encounter unknown problems_ when running the pipeline on multiple GPUs. To use a single GPU, set the environment variable `CUDA_VISIBLE_DEVICES=x` or pass `--devices 1`. We are working on a multi-GPU / multi-node implementation to scale up experiments.

### Non-Reflective Training and Testing

Using `sudoku-4m` as an example, you can train the model using the following pipeline, which includes data preparation, pretraining, SFT, and GRPO:

```
python run.py prepare sudoku-4m
python run.py pretrain sudoku-4m
python run.py sft sudoku-4m
python run.py rl sudoku-4m grpo --startpoint sft/final -o <GRPO-DIR>

# Evaluate SFT results
python run.py eval sudoku-4m sft/final --alg sampling-no-reflect --split test --verbose 2

# Evaluate RL results. You can use the `--litckpt-name` argument to test the best checkpoint.
python run.py eval sudoku-4m <GRPO-DIR> --alg sampling-no-reflect --split test --verbose 2 [--litckpt-name "epoch=xxx"]
```

If successful, all training outputs will be saved in `out/sudoku-4m`, including checkpoints `pretrain/final`, `sft/final`, and `grpo`.
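
If you prefer to drive the pipeline from a script, the commands above can be assembled programmatically. The sketch below only builds and prints the argument lists; `build_pipeline` is a hypothetical helper (not part of the repository), and `my-grpo` is a placeholder for `<GRPO-DIR>`.

```python
import sys


def build_pipeline(config: str, grpo_dir: str) -> list[list[str]]:
    """Return the argv lists for the non-reflective pipeline, in order."""
    run = [sys.executable, "run.py"]
    eval_args = ["--alg", "sampling-no-reflect", "--split", "test", "--verbose", "2"]
    return [
        run + ["prepare", config],
        run + ["pretrain", config],
        run + ["sft", config],
        run + ["rl", config, "grpo", "--startpoint", "sft/final", "-o", grpo_dir],
        run + ["eval", config, "sft/final"] + eval_args,
        run + ["eval", config, grpo_dir] + eval_args,
    ]


# "my-grpo" is a placeholder output directory name.
for cmd in build_pipeline("sudoku-4m", "my-grpo"):
    print(" ".join(cmd))
```

To actually execute the steps, pass each list to `subprocess.run(cmd, check=True)` from the repository root, so that a failing stage stops the pipeline.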

### Reflective Training and Testing

To collect data for reflective SFT, run:

```
python run.py refldata sudoku-4m sft/final [--verbose 2]
```

Here, `--verbose` specifies the verbosity of logging, with `2` printing all collected samples. This stage only collects reasoning steps without generating any ground-truth labels. If successful, the collected samples will be saved in `out/sft/final/reflection_data`.

Ground-truth labels are generated automatically during reflective SFT. To specify the type of self-verification (binary or detailed) and the reflection frequency, modify the following values in `configs/sudoku.py`. If the frequency is not `1`, samples with empty verifications are added.

- **Detailed verification**:
    - `reflection_label="detailed"`
    - `reflection_frequency=1`
- **Optional detailed verification**:
    - `reflection_label="detailed"`
    - `reflection_frequency=0.5`
- **Binary verification**:
    - `reflection_label="binary"`
    - `reflection_frequency=1`

Then, run the following commands to perform reflective SFT and RL. If successful, the outputs will be saved in `out/sudoku-4m/sft+refl/final` and `out/sudoku-4m/<GRPO-DIR>`.

```
python run.py reflsft sudoku-4m sft/final
python run.py rl sudoku-4m grpo --startpoint sft+refl/final --reflective -o <GRPO-DIR>
```

To evaluate `sft+refl/final` or `<GRPO-DIR>`, run:

```
python run.py eval sudoku-4m <DIR> --alg sampling-no-reflect --split test --verbose 2
python run.py eval sudoku-4m <DIR> --alg sampling-self-reflect --split test --verbose 2
python run.py eval sudoku-4m <DIR> --alg sampling-self-reflect-traceback --split test --verbose 2
```

The `--alg` argument specifies the test-time execution mode: `no-reflect` means non-reflective execution, `self-reflect` means RMTP execution, and `self-reflect-traceback` means RTBS execution. The `sampling` prefix means that steps are inferred with a temperature of 1 during the first attempt on each thought state. Replace `sampling` with `greedy` (e.g., `greedy-self-reflect`) to use a temperature of 0 for the first attempt, which is preferred for Mult.
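
The naming scheme above can be summarized as a small lookup. This is an illustrative sketch of how the `--alg` names decompose, assuming the prefix and mode are joined by the first hyphen; `parse_alg` is a hypothetical helper and does not reflect how `run.py` parses the argument internally.

```python
def parse_alg(alg: str) -> tuple[float, str]:
    """Split an --alg value into (first-attempt temperature, execution mode)."""
    # The decoding prefix is everything before the first hyphen.
    decoding, _, mode = alg.partition("-")
    temperatures = {"sampling": 1.0, "greedy": 0.0}
    modes = {
        "no-reflect": "non-reflective",
        "self-reflect": "RMTP",
        "self-reflect-traceback": "RTBS",
    }
    if decoding not in temperatures or mode not in modes:
        raise ValueError(f"unsupported --alg value: {alg!r}")
    return temperatures[decoding], modes[mode]


for name in ("sampling-no-reflect", "sampling-self-reflect", "greedy-self-reflect-traceback"):
    print(name, parse_alg(name))
```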

## Tabulate Results

Use `fig.py` to tabulate evaluation results:

```
python fig.py table sudoku
python fig.py table mult
```

This searches for all evaluation results (files with the suffix `.eval.json`).
You can use `--include` and `--exclude` to filter files by checking whether their paths include or exclude certain keywords. Multiple keywords are separated by commas. For example, `--include sft,self-reflect --exclude traceback` presents all results tested in RMTP execution after reflective SFT.
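
The keyword filtering can be pictured with the sketch below. It assumes that a path must contain every `--include` keyword and none of the `--exclude` keywords, which is consistent with the example above but is not confirmed against `fig.py`; the `matches` helper and the file paths are hypothetical.

```python
def matches(path: str, include: str = "", exclude: str = "") -> bool:
    """Check a result path against comma-separated include/exclude keywords."""
    inc = [k for k in include.split(",") if k]
    exc = [k for k in exclude.split(",") if k]
    # Assumption: all include keywords must appear, and no exclude keyword may appear.
    return all(k in path for k in inc) and not any(k in path for k in exc)


# Hypothetical result paths; the real file layout may differ.
paths = [
    "out/sudoku-4m/sft/final/sampling-no-reflect.eval.json",
    "out/sudoku-4m/sft+refl/final/sampling-self-reflect.eval.json",
    "out/sudoku-4m/sft+refl/final/sampling-self-reflect-traceback.eval.json",
]
selected = [p for p in paths if matches(p, include="sft,self-reflect", exclude="traceback")]
print(selected)
```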

