# FoVer

This repository includes code and materials for the paper "Generalizable Process Reward Models via Formally Verified Training Data".

## Introduction

Process reward models (PRMs), which provide step-by-step feedback on the reasoning generated by large language models (LLMs), are receiving increasing attention for their potential to enhance LLMs via reinforcement learning and inference-time refinement.

We propose FoVer, an approach for training PRMs on step-level error labels that are automatically annotated using formal verification tools (e.g., Z3, Isabelle). We introduce a dataset that includes automatically annotated step-level error labels on LLM responses for formal logic and proof tasks. We demonstrate that LLM-based PRMs trained on the FoVer dataset exhibit cross-task transfer of verification capabilities learned in formal logic and proof, leading to improved verification across a broad range of reasoning tasks, including mathematics, academic problems, logic, and abstract reasoning.

<div align="center"><img src="readme_figures/fover_overview.png" width="600"></div>


## Setup

Please refer to [setup/setup.sh](setup/setup.sh).

We ran our experiments in the following environment. You may need to modify the configuration if you use a different environment.

* Four NVIDIA A100 SXM4 80GB GPUs
* CUDA Version: 12.2

## FoVer Dataset

We provide the FoVer datasets that include the mistakes made by Llama 3.1 8B and Qwen 2.5 7B on formal logic and proof tasks.

### Dataset Format

Each instance of the FoVer datasets includes the following items.

* `problem` (str)
* `solution_steps` (list[str])
  * The solution steps generated by the model.
* `error_labels` (list[str])
  * The ground-truth error labels generated by the formal verification tools (Z3, Isabelle).
* `messages` (list[dict[str, str]])
  * The conversation we use for fine-tuning our PRMs.
* `messages_for_prediction` (list[dict[str, str]])
  * The conversation we use for prediction. The model outputs in this conversation are dummy values, all set to `correct`.
* `problem_witout_definition` (str)
  * The `problem` without task definition (metadata, not used in our experiments).
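
As a quick sanity check on the format above, the following is a minimal sketch of a validator for a single instance. The function name `check_instance` and the sample values are hypothetical and not part of this repository. It also assumes one error label per solution step, which the field list implies but does not state.

```python
from typing import Any

# Field names and container types as listed in the Dataset Format section.
EXPECTED_FIELDS = {
    "problem": str,
    "solution_steps": list,
    "error_labels": list,
    "messages": list,
    "messages_for_prediction": list,
    "problem_witout_definition": str,
}


def check_instance(instance: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the instance looks well-formed."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected_type):
            problems.append(f"wrong type for {name}: {type(instance[name]).__name__}")

    # Assumption (not stated in the README): exactly one label per solution step.
    steps = instance.get("solution_steps")
    labels = instance.get("error_labels")
    if isinstance(steps, list) and isinstance(labels, list) and len(steps) != len(labels):
        problems.append(f"{len(steps)} steps but {len(labels)} labels")
    return problems


# Placeholder instance for illustration only; the values are not taken from the dataset.
example = {
    "problem": "placeholder problem text",
    "solution_steps": ["placeholder step 1", "placeholder step 2"],
    "error_labels": ["correct", "correct"],
    "messages": [],  # left empty in this sketch
    "messages_for_prediction": [],  # left empty in this sketch
    "problem_witout_definition": "placeholder problem text",
}
print(check_instance(example))
```

Running a check like this over a downloaded split is a cheap way to catch loading or conversion mistakes before fine-tuning.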

### Dataset Statistics

<div align="center"><img src="readme_figures/fover_stats.png" width="600"></div>

### LastStepBalanced Dataset

We create the LastStepBalanced dataset to train PRMs on balanced data, in which the last step of the solution is correct in 50% of the instances and incorrect in the other 50%. Because we truncate solutions to balance the last step, we expect all steps except the last one to be masked when training the PRMs.

Specifically, we use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) with the option `mask_history: true`.
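
To make the idea of truncating solutions and balancing the last step concrete, here is a minimal sketch. It is not the procedure used to build the released dataset: the helper names `truncate_at` and `last_step_balanced`, the uniform random choice of the truncation point, and the label string `"incorrect"` are all assumptions made for illustration.

```python
import random


def truncate_at(instance: dict, last_index: int) -> dict:
    """Keep steps 0..last_index so that step last_index becomes the final step."""
    return {
        "problem": instance["problem"],
        "solution_steps": instance["solution_steps"][: last_index + 1],
        "error_labels": instance["error_labels"][: last_index + 1],
    }


def last_step_balanced(instances: list[dict], seed: int = 0) -> list[dict]:
    """Truncate each solution at one step, then keep equally many correct and incorrect last steps."""
    rng = random.Random(seed)
    correct_last, incorrect_last = [], []
    for inst in instances:
        labels = inst["error_labels"]
        if not labels:
            continue
        # Hypothetical choice: pick the new last step uniformly at random.
        idx = rng.randrange(len(labels))
        truncated = truncate_at(inst, idx)
        # Assumption: any label other than "correct" marks an incorrect step.
        if labels[idx] == "correct":
            correct_last.append(truncated)
        else:
            incorrect_last.append(truncated)

    # Downsample the larger group so both groups have the same size.
    n = min(len(correct_last), len(incorrect_last))
    balanced = rng.sample(correct_last, n) + rng.sample(incorrect_last, n)
    rng.shuffle(balanced)
    return balanced
```

Since only the last step of each truncated solution is balanced, the loss should cover only that step, which is what masking the earlier steps achieves.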

## Reproducing the Experiments in the Paper

You can refer to shell files in the [run](run) directory to reproduce the experiments in our paper.
