# <img src="figures/icon.png" width="30"/> *HAMMER*: Hamiltonian Curiosity Augmented Large Language Model Reinforcement



This is the official implementation of the paper [*HAMMER*: Hamiltonian Curiosity Augmented Large Language Model Reinforcement](README.md).

Recent curriculum reinforcement learning for large language models (LLMs) typically relies on difficulty-based annotations for data filtering and ordering.
However, such methods suffer from local optimization: continual training on simple samples in the early steps can cause the policy to lose its capacity for exploration.
We propose a novel scheme, namely
**Hamiltonian curiosity augmented large language model reinforcement (HAMMER)**,
that transfers diversity metrics, commonly used in dataset evaluation, into the dynamic reinforcement learning procedure. Training samples are ordered along a minimum-semantic-similarity Hamiltonian path,
so that the initial training stage retains more exploration.
From the theoretical perspective of generalization bounds, diversity-driven ordering facilitates stable convergence.
Empirical evaluations indicate that **HAMMER** stimulates model "curiosity" and consistently achieves a 3% to 4% average accuracy gain across diverse inference benchmarks.

<p align="center">
  <img src="figures/overview.png" width="1000"/>
</p>

<p align="center">
  <img src="figures/illustrations.png" width="500"/>
</p>

## Requirements

- Python
- [verl](https://github.com/volcengine/verl) framework
  
Please refer to https://github.com/volcengine/verl for more details.

## Build

```shell
git clone https://github.com/volcengine/verl
cd verl 
pip install -r requirements.txt
pip install -e .
```

## Dataset

All datasets are detailed in the table below.
We evaluate our method on four benchmark datasets for mathematical problem solving:
AIME 2024 (30 problems), AIME 2025 (30 problems),
AMC 2023 (83 problems), and
Olympiad (675 problems).
All models are trained on DeepScaleR (40,315 problems), which provides high-quality synthetic reasoning traces designed to enhance step-by-step mathematical reasoning.
In *HAMMER*, DeepScaleR is ordered by minimal similarity (the Hamiltonian Curiosity Order).
The embedding of $\mathcal{X}$ is computed by mean pooling.
Our experiments show that varying $\eta$ has little impact on overall performance, so we fix $\eta=3$.

| Dataset    | Size   | URL                                                                                             | Description                                                                 |
|------------|--------|-------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| **AIME 2024** | 30     | [link](https://huggingface.co/datasets/HuggingFaceH4/aime_2024)                               | The dataset consists of 30 problems from the 2024 AIME I&II tests.          |
| **AIME 2025** | 30     | [link](https://huggingface.co/datasets/opencompass/AIME2025)                                  | This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2025-I&II. |
| **AMC 2023**  | 83     | [link](https://huggingface.co/datasets/math-ai/amc23)                                         | All 83 problems come from the AMC 2023 competition.                         |
| **Olympiad**  | 675    | [link](https://huggingface.co/datasets/math-ai/olympiadbench)                                 | A challenging benchmark for promoting AGI with olympiad-level problems.     |
| **DeepScaleR** | 40315 | [link](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset)              | The dataset consists of ~40000 unique mathematics problem-answer pairs compiled from: AIME (1984-2023) and AMC (prior to 2023). |
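
The mean pooling mentioned above can be illustrated with a minimal sketch. It assumes token-level embeddings and an attention mask are already available as NumPy arrays; the function name `mean_pool` and the toy inputs are hypothetical and are not taken from the repository's scripts.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for sequences that are entirely padding.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy input (assumption): one sequence with two real tokens and one padding token.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)  # the padding row does not contribute
```

Masking before averaging keeps padding tokens from shifting the sample embedding, which matters when problems differ widely in length.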

You can use the scripts in ``HamiltonianCuriosityOrder`` to generate the Hamiltonian Curiosity Order of DeepScaleR. For more details, run:

```shell
python HamiltonianCuriosityOrder/sample_similarity_matrix.py --help
python HamiltonianCuriosityOrder/sample_reorder_hamilton.py --help
```
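
To give an intuition for what a minimal-similarity ordering looks like, here is a small sketch. Finding an exact minimum-weight Hamiltonian path is NP-hard, so this sketch substitutes a greedy heuristic: from the current sample, always move to the unvisited sample with the lowest cosine similarity. This is an assumption for illustration only; the scripts above may use a different algorithm, and the function names below are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def greedy_min_similarity_order(sim: np.ndarray, start: int = 0) -> list:
    """Greedy heuristic (not an exact solver): visit every sample once,
    always stepping to the least similar unvisited sample."""
    n = sim.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        # Visited samples get +inf so argmin never selects them again.
        candidates = np.where(visited, np.inf, sim[order[-1]])
        nxt = int(np.argmin(candidates))
        order.append(nxt)
        visited[nxt] = True
    return order


# Toy embeddings (assumption): two near-duplicate pairs pointing in different directions.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
order = greedy_min_similarity_order(cosine_similarity_matrix(embeddings))
```

On the toy data the path alternates between the two clusters, so consecutive training samples are semantically far apart. The greedy pass costs O(n²) time given the similarity matrix, which is one reason a heuristic is attractive at the scale of DeepScaleR.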


## Experiments

Each experimental section corresponds to a code file in ``experiments``.


| 🔬 Experiment | ✍ Code |
|--|--|
| Training | Training Demo:<br> (1) ``RLVR/qwen3-1.7b-DeepScaleR-AIME24-baseline-dapo.py`` <br> (2) ``RLVR/qwen3-1.7b-DeepScaleR-AIME24-hammer-dapo.py`` |
| Main Experiment | (1) ``plot/MainExperiment/main_results.py`` computes the metrics for the main experiments.<br> (2) ``plot/MainExperiment/metric_pareto_calc.py`` computes the results for the metric distribution. <br> (3) ``plot/MainExperiment/metric_pareto_plot.py`` plots the results for the metric distribution. |
| Ablation Study | (1) ``plot/Abalations/batchsize.py`` plots the batch size ablations. <br> (2) ``plot/Abalations/E2H.py`` plots the "easy-to-hard" and "hard-to-easy" figures. <br> (3) ``plot/Abalations/MaxSimilarity.py`` plots the max similarity ablation figures. <br> (4) ``plot/Abalations/utils.py`` parses logs from verl, for environments where ``wandb`` is not available.|
| Training Dynamics | ``plot/TrainingDynamic/qwen3-1.7b-dapo.py`` plots the training dynamics of qwen3-1.7b-dapo. |


**Baselines**: For the main experiments, we use DAPO and GRPO trained on randomly shuffled samples as baselines, including Qwen3-1.7B-DAPO, Qwen3-1.7B-GRPO, Qwen3-4B-DAPO, and Qwen3-4B-GRPO.

**Training**: In our experiments, we adopt Qwen3-1.7B and Qwen3-4B as the backbone models and train them with GRPO and DAPO using verl.
Models for the main experiments were trained with a batch size of 16 (the mini-batch size is also 16). The maximum prompt length is set to 1024 tokens, and the maximum response length is 8192 tokens.
For the training hyperparameters, the learning rate is fixed at $1\times 10^{-6}$ with no warmup steps. For GRPO, we adopt KL regularization (coefficient $\beta=0.001$). For DAPO, we set $\varepsilon_{\text{low}}=0.2$ and $\varepsilon_{\text{high}}=0.28$, use the token-level policy gradient loss, and dynamically filter samples by accuracy during training.
Each training step generates 16 rollouts, while validation (in the training dynamics experiments) uses 8 rollouts. The rollout temperature is 1.2, and the validation temperature is 0.6.
For the reward, if the $i$-th rollout passes verification, it is assigned a reward $r_i = 1$; otherwise, it receives $r_i = 0$.
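
The reward and the DAPO-specific settings above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the verl implementation: the function names are hypothetical, "filter by accuracy" is taken to mean dropping prompts whose rollouts are all correct or all wrong, and the advantage uses the usual group-normalized form.

```python
import numpy as np

EPS_LOW, EPS_HIGH = 0.2, 0.28  # clip range from the training setup above


def binary_rewards(passed) -> np.ndarray:
    """r_i = 1 if the i-th rollout passes verification, else 0."""
    return np.asarray(passed, dtype=np.float64)


def keep_prompt(rewards: np.ndarray) -> bool:
    """Dynamic filtering (assumed form): keep a prompt only if its rollout
    accuracy is strictly between 0 and 1, so the group carries a signal."""
    return bool(0.0 < rewards.mean() < 1.0)


def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: normalize rewards within one prompt's rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def clipped_token_loss(ratio: np.ndarray, advantage: np.ndarray) -> float:
    """Clipped surrogate with an asymmetric clip range, averaged over all tokens.

    ratio and advantage are flat arrays with one entry per generated token.
    """
    clipped = np.clip(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    return float(-np.minimum(ratio * advantage, clipped * advantage).mean())


# Toy group of four rollouts for one prompt (assumption).
rewards = binary_rewards([True, False, False, True])
advantages = group_advantages(rewards)
loss = clipped_token_loss(np.array([1.5, 0.5]), np.array([1.0, -1.0]))
```

With a binary reward, a group that is entirely correct or entirely wrong has zero advantage for every rollout, which is why filtering such prompts does not discard any gradient signal.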

**Evaluation**:
To evaluate LLM performance, we set the temperature to $1.2$, top-$p$ to $0.95$, and top-$k$ to $20$, with a context length of 8192. For AIME 2024, AIME 2025, and AMC 2023, we sample $1$, $10$, and $100$ responses, repeat each setting 10 times, and report the average $pass@k$ $(k \in \lbrace 1,10,100 \rbrace)$ and $cons@100$, measuring solution accuracy and response consistency. For Olympiad, due to its larger size, we evaluate only $pass@1$, $pass@10$, $pass@32$, and $cons@32$, which are sufficient for reliable estimation.
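
As a rough illustration of the two metrics, the sketch below computes $pass@k$ and $cons@k$ for one problem and averages them over repeated trials. It assumes that answers have already been extracted as strings and that exact string equality stands in for verification; the function names are hypothetical and this is not the code in ``plot/MainExperiment/main_results.py``.

```python
from collections import Counter


def pass_at_k(answers, reference) -> float:
    """1.0 if at least one of the k sampled answers matches the reference."""
    return float(any(a == reference for a in answers))


def cons_at_k(answers, reference) -> float:
    """1.0 if the majority-vote answer matches the reference.

    Ties are broken by first occurrence, following Counter.most_common.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return float(majority == reference)


def average_over_trials(metric, trials, reference) -> float:
    """Average a per-trial metric over repeated samplings of k answers."""
    return sum(metric(t, reference) for t in trials) / len(trials)


# Toy data (assumption): two trials with k = 3 answers each, reference answer "42".
trials = [["42", "41", "42"], ["41", "41", "40"]]
avg_pass = average_over_trials(pass_at_k, trials, "42")
avg_cons = average_over_trials(cons_at_k, trials, "42")
```

Here $pass@k$ rewards any correct sample, while $cons@k$ only credits the problem when the correct answer is also the most frequent one, so the gap between them reflects how consistently the model reaches the right answer.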

## Main Experiment Results

<p align="center">
  <img src="figures/main_results.png" width="1000"/>
</p>

