### From \<Answer\> to \<Think\>: Modeling the Reasoning Process via Multidimensional Assessment

This repository contains the code for the paper "From \<Answer\> to \<Think\>: Modeling the Reasoning Process via Multidimensional Assessment", submitted to ICLR 2026.

Place your models in `models/` and your benchmarks in `data/benchmark/`.

The code covers the basic use of reasoning modeling via multidimensional assessment. You can extend it with additional models, benchmarks, and assessment dimensions as you like!

### DPO

You can install the dependencies with `pip install -r requirements.txt`, though some of them may have to be installed manually.

#### Quick Start

First: `cd DPO` :)

```bash
cd script
bash step1_get_all_dataset.sh
```

This creates the inputs from which the LLM generates its reasoning and a final answer.

We use various generation frameworks (vLLM, SGLang, Transformers, etc.), so inference code is not included here. Run generation with whichever framework you prefer; it is a simple process. The result should be a `.jsonl` file in which each line is a JSON object of the form `{"text": "..."}`, placed at `works/DPO/output/example_dataset/example_model/output_0.jsonl`. We also provide an example output for reference.
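
As a minimal sketch of the expected output format, the snippet below writes one JSON object per line with a single `"text"` field and reads it back with a basic shape check. The helper names `write_outputs` and `read_outputs` are hypothetical and not part of this repo; it also assumes one generation per line, in the same order as the inputs.

```python
import json
from pathlib import Path


def write_outputs(texts, path):
    """Write one JSON object per line, each with a single "text" field."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_outputs(path):
    """Read the file back, checking that every line is an object with "text"."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict) or "text" not in record:
                raise ValueError(f"line {line_no}: expected an object with a 'text' field")
            records.append(record)
    return records
```

For example, `write_outputs(generations, "works/DPO/output/example_dataset/example_model/output_0.jsonl")` would produce a file in the layout described above, where `generations` is a list of strings returned by your framework.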

Once you get output collected, just `python reasoning_gen_data_judger.py`, which will yield inputs for multidimensional assessment.

We offer several score functions:

```bash
# Confidence
cd script
# Gets all log probabilities
bash score_confidence.sh
cd ..
# Calculate confidence score
python logp_handler.py

# Relevance
cd script
bash score_relevance.sh
cd ..

# Coherence
cd script
# Start the SGLang server for acceleration
bash score_coherence_sglangserver.sh &
bash score_coherence.sh
```
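
The actual confidence score is defined in `logp_handler.py`. Purely as an illustration of how per-token log probabilities can be reduced to a single number, here is a sketch that uses the length-normalized (geometric mean) token probability. Both the function name and the choice of formula are assumptions and may differ from what the repo computes.

```python
import math


def mean_logprob_confidence(token_logprobs):
    """Geometric-mean token probability: exp of the average log probability.

    Assumption: `token_logprobs` holds the natural-log probability of each
    generated token. The result lies in (0, 1]; higher means more confident.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

Normalizing by length keeps long reasoning chains from being penalized simply for containing more tokens, which is one common reason to prefer the mean over the raw sum.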

Once all scores have been judged, run the following to calculate the dimension scores:

```bash
cd script
bash step2_cal_dimension_score.sh

# Construct the off-policy training set. Remember to edit the score names as in our example.
bash step3_get_rl_data.sh

# Run training! Remember to edit the score names as in our example, e.g., RLVR@T+F from the paper.
bash step4_train_DPO.sh
```
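
To give a rough idea of what constructing an off-policy preference set from scores can look like, the sketch below groups scored responses by prompt and pairs the highest-scoring one (chosen) with the lowest-scoring one (rejected). This is not the logic of `step3_get_rl_data.sh`; the field names `prompt`, `response`, and `score`, the `min_margin` parameter, and the best-versus-worst pairing rule are all assumptions made for illustration.

```python
from collections import defaultdict


def build_preference_pairs(samples, min_margin=0.0):
    """Build (prompt, chosen, rejected) records from scored responses.

    Assumption: each sample is a dict with "prompt", "response", and a scalar
    "score". Prompts whose best and worst scores differ by no more than
    `min_margin` are skipped, since they carry no clear preference signal.
    """
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["prompt"]].append(sample)

    pairs = []
    for prompt, candidates in grouped.items():
        best = max(candidates, key=lambda s: s["score"])
        worst = min(candidates, key=lambda s: s["score"])
        if best["score"] - worst["score"] > min_margin:
            pairs.append(
                {
                    "prompt": prompt,
                    "chosen": best["response"],
                    "rejected": worst["response"],
                }
            )
    return pairs
```

A combined score such as a weighted sum over several dimensions could be placed in `score` before pairing; how the dimensions are actually combined is determined by the score names you set in the scripts.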

Once training has finished, you can use your generation framework to test the model's performance. Our evaluation functions are in `tool.py` as `compute_em` and `compute_f1`. Have fun!
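
For readers unfamiliar with these metrics, here is a minimal sketch of exact match and token-level F1 in the common SQuAD style (lowercasing, stripping punctuation and English articles before comparing). The names `exact_match` and `token_f1` are hypothetical, and the normalization rules are an assumption; `compute_em` and `compute_f1` in `tool.py` may behave differently.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only fully correct answers, while token F1 gives partial credit when the prediction overlaps with the reference, so reporting both gives a fuller picture.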

### GRPO

First: `cd GRPOframework`

This is the open-source framework mentioned in our paper, so it is somewhat heavier than the pip-based DPO framework. It requires Megatron to be installed and MPI support.

The training is very simple:

```bash
# llama3 shown as an example; other options: dsllama, qwen3
bash ./tasks/math_rl_v3/llama3/mpirun-grpo.sh
```

Before running it, complete the following steps:

- set `metadata` in each training script directory to point to your training file
- add special tokens to your training file using the trained model's tokenizer
- convert HF models to MLM models
  ```bash
  bash ./tasks/math_rl_v3/llama3/conver_ckpt_actor.sh hf_to_mlm
  ```
- fill in the TODOs in `./tasks/math_rl_v3/llama3/grpo.sh`
- for reasoning modeling, start the relevance and coherence SGLang backends and update the IP in `./tasks/math_rl_v3/rule_critic_model`
- recommended: run this script on a multiple of 4 machines with 8 GPUs each

Remember to convert the MLM checkpoints back to HF models for inference once training has finished.

```bash
bash ./tasks/math_rl_v3/llama3/conver_ckpt_actor.sh mlm_to_hf $modelname $steps
```

Have fun! :)
