#### Overview

This repository contains the official implementation of AdapThink. This guide covers setting up the environment, preparing the data, running the training and evaluation scripts, and interpreting the results.


#### License
This project is licensed under the Apache License, Version 2.0. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

#### Constants
The `global_const.py` file defines project-wide constants:
- **DTYPE_CLASS**: A dictionary mapping data type names to PyTorch data types. Supported types include `float16`, `bfloat16`, `float32`, and `auto`.
- **CKPT_FOLDER**: The folder where model checkpoints are stored, set to `../output`.
- **BASE_MODELS**: A list of base models supported by the project, including various versions of Llama, ChatGLM, Galactica, Mistral, Qwen, etc.
- **MODEL_SIZE**: A dictionary mapping base model names to lists of specific model sizes.
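
As a rough illustration of how `BASE_MODELS` and `MODEL_SIZE` relate, the sketch below validates a model/size pair before training. The entries and the `is_supported_model` helper are hypothetical and are not taken from `global_const.py`; the real file may list different names and sizes.

```python
# Hypothetical subset of the constants; the real values live in global_const.py.
BASE_MODELS = ["llama", "qwen"]
MODEL_SIZE = {
    "llama": ["7b", "13b"],
    "qwen": ["1.5b", "7b"],
}


def is_supported_model(base_model: str, size: str) -> bool:
    """Return True if the base model is known and the size is listed for it."""
    if base_model not in BASE_MODELS:
        return False
    # Unknown models fall back to an empty list, so the lookup never raises.
    return size in MODEL_SIZE.get(base_model, [])


if __name__ == "__main__":
    print(is_supported_model("qwen", "7b"))
    print(is_supported_model("qwen", "70b"))
```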


#### Alternative Rewards and Metrics
| Reward Function Name | Function | Reward Type |
| --- | --- | --- |
| math_accuracy_reward | Checks whether the completed answer matches the true answer using the `math-verify` library. Returns a reward of 1 if they match, otherwise 0. | Outcome Reward |
| think_ans_format_reward | Checks whether the completed answer follows the format `<think>...</think><answer>...</answer>`. Returns a reward of 1 if it does, otherwise 0. | Process Reward |
| adaptive_reasoning_control_reward | An adaptive reasoning control reward that combines length, process switching, and depth control, adjusting the reward strategy based on the overall statistic. | Process Reward |
| CosFn_reward | Cosine reward function based on completion correctness and length. | Process Reward |
| LCPO_max_reward | Reward function that implements the L1-Max variant from Length Controlled Policy Optimization. | Process Reward |
| TLB_reward | Reward function that implements Token Length Budget (TLB) based calibration. | Process Reward |
| group_average_sequential_reward | Calculates the average number of sequential pattern occurrences in a group of completion texts. | Process Metric |
| group_average_depth_reward | Calculates the average number of depth-related keywords (e.g., "wait", "check") in a group of completion texts. | Process Metric |
| group_average_output_reward | Calculates the average number of output-related keywords (e.g., "**final answer**") in a group of completion texts, grouped by a specified number of generations. | Process Metric |
| length_group_diversity | Computes the length-based group diversity of completion texts, using entropy to measure the diversity of text lengths grouped into different ranges. | Process Metric |
| process_group_depth_diversity | Computes the depth-based group diversity of completion texts by counting the occurrences of transition words related to depth and grouping them into ranges, then calculating entropy. | Process Metric |
| process_group_switch_diversity | Computes the switch-based group diversity of completion texts by counting the occurrences of transition words related to switching and grouping them into ranges, then calculating entropy. | Process Metric |
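
To make two of the table entries more concrete, here is a minimal sketch of a format reward and a length-based diversity metric. It is not the repository's implementation: the regular expression, the `bin_width` parameter, and the use of whitespace-split word counts in place of tokenizer lengths are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumed layout: a <think> block followed by an <answer> block, nothing else.
_FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)


def think_ans_format_reward(completions):
    """Return 1.0 for each completion that matches the think/answer layout, else 0.0."""
    return [1.0 if _FORMAT_RE.match(text.strip()) else 0.0 for text in completions]


def length_group_diversity(completions, bin_width=100):
    """Shannon entropy (in nats) of completion lengths bucketed into fixed-width ranges."""
    if not completions:
        return 0.0
    # Word counts stand in for token counts here (assumption).
    bins = Counter(len(text.split()) // bin_width for text in completions)
    total = sum(bins.values())
    return -sum((n / total) * math.log(n / total) for n in bins.values())


if __name__ == "__main__":
    group = ["<think>2 + 2 = 4</think><answer>4</answer>", "The answer is 4"]
    print(think_ans_format_reward(group))
    print(length_group_diversity(group, bin_width=2))
```

An entropy of zero means every completion in the group falls into the same length range; higher values indicate that lengths are spread more evenly across ranges.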



#### Datasets
We conduct experiments on a curated, lightweight mathematics dataset that spans various difficulty levels. The dataset is drawn from the *DeepScaleR-Preview-Dataset* [1] and contains about 5K question-answer pairs sampled from AIME (1984-2023), AMC (prior to 2023), and MATH training sets.

#### Getting Started

##### Installation
```bash
# Create and activate a Python 3.10 environment.
conda create -n adapthink python=3.10 -y
conda activate adapthink

# Install the AdapThink dependencies.
cd adapthink
pip install -r requirement.txt
```

##### Training Scripts
We provide a training script for the AdapThink model, `adapthink_train.sh`:

```bash
bash adapthink_train.sh
```

#### Evaluation
Our evaluation script automatically launches multiple vLLM replicas. To run it:
```bash
bash adapthink_test.sh
```

#### References
[1] [DeepScaleR-Preview-Dataset](https://github.com/agentica-project/rllm/tree/main)
