
# K&K Data Generation, Fine-Tuning, and Evaluation

## 1. Generate Synthetic Data

To generate K&K data for puzzles with 2 to 8 people, split into train/test/val sets, run:

```bash
python data_gen_kk.py
```

The script also generates locally perturbed data and wrong-answer/wrong-CoT data.
All generated data is stored in the `data/` directory.
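As a quick sanity check after generation, the contents of `data/` can be listed. This is a minimal sketch: the helper name `summarize_data_dir` is hypothetical, and it makes no assumption about the file names or formats that `data_gen_kk.py` actually writes.

```python
from pathlib import Path


def summarize_data_dir(root="data"):
    """Return {relative file path: size in bytes} for every file under root.

    Hypothetical helper: it only walks the directory and does not parse
    the generated files, since their format is not specified here.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    return {
        p.relative_to(root_path).as_posix(): p.stat().st_size
        for p in sorted(root_path.rglob("*"))
        if p.is_file()
    }


# Print one line per generated file; prints nothing if data/ does not exist yet.
for name, size in summarize_data_dir().items():
    print(f"{size:>12,d}  {name}")
```

An empty listing most likely means the generation script has not been run from this directory.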


## 2. Fine-Tuning

Fine-tune the model for `10` epochs, saving a checkpoint at every `0.2` fraction of the total training steps (i.e., after the 2nd, 4th, 6th, 8th, and 10th epochs).

You can train for more epochs to reach higher accuracy on the training samples.
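The checkpoint schedule can be worked out with a little arithmetic. The sketch below is illustrative only: the dataset size and batch size are made-up numbers, the function name is hypothetical, and it ignores gradient accumulation, which the fine-tuning scripts may or may not use.

```python
import math
from fractions import Fraction


def checkpoint_steps(num_samples, batch_size, epochs=10, save_ratio=0.2):
    """Return (step, epoch) pairs at which a checkpoint would be saved.

    Assumes one optimizer step per batch. The ratio is converted to an
    exact fraction so that 0.2 of 1250 steps is exactly 250, not 250.0000001.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    interval = max(1, math.ceil(total_steps * Fraction(str(save_ratio))))
    return [
        (step, step / steps_per_epoch)
        for step in range(interval, total_steps + 1, interval)
    ]


# Hypothetical setup: 1000 training samples, batch size 8.
for step, epoch in checkpoint_steps(num_samples=1000, batch_size=8):
    print(f"checkpoint at step {step} (epoch {epoch:g})")
```

With these illustrative numbers there are 125 steps per epoch and 1250 steps in total, so a save interval of 250 steps lands on epochs 2, 4, 6, 8, and 10.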


### Fine-Tuning Without CoT (Chain of Thought)

To fine-tune the model without CoT, run:

```bash
bash scripts/ft/ft_lm3.sh
```

### Fine-Tuning With CoT

To fine-tune the model with CoT, run:

```bash
bash scripts/ft/ft_lm3_cot.sh
```

To change where the model is saved, edit `output_dir` in the scripts above.

## 3. Merge Fine-Tuned Adapter and Base Model

Load the saved adapter from step 2 and the base model, then save the merged model by running:

```bash
bash scripts/ft/merge_adapter.sh
```

Make sure to change the model paths in the script as needed:

```bash
base_model_path="meta-llama/Meta-Llama-3-8B"  # Base model path
target_model_path=""  # Merged model save path
adapter_path=""  # Adapter path from fine-tuning
```
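For orientation, a merge step of this kind is often written with the Hugging Face `peft` and `transformers` libraries. The sketch below assumes the adapter is a LoRA-style PEFT adapter; it is not necessarily what `merge_adapter.sh` does internally, and the function name `merge_adapter` is hypothetical.

```python
def merge_adapter(base_model_path, adapter_path, target_model_path):
    """Merge a PEFT adapter into its base model and save the merged model.

    Assumption: the adapter was saved in PEFT format (e.g. LoRA).
    """
    # Imported lazily so this file can be loaded without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_path)
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()  # fold adapter weights into the base weights
    merged.save_pretrained(target_model_path)
    # Save the tokenizer alongside so the merged directory is self-contained.
    AutoTokenizer.from_pretrained(base_model_path).save_pretrained(target_model_path)
    return target_model_path


# Example call (paths are placeholders to fill in):
# merge_adapter("meta-llama/Meta-Llama-3-8B", "<adapter_path>", "<target_model_path>")
```

The three arguments correspond to `base_model_path`, `adapter_path`, and `target_model_path` in the script.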

## 4. Evaluation

### General Evaluation Parameters

- **Max Tokens `max_token`:** 2048. You can use a larger value.
- **Sample Limit `limit`:** The number of samples to evaluate. We used 100.
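To illustrate how these two parameters typically interact, here is a minimal sketch of an accuracy computation over the first `limit` samples. The sample fields (`question`, `answer`), the `predict` callable, and the exact-match scoring are all assumptions for illustration; the evaluation scripts may score answers differently.

```python
def accuracy_on_subset(samples, predict, limit=100, max_token=2048):
    """Exact-match accuracy over the first `limit` samples.

    `predict(question, max_token)` is a hypothetical stand-in for a model
    call that returns the generated answer as a string.
    """
    subset = samples[:limit]
    if not subset:
        return 0.0
    correct = sum(
        1
        for s in subset
        if predict(s["question"], max_token).strip() == s["answer"].strip()
    )
    return correct / len(subset)


# Toy usage with a stand-in predictor that always answers "A".
toy_samples = [
    {"question": "q1", "answer": "A"},
    {"question": "q2", "answer": "B"},
]
print(accuracy_on_subset(toy_samples, lambda q, max_token: "A", limit=2))
```

A larger `limit` gives a less noisy accuracy estimate at the cost of more generation time.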

### Evaluation on Test Samples

Evaluate on test samples in the 0-shot and 1-shot settings, with and without CoT:

```bash
bash scripts/eval/run_test.sh
```

### Evaluation on Math-Perturbed Test Samples

Evaluate in the 0-shot setting without CoT, using two math perturbation methods:

```bash
bash scripts/eval/run_test_pertub.sh
```


### Evaluation on Original Training Samples

#### 0-Shot & With/Without CoT

```bash
bash scripts/eval/run_train-0shot.sh
```

#### 1-Shot & With/Without CoT

```bash
bash scripts/eval/run_train-1shot.sh
```

### Evaluation on Perturbed Training Samples

Evaluate in the 0-shot setting without CoT, using all perturbation methods:

```bash
bash scripts/eval/run_train_pertub.sh
```

### Evaluation on Closed-Source Models

Provide your API keys, then run:

```bash
bash scripts/eval/gpt4omini_cot.sh
bash scripts/eval/gpt4omini_direct.sh
bash scripts/eval/claude-sonet.sh
```
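Before launching these scripts, it can help to confirm that the keys are visible to the process. The sketch below assumes the keys are read from the environment variables `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, which are the defaults used by the official SDKs; the scripts here may expect the keys elsewhere, so adjust the names as needed.

```python
import os

# Assumed variable names (SDK defaults); not confirmed by the scripts above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```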


## 5. Probe

Update the model paths and the number of people in the puzzles in the script, then run:

```bash
bash scripts/probe/run.sh
```


## 6. Classification of Consistently Solved vs. Non-Consistently Solved Puzzles

Update the model paths and provide a label (consistently solved or not consistently solved) for each training sample, then run the following.

Puzzle-based indicators:

```bash
bash scripts/mem_classify/puzzle_indicator.sh
```

Model-based indicators:

```bash
bash scripts/mem_classify/model_indicator.sh
```
