# LS-DIff

LS-DIff is a research toolkit for evaluating large language models on math datasets and refining their answers through latent-state optimization.

## Features

- **Configurable evaluation script** – `main.py` exposes command-line arguments for dataset selection, optimization hyperparameters, and run control.
- **Dataset loader with formatted prompts** – `data.py` loads GSM8K, MATH-500, or AIME_2024 samples and applies dataset-specific solver prompts defined in `prompts/solver_prompts.py`.
- **Generation methods**
  - `ori_generation.py` produces the initial answer while storing hidden states for each generated token.
  - `opt_generation.py` updates a subset of hidden states via gradient steps to improve a reward-model score.
- **Answer extraction and scoring** – `extract_judge_answer/utils.py` extracts numeric answers with an auxiliary model and verifies correctness using multiple checks.
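
The latent-state optimization idea behind `opt_generation.py` can be illustrated with a toy example. This is only a sketch under stated assumptions: the reward here is a hypothetical quadratic stand-in rather than the repository's reward model, the gradient is computed by hand instead of by autograd, and `k` is assumed to be the fraction of leading hidden states that get updated while `lr` is the step size. The actual script may choose states and interpret these flags differently.

```python
def toy_reward(states, target):
    """Hypothetical reward: higher when every state vector is close to `target`."""
    return -sum((h - t) ** 2 for vec in states for h, t in zip(vec, target))


def toy_reward_grad(vec, target):
    """Analytic gradient of toy_reward with respect to a single state vector."""
    return [-2.0 * (h - t) for h, t in zip(vec, target)]


def optimize_latents(states, target, lr=0.03, k=0.1, steps=10):
    """Run gradient ascent on the first k-fraction of states; leave the rest untouched.

    Assumption: `k` is a fraction of the sequence length (at least one state is updated).
    """
    n_update = max(1, int(k * len(states)))
    states = [list(vec) for vec in states]  # copy so the input is not mutated
    for _ in range(steps):
        for i in range(n_update):
            grad = toy_reward_grad(states[i], target)
            # Ascent step: move the state in the direction that raises the reward.
            states[i] = [h + lr * g for h, g in zip(states[i], grad)]
    return states, n_update


initial = [[0.0, 0.0] for _ in range(4)]
target = [1.0, 1.0]
updated, n_update = optimize_latents(initial, target, lr=0.1, k=0.5, steps=5)
print(n_update, toy_reward(initial, target), toy_reward(updated, target))
```

In a real model the states would be tensors captured during generation and the gradient would come from backpropagating the reward-model score, but the update rule has the same shape.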

## Project layout

```
.
├── main.py               # Evaluation and optimization pipeline
├── data.py               # Dataset loader
├── ori_generation.py     # Baseline generation
├── opt_generation.py     # Latent-state optimization
├── extract_judge_answer/ # Answer extraction & judging helpers
├── prompts/              # Dataset-specific solver prompts
├── rewards/              # Reward model implementation
└── scripts/              # Utility scripts (run.sh, combine_results.py)
```
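
The helpers in `extract_judge_answer/` are described as extracting a numeric answer and verifying it with multiple checks. The sketch below shows one way such layered checks could look. It is not the repository's implementation: a regular expression stands in for the auxiliary extraction model, and the function names `extract_last_number` and `is_correct` are hypothetical.

```python
import re
from fractions import Fraction


def extract_last_number(text):
    """Regex stand-in for the auxiliary model: return the last number in `text`, or None."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?(?:/\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def to_fraction(value):
    """Parse a decimal or 'a/b' string into an exact Fraction; return None on failure."""
    try:
        return Fraction(value)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(predicted, reference, tol=1e-6):
    """Check 1: exact string match. Check 2: numeric equality within `tol`."""
    if predicted is None:
        return False
    predicted, reference = predicted.strip(), reference.strip()
    if predicted == reference:
        return True
    p, r = to_fraction(predicted), to_fraction(reference)
    if p is None or r is None:
        return False
    return abs(p - r) <= tol


answer = extract_last_number("So 18 + 54 = 72. The answer is 72.")
print(answer, is_correct(answer, "72"))
```

Comparing through `Fraction` lets forms such as `72.0` and `72`, or `3/4` and `0.75`, count as equal without floating-point surprises.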

## Running

Install the required packages, such as `torch`, `transformers`, `datasets`, `tqdm`, and `wandb`, then pass a model checkpoint via `--model_name_or_path`:

```
python main.py \
  --dataset openai/gsm8k \
  --model_name_or_path path/to/model \
  --output_dir results \
  --solver_prompt_idx 0 \
  --lr 0.03 --k 0.1
```

The script logs metrics to Weights & Biases and stores intermediate outputs under the specified `--output_dir`.
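
For orientation, the flags used in the command above could be declared with `argparse` roughly as follows. The flag names come from the example command; the types, defaults, and help strings are assumptions, and `main.py` may define more arguments than shown here.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example command (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Evaluation and optimization pipeline")
    parser.add_argument("--dataset", type=str, default="openai/gsm8k",
                        help="Dataset identifier")
    parser.add_argument("--model_name_or_path", type=str, required=True,
                        help="Path or name of the model checkpoint")
    parser.add_argument("--output_dir", type=str, default="results",
                        help="Directory for intermediate outputs")
    parser.add_argument("--solver_prompt_idx", type=int, default=0,
                        help="Index of the solver prompt to use")
    parser.add_argument("--lr", type=float, default=0.03,
                        help="Step size for the hidden-state updates (assumed meaning)")
    parser.add_argument("--k", type=float, default=0.1,
                        help="Optimization hyperparameter (assumed to be a fraction)")
    return parser


# Parse an explicit argument list instead of sys.argv so the snippet runs anywhere.
args = build_parser().parse_args(["--model_name_or_path", "path/to/model", "--lr", "0.05"])
print(args.dataset, args.lr, args.k)
```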

## Scripts

- `scripts/run.sh` – example command runner
- `scripts/combine_results.py` – merge multiple result files

