# Reasoning Trace Analysis

This repository contains code for analyzing reasoning traces, generating synthetic datasets, and fine-tuning language models to improve mathematical reasoning. The implementation is based on research exploring how different types of reasoning traces affect model performance.

## Overview

We investigate how different types of reasoning traces impact model performance on mathematical reasoning tasks. Our approach includes:

1. **Analyzing reasoning traces** to identify pivot types and patterns
2. **Creating synthetic datasets** with controlled reasoning structures
3. **Fine-tuning models** on these datasets
4. **Evaluating performance** on mathematical reasoning benchmarks


## Datasets

Our research involves three main datasets:

1. **SmolTraces (ST)**: High-quality reasoning traces generated by DeepSeek R1
2. **SmolTraces-HardCoded (ST-HC)**: Synthetic dataset with templated reasoning patterns, generated with GPT-4o
3. **SmolTraces-HardCoded-Wrong (ST-HC-W)**: Modified ST-HC with incorrect but plausible answers

## Usage

### Dataset Generation

The `generate_st_samples.py` script generates the SmolTraces (ST) dataset using DeepSeek's R1 reasoning model:

```bash
python -m data_generation.generate_st_samples \
    --output_dir "datasets/SmolTraces-R1"
```

### Command Line Arguments

- `--start_idx`: Starting index for processing (default: 0)
- `--num_samples`: Number of samples to process (default: 10)
- `--max_workers`: Maximum number of concurrent workers (default: 3)
- `--chunk_size`: Size of each processing chunk (default: 5)
- `--output_dir`: Output directory (default: "datasets/SmolTraces-R1")
- `--save_individual`: Save individual sample files for debugging

**Note:** DeepSeek R1 API responses can take 5-15 minutes per call, so dataset generation will be time-consuming.
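Because each call is slow, it helps to see how `--num_samples`, `--chunk_size`, and `--max_workers` could interact. The following is a minimal sketch, not the script's actual implementation: `fetch_trace` is a hypothetical callable standing in for one API call, and the helper names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(indices, chunk_size):
    """Yield consecutive slices of `indices`, each at most `chunk_size` long."""
    for i in range(0, len(indices), chunk_size):
        yield indices[i:i + chunk_size]


def process_range(fetch_trace, start_idx=0, num_samples=10, max_workers=3, chunk_size=5):
    """Process samples chunk by chunk, with up to `max_workers` calls in flight.

    `fetch_trace` is a hypothetical stand-in for one slow API call.
    """
    indices = list(range(start_idx, start_idx + num_samples))
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in chunked(indices, chunk_size):
            # pool.map preserves input order, so results stay aligned with indices.
            results.extend(pool.map(fetch_trace, chunk))
            # A chunk boundary is a natural place to checkpoint results to disk.
    return results
```

Under these assumptions, a smaller chunk size means more frequent checkpoints, while `max_workers` bounds how many requests run concurrently within a chunk.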

### Dataset Output

Each run produces:
1. The generated dataset, saved to the specified output directory in both JSON and JSONL formats
2. Metadata that tracks dataset statistics and version information
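The outputs described above could be written roughly as follows. This is a sketch only: the file names (`dataset.json`, `dataset.jsonl`, `metadata.json`) and the metadata fields are illustrative assumptions, not the names the script necessarily uses.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_dataset(samples, output_dir, version="0.1"):
    """Write samples as JSON and JSONL, plus a small metadata file.

    File names and metadata fields are illustrative assumptions.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One JSON array holding every sample.
    (out / "dataset.json").write_text(json.dumps(samples, indent=2), encoding="utf-8")

    # One JSON object per line, convenient for streaming large datasets.
    with (out / "dataset.jsonl").open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")

    metadata = {
        "num_samples": len(samples),
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```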

### Generating Synthetic Datasets
Generate the ST-HC dataset:
```bash
python data_generation/hardcoded_trace_generation.py \
  --input_dataset "datasets/combined_seed_datasets.json" \
  --api_key "your-api-key"
```

Generate the ST-HC-W dataset:
```bash
python data_generation/wrong_answer_generation.py \
  --input_dataset "datasets/combined_seed_datasets.json" \
  --api_key "your-api-key"
```

### Dataset Processing

Balance the ST and ST-HC datasets so that they contain the same number of samples:
```bash
python data_generation/balance_datasets.py \
  --st_path "datasets/SmolTraces-R1" \
  --st_hc_path "datasets/SmolTraces-HC" \
  --output_dir "datasets/balanced"
```
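One simple way to balance two datasets is to downsample both to the size of the smaller one. The sketch below illustrates that idea; it is an assumption about the strategy, and `balance_datasets.py` may match samples differently (for example, by question).

```python
import random


def balance(st_samples, st_hc_samples, seed=0):
    """Downsample so both lists have the same number of samples.

    Assumes uniform random downsampling; the real script may use another rule.
    """
    target = min(len(st_samples), len(st_hc_samples))
    rng = random.Random(seed)  # a fixed seed keeps the chosen subset reproducible
    return rng.sample(st_samples, target), rng.sample(st_hc_samples, target)
```

Note that `random.Random.sample` returns items in random order, so the original ordering is not preserved.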

Decontaminate datasets by removing samples that overlap with evaluation questions:
```bash
python data_generation/decontaminate_datasets.py \
  --seed_paths "datasets/SmolTraces-R1" "datasets/SmolTraces-HC" \
  --eval_paths "datasets/eval_data/HuggingFaceH4_MATH-500" ... \
  --output_dir "datasets/decontaminated"
```
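Decontamination of this kind can be sketched as an exact match on normalized question text. The normalization rule and the `question` key below are assumptions made for illustration; the actual script may use a different field or a fuzzier overlap test.

```python
import re


def normalize(question):
    """Lowercase and strip all whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", "", question.lower())


def decontaminate(train_samples, eval_questions, key="question"):
    """Drop training samples whose normalized question appears in the evaluation set.

    `key` is a hypothetical field name for the question text.
    """
    blocked = {normalize(q) for q in eval_questions}
    return [s for s in train_samples if normalize(s[key]) not in blocked]
```

Operators and punctuation are kept deliberately, since in math problems `2+2` and `2*2` should not be treated as the same question.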

### Training and Evaluation

For training and evaluation, we use the code provided in the [SkyThought](https://github.com/NovaSky-AI/SkyThought/tree/main/skythought) repository.


## Benchmarks

We evaluate models on the following benchmarks:

- **MATH500**: A 500-problem subset of the MATH benchmark of competition mathematics problems
- **AIME2024**: Problems from the 2024 American Invitational Mathematics Examination
- **GPQA-Diamond**: The Diamond subset of GPQA, a benchmark of graduate-level science questions

## Analysis Metrics

We analyze:

- **Accuracy** across benchmarks
- **Pivot counts** in reasoning traces
- **Token usage** in successful vs. unsuccessful attempts
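As a rough illustration of how these metrics could be aggregated, the sketch below summarizes a list of per-sample results. The record keys (`correct`, `pivots`, `tokens`) are hypothetical and chosen only for this example.

```python
from statistics import mean


def summarize(results):
    """Aggregate accuracy, mean pivot count, and mean token usage by outcome.

    Each result is a dict with hypothetical keys: `correct` (bool),
    `pivots` (int), and `tokens` (int).
    """
    correct = [r for r in results if r["correct"]]
    wrong = [r for r in results if not r["correct"]]
    return {
        "accuracy": len(correct) / len(results) if results else 0.0,
        "mean_pivots": mean(r["pivots"] for r in results) if results else 0.0,
        # None signals that there were no samples with that outcome.
        "mean_tokens_correct": mean(r["tokens"] for r in correct) if correct else None,
        "mean_tokens_wrong": mean(r["tokens"] for r in wrong) if wrong else None,
    }
```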


## Setup

1. Install requirements:
   ```bash
   pip install -r requirements.txt
   ```

2. Create a `.env` file with your API keys:
   ```
   OPENAI_API_KEY=your_openai_key_here
   DEEPSEEK_API_KEY=your_deepseek_key_here
   ```
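Before starting a long generation run, it can be useful to confirm that both keys are visible to the process. The check below is a small sketch that assumes the `.env` values have already been loaded into the environment (for example, by a dotenv-style loader); the helper name is hypothetical.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPSEEK_API_KEY")


def missing_keys(env=None):
    """Return the required key names that are unset or empty in `env`.

    Defaults to the current process environment when `env` is not given.
    """
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```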


## Analyzing Reasoning Traces

The `trace_analysis.py` script in the `analysis` directory provides two methods for analyzing reasoning traces:

1. **Regex-based analysis** (default): Uses regular expressions to identify pivots and structures in reasoning traces.
2. **GPT-based analysis**: Uses GPT-4o-mini to identify pivots and structures through API calls.
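To make the regex-based approach concrete, here is a minimal sketch of pivot counting. The pivot categories and cue phrases are hypothetical examples chosen for illustration; the patterns in `trace_analysis.py` may differ.

```python
import re
from collections import Counter

# Hypothetical pivot categories and cue phrases, for illustration only.
PIVOT_PATTERNS = {
    "backtrack": re.compile(r"\b(wait|hold on|actually)\b", re.IGNORECASE),
    "verify": re.compile(r"\b(let me (double[- ])?check|to verify)\b", re.IGNORECASE),
    "alternative": re.compile(r"\b(alternatively|another (way|approach))\b", re.IGNORECASE),
}


def count_pivots(trace):
    """Count non-overlapping matches of each pivot pattern in one trace."""
    return Counter({
        name: sum(1 for _ in pattern.finditer(trace))
        for name, pattern in PIVOT_PATTERNS.items()
    })
```

Keyword matching like this is fast and deterministic but can miss paraphrased pivots, which is one motivation for offering a model-based alternative.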

### Usage Examples

#### Regex-based Analysis (Default)
```bash
python -m analysis.trace_analysis \
    --dataset "insert-dataset-name-here" \
    --split "train" \
    --output_dir "analysis_results/insert-dataset-name-here"
```

### Command Line Arguments

- `--dataset`: Dataset name on Hugging Face (required)
- `--split`: Dataset split to analyze (default: "train[:500]")
- `--output_dir`: Directory to save analysis results (default: "analysis_results")
- `--num_templates`: Number of trace templates to extract (default: 5)
- `--no_viz`: Skip visualization generation (faster)
- `--use_gpt`: Use GPT-4o-mini for pivot and structure identification
- `--api_key`: OpenAI API key (optional, will use .env file if not provided)
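For readers who want to see how these flags fit together, the sketch below declares them with `argparse` using the documented names and defaults. It is an illustration of the interface described above, not a copy of the script's own parser; the help strings are paraphrased.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Analyze reasoning traces.")
    parser.add_argument("--dataset", required=True, help="Dataset name on Hugging Face")
    parser.add_argument("--split", default="train[:500]", help="Dataset split to analyze")
    parser.add_argument("--output_dir", default="analysis_results", help="Where to save results")
    parser.add_argument("--num_templates", type=int, default=5, help="Trace templates to extract")
    parser.add_argument("--no_viz", action="store_true", help="Skip visualization generation")
    parser.add_argument("--use_gpt", action="store_true", help="Use GPT-4o-mini for analysis")
    parser.add_argument("--api_key", default=None, help="OpenAI API key; falls back to .env")
    return parser
```

For example, `build_parser().parse_args(["--dataset", "my-dataset", "--use_gpt"])` would select the GPT-based analysis while keeping every other default.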