

# Data Building Guide

This guide outlines the directory structure, dataset composition, and processing pipeline for building training and evaluation data for the UCPO project.

## 📂 Directory Structure

Organize your `data/` directory as follows to ensure compatibility with the processing scripts:

```text
data/
├── general/           # General knowledge datasets (e.g., MMLU, GPQA)
│   ├── gpqa_diamond.json
│   └── mmlu_redux2_processed_test.json
├── math/              # Mathematical reasoning datasets
│   ├── aime/
│   ├── amc/
│   ├── math/
│   ├── minerva/
│   ├── olympiad_bench/
│   └── dataset_dict.json
├── train/             # Final processed training data (Parquet format)
│   ├── math_dapo_17k_processed_uc.parquet
│   └── mmlu_redux2_processed_train_uc.parquet
├── process_data.py    # Script for preprocessing general task datasets
└── jsonl2parquet.py   # Script for converting JSON/JSONL to Parquet
```
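
Before running the scripts, it can help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of the project: `EXPECTED_PATHS` and `find_missing` are hypothetical names, and the list simply mirrors the entries shown in the tree. Remove any entries that are only produced by later processing steps.

```python
from pathlib import Path

# Entries copied from the directory tree, relative to data/.
# Hypothetical helper list; adjust it to match your checkout.
EXPECTED_PATHS = [
    "general/gpqa_diamond.json",
    "general/mmlu_redux2_processed_test.json",
    "math/aime",
    "math/amc",
    "math/math",
    "math/minerva",
    "math/olympiad_bench",
    "process_data.py",
    "jsonl2parquet.py",
]


def find_missing(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("data/ layout looks complete.")
```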

---

## 📊 Dataset Composition

We use a hybrid approach combining mathematical reasoning and general-purpose knowledge to evaluate model performance and uncertainty calibration.

### 1. Reasoning Tasks (Math & Logic)

Used to evaluate logical consistency and depth within the model's knowledge boundaries.

* **Training Source**: [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k)
* **Evaluation Benchmarks**: AIME 2024, AMC 2024, MATH500, Minerva, and Olympiad Bench. Processed data source: [understand-r1-zero](https://github.com/sail-sg/understand-r1-zero)

### 2. General Tasks (Multiple Choice)

Used to analyze overconfidence and forced-choice behavior in constrained scenarios.

* **MMLU-Redux2**: [Link](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) (1,000 samples for testing; remainder for training).
* **GPQA-Diamond**: [Link](https://huggingface.co/datasets/Idavidrein/gpqa).
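
The 1,000-sample test split of MMLU-Redux2 can be reproduced with a seeded shuffle. This is only a sketch under assumptions: the guide does not say how the test samples are selected, so the random shuffle, the `seed` value, and the function name `split_test_train` are all hypothetical.

```python
import random


def split_test_train(records, test_size=1000, seed=0):
    """Shuffle a copy of records and return (test, train).

    Assumption: samples are drawn uniformly at random; the project's
    actual selection rule may differ.
    """
    if test_size > len(records):
        raise ValueError("test_size exceeds the number of records")
    shuffled = list(records)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:test_size], shuffled[test_size:]


if __name__ == "__main__":
    # Toy records standing in for the real dataset.
    toy = [{"id": i} for i in range(1500)]
    test, train = split_test_train(toy)
    print(len(test), len(train))
```

Fixing the seed keeps the split stable across runs, so the test set never leaks into the training Parquet file.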

---

## ⚙️ Preprocessing General Tasks

The `process_data.py` script standardizes raw JSON datasets by:

1. Merging the core question with all available options.
2. Mapping numeric indices to choice letters (e.g., `0` → `A`, `1` → `B`).
3. Generating a unified string block for the model.

**Execution:**

```bash
python process_data.py
```

**Output Example (illustrative record in the format of `gpqa_diamond.json`):**

```json
[
  {
    "question": "What is the capital of France?\n\nA. London\nB. Berlin\nC. Paris\nD. Madrid",
    "answer": "C"
  }
]
```
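
The three standardization steps can be sketched as a single function. This is an illustration rather than the code in `process_data.py`: the function name `format_multiple_choice` and its input arguments (`question`, `options`, `answer_index`) are assumptions about the raw record layout.

```python
import string


def format_multiple_choice(question, options, answer_index):
    """Merge a question with its options and map the answer index to a letter.

    Hypothetical helper; the raw field names in the real datasets may differ.
    """
    letters = string.ascii_uppercase
    if len(options) > len(letters):
        raise ValueError("too many options to label with single letters")
    if not 0 <= answer_index < len(options):
        raise ValueError("answer_index is out of range for the given options")
    # Step 2: numeric index -> choice letter (0 -> A, 1 -> B, ...).
    option_lines = [f"{letters[i]}. {text}" for i, text in enumerate(options)]
    # Steps 1 and 3: question and options merged into one string block.
    return {
        "question": question + "\n\n" + "\n".join(option_lines),
        "answer": letters[answer_index],
    }


if __name__ == "__main__":
    record = format_multiple_choice(
        "What is the capital of France?",
        ["London", "Berlin", "Paris", "Madrid"],
        2,
    )
    print(record)
```

Called with the arguments shown, the function returns the same `question` and `answer` values as the output example above.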

---

## 📝 Training Format Conversion

The `jsonl2parquet.py` script converts processed JSON or JSONL files into Parquet, the format used for the final training data in `train/`. This step includes:

1. Applying a **universal prompting template** to guide model reasoning.
2. Mapping data to standardized fields: `prompt`, `label`, and `metadata`.

**Execution:**

```bash
python jsonl2parquet.py --local_file data/general/math_dapo_17k_processed.json
```
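
The conversion can be pictured roughly as follows. This is a hedged sketch, not the contents of `jsonl2parquet.py`: `PROMPT_TEMPLATE` is a placeholder for the project's actual universal template, the `metadata` contents are invented for illustration, and the helper names are hypothetical. Only the output field names `prompt`, `label`, and `metadata` come from this guide. Writing Parquet assumes `pandas` with a Parquet engine such as `pyarrow` is installed.

```python
import json

# Placeholder template; the real universal template lives in jsonl2parquet.py.
PROMPT_TEMPLATE = (
    "Answer the following question. Reason step by step, "
    "then state your final answer.\n\n{question}"
)


def load_records(path):
    """Read a JSON array file or a JSONL file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_training_row(record, source):
    """Map one processed record to the prompt/label/metadata fields."""
    return {
        "prompt": PROMPT_TEMPLATE.format(question=record["question"]),
        "label": record["answer"],
        "metadata": {"source": source},  # hypothetical metadata content
    }


def write_parquet(rows, out_path):
    """Write rows to Parquet; needs pandas and a Parquet engine (e.g. pyarrow)."""
    import pandas as pd  # imported lazily so the helpers above stay stdlib-only

    pd.DataFrame(rows).to_parquet(out_path, index=False)


if __name__ == "__main__":
    sample = {"question": "What is 2 + 2?\n\nA. 3\nB. 4", "answer": "B"}
    print(to_training_row(sample, source="demo"))
```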
