# Data Directory

Place your training datasets in this directory.

## Expected Format

Each training dataset should be a JSON file containing an array of samples, where every sample has the following structure:

```json
[
  {
    "id": "unique_identifier",
    "video": "path/to/video.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nQuestion with multiple choice options\nA. Option 1\nB. Option 2\nC. Option 3\nD. Option 4"
      },
      {
        "from": "gpt",
        "value": "<think>\nReasoning process explaining the solution step by step\n</think>\n\nThe answer is A."
      }
    ]
  }
]
```
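As an illustration, a record in this structure could be assembled programmatically. The sketch below is only one possible approach: the helper name `make_sample` and its parameters are hypothetical and not part of this repository, and it assumes four options labeled A to D as in the example above.

```python
import json

def make_sample(sample_id, video_path, question, options, reasoning, answer_letter):
    """Build one record matching the structure above (hypothetical helper)."""
    letters = "ABCD"  # assumption: at most four options, labeled A-D
    option_lines = "\n".join(
        f"{letters[i]}. {text}" for i, text in enumerate(options)
    )
    return {
        "id": sample_id,
        "video": video_path,
        "conversations": [
            # The human turn starts with the <video> placeholder, then the question.
            {"from": "human", "value": f"<video>\n{question}\n{option_lines}"},
            # The gpt turn wraps reasoning in <think> tags, then states the answer.
            {
                "from": "gpt",
                "value": f"<think>\n{reasoning}\n</think>\n\nThe answer is {answer_letter}.",
            },
        ],
    }

sample = make_sample(
    "unique_identifier",
    "path/to/video.mp4",
    "Question with multiple choice options",
    ["Option 1", "Option 2", "Option 3", "Option 4"],
    "Reasoning process explaining the solution step by step",
    "A",
)

# A dataset file is a JSON array of such records.
dataset_json = json.dumps([sample], indent=2)
```

Writing `dataset_json` to a `.json` file in this directory would produce a dataset in the expected format.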

## Key Requirements

1. **Chain-of-Thought Format**: Each `gpt` response must wrap its reasoning in `<think>` and `</think>` tags, placed before the final answer.
2. **Multiple Choice**: Questions should list clearly labeled options (A, B, C, D), one per line.
3. **Video Paths**: Each video path must be either absolute or relative to this data directory.
4. **Answer Format**: The final answer, after the closing `</think>` tag, should clearly state the chosen option (e.g. `The answer is A.`).
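The requirements above can be checked before training with a small script. The following is a minimal sketch, not a validator shipped with this repository: the function names `check_sample` and `resolve_video_path` and the regular expressions are assumptions, and the checks are heuristics (for example, they accept any standalone letter A to D after `</think>` as the answer and do not verify that video files exist).

```python
import re
from pathlib import Path

# Non-empty reasoning between <think> and </think>.
THINK_RE = re.compile(r"<think>\s*\S.*?</think>", re.DOTALL)
# Option lines such as "A. Option 1" at the start of a line.
OPTION_RE = re.compile(r"^([A-D])\. ", re.MULTILINE)
# A standalone option letter in the final answer.
ANSWER_RE = re.compile(r"\b([A-D])\b")

def resolve_video_path(data_dir, video):
    """Return absolute paths unchanged; resolve relative ones against data_dir."""
    path = Path(video)
    return path if path.is_absolute() else Path(data_dir) / path

def check_sample(sample):
    """Return a list of problems found in one sample (empty list if none)."""
    problems = []
    turns = sample.get("conversations", [])
    human = next((t.get("value", "") for t in turns if t.get("from") == "human"), "")
    gpt = next((t.get("value", "") for t in turns if t.get("from") == "gpt"), "")

    if not sample.get("video"):
        problems.append("missing video path")
    if not THINK_RE.search(gpt):
        problems.append("missing <think> reasoning")

    # Assumption: labels must be consecutive starting at A, with at least two options.
    labels = OPTION_RE.findall(human)
    if len(labels) < 2 or labels != list("ABCD"[: len(labels)]):
        problems.append("options not clearly labeled")

    # Only inspect the text after the closing </think> tag for the answer.
    final_part = gpt.split("</think>")[-1]
    if not ANSWER_RE.search(final_part):
        problems.append("no option letter in final answer")
    return problems
```

Running `check_sample` over every record loaded with `json.load` gives a quick report of malformed entries, and `resolve_video_path` shows one way to interpret relative paths against this directory.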

## Recommended Datasets

- `Video-R1-260k.json` (for RL training)
- `Video-R1-COT-165k.json` (for SFT pretraining)

Download these datasets from: [Video-R1 HuggingFace](https://huggingface.co/datasets/Video-R1/Video-R1-data)

## File Naming

Use descriptive names for your datasets, e.g.:
- `dual_reasoning_train_4k.json`
- `video_qa_reasoning_dataset.json`
- `multimodal_reasoning_data.json`