# SmolTraces Datasets

This dataset contains high-quality reasoning traces generated by DeepSeek-R1 or GPT-4o. It is designed for training and evaluating the reasoning capabilities of language models.

## Dataset Structure

The dataset is organized in the following structure:

- `samples.json`: Main JSON file containing all samples
- `samples.jsonl`: Line-delimited JSON version of the dataset (more efficient for large datasets)
- `metadata.json`: Metadata about the dataset, including version and sample counts
- Individual sample files (only if saved with the `--save_individual` flag)

Each sample contains:
- `question`: The original question posed to the model
- `thinking`: The step-by-step reasoning process generated by the model (DeepSeek-R1 or GPT-4o)
- `answer`: The extracted final answer (without the LaTeX `\boxed{}` notation)
- `expected_answer`: The ground truth answer
- `domain`: The domain or category of the problem
- `dataset`: The source dataset of the problem
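
As a minimal sketch of how the line-delimited file could be read, the snippet below loads `samples.jsonl` and checks that each record carries the six fields listed above. The helper name `load_samples` is hypothetical and not part of this repository, and the check assumes every sample includes all six fields.

```python
import json
from pathlib import Path

# Field names taken from the list above.
EXPECTED_FIELDS = {
    "question", "thinking", "answer",
    "expected_answer", "domain", "dataset",
}

def load_samples(path):
    """Read a line-delimited JSON file and return a list of sample dicts.

    Hypothetical helper: raises ValueError if a record lacks an expected field.
    """
    samples = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            sample = json.loads(line)
            missing = EXPECTED_FIELDS - sample.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            samples.append(sample)
    return samples
```

Calling `load_samples("samples.jsonl")` returns one dictionary per sample. Reading line by line avoids parsing the whole file as a single JSON document, which is why the JSONL version scales better for large datasets.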


## Important Notes

- DeepSeek-R1 API responses are slow (5-15 minutes per call)
- The API may return malformed JSON even when the status code is 200
- The script automatically retries failed calls with increasing delays
- Consider reducing `max_workers` if you see too many failed attempts
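
The retry behavior described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the script's actual implementation: the notes only say the delays increase, so the doubling schedule, the function name `call_with_retries`, and the `fetch` callable (returning a status code and a body string) are all hypothetical.

```python
import json
import time

def call_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until its body parses as JSON, waiting longer after each failure.

    fetch is a hypothetical zero-argument callable returning (status_code, body_text).
    The delay doubles each time (an assumption; the source only says "increasing").
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            status, body = fetch()
            if status != 200:
                raise RuntimeError(f"unexpected status {status}")
            # A 200 status does not guarantee valid JSON, so parse defensively.
            return json.loads(body)
        except (RuntimeError, json.JSONDecodeError) as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error
```

Treating a parse failure the same way as a bad status code means a malformed 200 response triggers a retry instead of crashing the worker. The `sleep` parameter is injectable so the backoff can be tested without waiting.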
