# rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

## Sample Data

The `sample_data.json` file contains example records from our dataset. Each record has the following fields:

- `original_question`: The original seed problem from competitive programming platforms
- `synthetic_question`: New problems synthesized based on the seed problem
- `long_cot_solution`: A long chain-of-thought solution generated by a state-of-the-art reasoning model, including both the thinking process and the final code implementation
- `synthetic_test_case`: Generated test cases for the synthetic problem, including:
  - `input`: Test input, covering different scales
  - `consistent_output`: Verified output that passed the mutual verification process

Note: Due to file size limitations, the test cases in `sample_data.json` are only a small sample. The actual data files can be significantly larger, especially for large-scale test cases (e.g., inputs of size 10^5).
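
To get a quick overview of a record, something like the following may help. It is a minimal sketch that assumes the file is a JSON array of objects with the fields listed above, and that `synthetic_test_case` holds either a list of test cases or a single one; the actual layout may differ. `load_records` and `summarize_record` are hypothetical helpers, not part of this repository.

```python
import json
from pathlib import Path


def load_records(path="sample_data.json"):
    """Load the sample file. Assumption: the top-level value is a list of records."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]


def summarize_record(record):
    """Return the size of the main fields of one record."""
    cases = record.get("synthetic_test_case") or []
    if isinstance(cases, dict):  # tolerate a single test case stored as an object
        cases = [cases]
    return {
        "question_chars": len(record.get("synthetic_question", "")),
        "solution_chars": len(record.get("long_cot_solution", "")),
        "num_test_cases": len(cases),
    }


if __name__ == "__main__":
    if Path("sample_data.json").exists():
        for rec in load_records()[:3]:
            print(summarize_record(rec))
```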


## Project Structure

```
src/
├── synthesize_problem_and_input_utility_function.py  # Problem synthesis and input utility function generation
├── test_input_generation.py                          # Test input generation
├── solution_execution.py                             # Solution execution
├── solution_input_mutual_verification.py             # Solution mutual verification
├── prompt.py                                         # Prompt templates
└── utils/                                           # Utility functions
```

## Core Components

### 1. Problem Synthesis
- Location: `synthesize_problem_and_input_utility_function.py`
- Features:
  - Generate new competitive code problems from existing seed problems
  - Create test input generation and validation functions
  - Support both standard I/O and function-based problems
- Classes:
  - `CodeProblemSynthesizer`: Problem synthesizer
  - `UtilityFunctionGenerator`: Test input utility generator
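
As an illustration of what a pair of input generation and validation functions can look like, here is a minimal sketch for a made-up problem whose input is `n` on the first line followed by `n` integers on the second. The problem, its constraints, and the names `generate_test_input` and `validate_test_input` are assumptions for this example, not the functions the synthesizer actually emits.

```python
import random

MAX_N = 10**5      # hypothetical constraint: 1 <= n <= 10^5
MAX_VALUE = 10**9  # hypothetical constraint: 1 <= a_i <= 10^9


def generate_test_input(n, seed=0):
    """Build one input with n random values; seeded so runs are reproducible."""
    rng = random.Random(seed)
    values = [rng.randint(1, MAX_VALUE) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, values))}\n"


def validate_test_input(text):
    """Check that an input matches the format and constraints of the problem."""
    lines = text.strip().split("\n")
    if len(lines) != 2:
        return False
    try:
        n = int(lines[0])
        values = [int(tok) for tok in lines[1].split()]
    except ValueError:
        return False
    return (
        1 <= n <= MAX_N
        and len(values) == n
        and all(1 <= v <= MAX_VALUE for v in values)
    )
```

Keeping the validator separate from the generator means that inputs from any source can be checked against the constraints before a solution is run on them.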

### 2. Test Input Generation
- Location: `test_input_generation.py`
- Features:
  - Parallel generation of test inputs
  - Support for varying scales and complexities
- Highlights:
  - Resource limits and timeout control
  - Automatic parameter scaling
  - Parallel processing optimization
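
One simple way to cover varying scales is to grow the size parameter geometrically up to the problem's upper bound. The sketch below shows that idea only; the scaling strategy used by `test_input_generation.py` is not described here, and `scale_ladder` is a hypothetical helper. Parallelism and timeouts are left out to keep the example short.

```python
def scale_ladder(max_size, base=10):
    """Return sizes 1, base, base^2, ... capped by and ending at max_size."""
    if max_size < 1 or base < 2:
        raise ValueError("max_size must be >= 1 and base must be >= 2")
    sizes = []
    size = 1
    while size < max_size:
        sizes.append(size)
        size *= base
    sizes.append(max_size)  # always include the largest allowed size
    return sizes


if __name__ == "__main__":
    # Sizes to request from an input generator for a bound of 10^5
    print(scale_ladder(10**5))
```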

### 3. Solution Execution
- Location: `solution_execution.py`
- Features:
  - Safe execution of code solutions
  - Support for both stdin/stdout and function call modes
  - Resource control and error handling
- Security Features:
  - CPU and memory limits
  - Timeout control
  - Process isolation
  - I/O handling
  - Error management
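
The combination of limits, timeouts, and a separate process can be sketched with the standard library alone. This is a minimal illustration for stdin/stdout mode, not the code in `solution_execution.py`: `run_solution` and its defaults are assumptions, the `resource` limits apply on POSIX systems only, and a child process by itself is not a full sandbox.

```python
import subprocess
import sys

try:
    import resource  # POSIX only; unavailable on Windows
except ImportError:
    resource = None


def _make_limiter(cpu_seconds, memory_bytes):
    """Build a function that lowers soft limits in the child before it runs."""
    def apply_limits():
        for kind, wanted in ((resource.RLIMIT_CPU, cpu_seconds),
                             (resource.RLIMIT_AS, memory_bytes)):
            try:
                _, hard = resource.getrlimit(kind)
                soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
                resource.setrlimit(kind, (soft, hard))
            except (ValueError, OSError):
                pass  # some platforms reject certain limits; run without them
    return apply_limits


def run_solution(code, stdin_text, timeout=5.0, cpu_seconds=5, memory_bytes=1 << 30):
    """Run Python source in a child process and capture its output."""
    preexec = _make_limiter(cpu_seconds, memory_bytes) if resource else None
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,      # wall-clock limit; the child is killed on expiry
            preexec_fn=preexec,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

Returning a status alongside the captured output lets later stages tell a wrong answer apart from a crash or a timeout.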

### 4. Mutual Verification
- Location: `solution_input_mutual_verification.py`
- Features:
  - Correctness verification through solution consistency
  - Low-quality or incorrect solution filtering
- Mechanism:
  - Determine correct outputs through majority voting
  - Retain high-quality problems based on their consistency ratio
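
The mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in `solution_input_mutual_verification.py`: outputs are compared as stripped strings, a failed run is represented by `None`, and the threshold of 0.6 is an arbitrary example value. `majority_output` and `verify_problem` are hypothetical names.

```python
from collections import Counter


def majority_output(outputs):
    """Return (most common output, share of all solutions that produced it)."""
    valid = [o.strip() for o in outputs if o is not None]
    if not valid:
        return None, 0.0
    # On a tie, Counter keeps the output that was seen first
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(outputs)  # failed runs count against agreement


def verify_problem(outputs_per_input, consistency_ratio=0.6):
    """Return {input: agreed output}, or None if any input lacks enough agreement."""
    accepted = {}
    for test_input, outputs in outputs_per_input.items():
        winner, ratio = majority_output(outputs)
        if winner is None or ratio < consistency_ratio:
            return None
        accepted[test_input] = winner
    return accepted
```

For example, `verify_problem({"2 2": ["4", "4", "5"]})` accepts `"4"` because two of three solutions agree, while three different outputs for the same input would reject the problem.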

## Usage

### 1. Problem Synthesis
```bash
python src/synthesize_problem_and_input_utility_function.py \
  --qaf <input_file> \
  --synthesizer code_problem_synthesis \
  --query_model <model_name> \
  --output_json <output_file>
```

### 2. Test Input Generation
```bash
python src/test_input_generation.py \
  --input_file <input_file> \
  --output_file <output_file> \
  --num_processes <processes> \
  --batch_size <batch_size>
```

### 3. Solution Execution
```bash
python src/solution_execution.py \
  --input_file <input_file> \
  --output_file <output_file> \
  --max_cases <num_cases>
```

### 4. Mutual Verification
```bash
python src/solution_input_mutual_verification.py \
  --execution_result_paths <result_paths> \
  --output_path_prefix <output_prefix> \
  --consistency_ratio <ratio>
```

## System Requirements

- Python 3.9+
- Dependencies:
  - pandas
  - numpy
  - pebble (parallel processing)
  - cyaron (test case generation)


