## Motivation

Advanced mathematical problem-solving represents a cornerstone of artificial intelligence capabilities, demanding intricate logical reasoning, sophisticated pattern analysis, and systematic multi-step solution development. Text-based mathematical challenges test models' ability to comprehend abstract mathematical frameworks, deploy optimal problem-solving methodologies, and derive accurate conclusions through rigorous analytical processes without visual assistance. Existing language models frequently encounter difficulties with mathematical tasks that necessitate profound comprehension of mathematical theories, complex computational sequences, and robust logical reasoning pathways.

This challenge targets the essential requirement to advance language models' mathematical problem-solving proficiency through sophisticated training methodologies. The capacity to tackle intricate mathematical questions has substantial implications for educational technology, intelligent tutoring platforms, academic research support, and broader artificial intelligence advancement, establishing this as a crucial benchmark for assessing analytical reasoning competencies in mathematical contexts.

## Task

This challenge centers on improving the mathematical problem-solving capabilities of language models through advanced fine-tuning and data enhancement strategies. Your goal is to make the final result as high as possible.

You need to:

1. **Dataset Enhancement and Expansion**: Use the provided training set and models to generate supplementary training examples through diverse enrichment methods, including knowledge transfer, strategic data sampling, chain-of-thought (CoT) generation, and structural reorganization. This may take several hours, so you MUST be patient.

   Save your dataset to `/workspace/task/repositories/LLaMA-Factory/data/training_datas.jsonl`. The data format is described in `/workspace/task/repositories/LLaMA-Factory/data/README.md`. LLaMA-Factory will automatically separate the dev and train sets from this file.

2. **Model Optimization**: Improve the designated model (`Qwen2.5-7B-Instruct`) using the enhanced dataset to boost performance on mathematical reasoning challenges. This may involve various training methodologies, including supervised fine-tuning (SFT). The process may take several hours, so you MUST be patient.

3. **Performance Assessment**: Evaluate the optimized model trained from `Qwen2.5-7B-Instruct` against the test dataset to quantify the improvement, using the inference script and the eval action.
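
The dataset step above could be sketched as follows. This is a minimal sketch, not the required pipeline: it assumes the Alpaca-style field names (`instruction`, `input`, `output`) and the `columns` mapping described in LLaMA-Factory's data README, so confirm both against that README before training. The helper names (`to_sft_record`, `dataset_info_entry`, `write_jsonl`) and the prompt suffix are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical prompt suffix; keep it identical to the prompt used at inference time.
PROMPT_SUFFIX = "\nPlease reason step by step, and put your final answer within \\boxed{}."


def to_sft_record(question: str, solution: str) -> dict:
    """Build one Alpaca-style SFT record (field names are an assumption)."""
    return {
        "instruction": question.strip() + PROMPT_SUFFIX,
        "input": "",
        "output": solution.strip(),
    }


def dataset_info_entry(file_name: str) -> dict:
    """Entry to merge into dataset_info.json under a dataset name of your choice."""
    return {
        "file_name": file_name,
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    }


def write_jsonl(records, path) -> int:
    """Write records as JSON Lines and return how many were written."""
    records = list(records)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

In practice, `solution` would be a verified chain-of-thought produced by one of the auxiliary models rather than the bare reference answer, and the output path would be the `training_datas.jsonl` file named above.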

All work should be conducted under the `/workspace` directory.

The output file should be stored in `/workspace/data/outputs/answer.jsonl`.

You may submit the answer file above up to 3 times (each time with different training data and its corresponding inference result). Try your best to achieve the highest score.


## Data

The dataset comprises three primary components:

### Training Set
- **Path**: `/workspace/data/datasets/train.jsonl`
- **Description**: 817 mathematical problem-solving questions
- **Schema**: `["question_id", "question", "answer"]`
- **Structure**: Each record contains a comprehensive mathematical challenge with its corresponding solution

### Validation Set  
- **Path**: `/workspace/data/datasets/valid.jsonl`
- **Description**: 100 validation problems in the same format as the training set
- **Schema**: `["question_id", "question", "answer"]`
- **Function**: Model validation and hyperparameter tuning

### Test Set
- **Path**: `/workspace/data/datasets/test.jsonl`
- **Description**: 435 test problems (answer field excluded)
- **Schema**: `["question_id", "question"]`
- **Function**: Final model performance evaluation
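
The schemas listed above can be checked while loading the files. The following is a minimal sketch; `load_jsonl` and the two field tuples are hypothetical names, and the only assumption about the data is that each line is one JSON object with the fields given in the schemas.

```python
import json

TRAIN_FIELDS = ("question_id", "question", "answer")  # train and validation sets
TEST_FIELDS = ("question_id", "question")             # test set (no answer field)


def load_jsonl(path, required_fields):
    """Load a JSONL file, failing early if a record lacks a required field."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [k for k in required_fields if k not in row]
            if missing:
                raise ValueError(f"{path}:{line_no} missing fields {missing}")
            rows.append(row)
    return rows
```

For example, `load_jsonl("/workspace/data/datasets/train.jsonl", TRAIN_FIELDS)` would return the training records as a list of dictionaries.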

### Model Resources
- **Main Model**: Qwen2.5-7B-Instruct (`/workspace/data/checkpoints/Qwen2.5-7B-Instruct`). Fine-tune your model starting from this checkpoint.

- **Auxiliary Models** (for data enhancement/distillation):
  - DeepSeek-R1-Distill-Qwen-7B (`/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-7B`)
  - DeepSeek-R1-Distill-Qwen-32B (`/workspace/data/checkpoints/DeepSeek-R1-Distill-Qwen-32B`)
  - QwQ-32B (`/workspace/data/checkpoints/QwQ-32B`)

## Constraints

- **Time Constraints**: The complete pipeline, including data preparation, training, validation, and test evaluation, must finish within 48 hours
- **Computational Resources**: Limited to 8 x 80GB GPUs for the entire workflow
- **Model Usage**: Final implementation must utilize exclusively the main model; supporting models are restricted to dataset enhancement and knowledge transfer purposes only

## Evaluation

### Output Format Requirements

**File Location**: `/workspace/data/outputs/answer.jsonl`

**File Format**: JSONL (JSON Lines) - one JSON object per line

**Required Fields**: Each line must contain exactly these two fields:
- `"question_id"` (string): The unique identifier for each question (must match the question_id from test set)
- `"answer"` (string): The final mathematical answer or expression

**Answer Format**: 
- Provide the final numerical answer, mathematical expression, or symbolic result
- Can include LaTeX formatting for mathematical expressions (e.g., `"$\\frac{600}{7}$"`, `"$\\sqrt{15}$"`)
- Can be plain numbers (e.g., `"9"`, `"1002001"`, `"140"`)
- Can be multiple values separated by commas (e.g., `"$69$,$84$"`)
- Do NOT include explanations, reasoning, or solution steps
- For multiple answers, separate with commas


**IMPORTANT Notes**:
- File must contain answers for ALL test questions (435 questions)
- Question IDs are strings and must exactly match those in the test set
- Mathematical expressions should use proper LaTeX formatting when appropriate
- The evaluation uses mathematical equivalence checking, so `"3/4"` and `"$\\frac{3}{4}$"` are considered equal
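
Before submitting, the format rules above can be checked mechanically. This is a minimal sketch of such a check, not part of the provided scripts; `check_answers` is a hypothetical helper that takes the test-set question IDs and the raw lines of `answer.jsonl`.

```python
import json


def check_answers(test_ids, answer_lines):
    """Return a list of problems found; an empty list means the file looks valid."""
    problems = []
    seen = set()
    for n, line in enumerate(answer_lines, start=1):
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(row, dict) or set(row) != {"question_id", "answer"}:
            problems.append(f"line {n}: fields must be exactly question_id and answer")
            continue
        qid, answer = row["question_id"], row["answer"]
        if not isinstance(qid, str) or not isinstance(answer, str):
            problems.append(f"line {n}: question_id and answer must be strings")
            continue
        if qid in seen:
            problems.append(f"line {n}: duplicate question_id {qid!r}")
        seen.add(qid)
    expected = set(test_ids)
    missing = expected - seen
    extra = seen - expected
    if missing:
        problems.append(f"{len(missing)} test question(s) have no answer")
    if extra:
        problems.append(f"{len(extra)} answer(s) have an unknown question_id")
    return problems
```

Running this against the IDs from `test.jsonl` catches missing questions and non-string IDs before a submission attempt is spent.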

## Evaluation Metrics

- Val Set: baseline = 7%

- Test Set: Human minimum = 25.9%, Human maximum = 65%

- The final score is calculated based on the accuracy of the test set.
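
For quick local estimates on the validation set, accuracy can be approximated as below. This is a rough proxy under stated assumptions and does not reproduce the logic of the provided `judge.py`: the hypothetical `normalize` helper only strips `$` and spaces, rewrites simple integer `\frac{a}{b}` forms, and canonicalizes plain rationals, so it will undercount answers that are equivalent in other ways.

```python
import re
from fractions import Fraction

# Matches \frac{a}{b} or \dfrac{a}{b} with integer numerator and denominator only.
_FRAC = re.compile(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}")


def normalize(ans: str) -> str:
    """Canonicalize a final answer string for a crude equality check."""
    s = ans.strip().replace("$", "").replace(" ", "")
    s = _FRAC.sub(r"\1/\2", s)
    try:
        return str(Fraction(s))  # "0.75" and "3/4" both become "3/4"
    except (ValueError, ZeroDivisionError):
        return s  # not a plain rational; compare as text


def accuracy(preds: dict, golds: dict) -> float:
    """Fraction of gold question IDs whose prediction matches after normalization."""
    if not golds:
        return 0.0
    hits = sum(
        normalize(preds.get(qid, "")) == normalize(gold)
        for qid, gold in golds.items()
    )
    return hits / len(golds)
```

Both arguments map `question_id` to an answer string; questions without a prediction count as wrong.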

## Environment

### File Structure
```
workspace/
├── data/
│   ├── checkpoints/
│   │   ├── DeepSeek-R1-Distill-Qwen-7B/   # read-only directory
│   │   ├── DeepSeek-R1-Distill-Qwen-32B/  # read-only directory
│   │   ├── Qwen2.5-Math-7B/               # read-only directory
│   │   └── QwQ-32B/                       # read-only directory
│   ├── datasets/
│   │   ├── train.jsonl                    # do not modify this file
│   │   ├── val.jsonl                      # do not modify this file
│   │   └── test.jsonl                     # do not modify this file
│   └── outputs/
└── task/
    ├── repositories/
    │   └── LLaMA-Factory/       
    ├── scripts/                           # you can add scripts here
    │   ├── utils/
    │   ├── hfd.sh                         # read-only file
    │   ├── inference.py
    │   ├── inference.sh                   # example script for running evaluation
    │   ├── judge.py                       # example script for running evaluation
    │   ├── judge.sh                       # example script for running evaluation
    │   └── training.sh                         
    └── task_description.md
```

### Execution Environment

A pre-configured Conda environment, `/workspace/conda`, has been provided and activated for this task. This environment includes the necessary packages for supervised fine-tuning using LLaMA-Factory.

## Scripts

### Available Resources
- **LLaMA-Factory**: Located at `/workspace/task/repositories/LLaMA-Factory` for supervised optimization
- **Custom Scripts**: Develop and modify scripts in `/workspace/task/scripts/` directory
- **Reference Scripts**: Existing scripts in the scripts directory can be referenced and adapted as needed, including `inference.sh` and `judge.sh` for evaluation demonstrations
- **Training Scripts**: Reference existing scripts, including `/workspace/task/scripts/training.sh`, for model training. The data format is described in `/workspace/task/repositories/LLaMA-Factory/data/README.md`. Save your training set properly before training.
- **Downloading**: If you want to download a dataset, you can get it from `hf-mirror` or `modelscope`. An example command is `/workspace/task/scripts/hfd.sh dataset_name --dataset --tool aria2c -x 16`; you may need to add other parameters.
- **Training Time**: Adjust `num_train_epochs` in `/workspace/task/repositories/LLaMA-Factory/training_config.yaml` to control the training time.
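
Changing `num_train_epochs` can be scripted when several runs are planned. The following is a minimal sketch using only the standard library; `set_num_train_epochs` is a hypothetical helper, and it assumes the key appears exactly once on its own line in the config, which should be verified against the actual file.

```python
import re

# Captures the key and separator so that only the value is replaced.
_EPOCHS = re.compile(r"^([ \t]*num_train_epochs[ \t]*:[ \t]*)[^\s#]+", re.MULTILINE)


def set_num_train_epochs(yaml_text: str, epochs) -> str:
    """Return yaml_text with the num_train_epochs value replaced by `epochs`."""
    new_text, count = _EPOCHS.subn(lambda m: f"{m.group(1)}{epochs}", yaml_text)
    if count != 1:
        raise ValueError(f"expected exactly one num_train_epochs key, found {count}")
    return new_text
```

The function works on the file's text rather than a parsed structure, so comments and key order in the config are preserved.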

## Suggestions
1. Run inference with a strong model, check its outputs, and select the correct ones to create the answers.
2. Leverage LLaMA-Factory for effective supervised fine-tuning with techniques like LoRA or full parameter optimization.
3. Apply advanced prompting techniques including chain-of-thought reasoning and domain-specific prompt design.
4. Utilize available evaluation frameworks for thorough model performance analysis.
5. You can change the dataset info in `/workspace/task/repositories/LLaMA-Factory/data/dataset_info.json`; we suggest reading `/workspace/task/repositories/LLaMA-Factory/data/README.md` first.
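
The first suggestion amounts to rejection sampling: generate several solutions per training question and keep only those whose final answer matches the reference. A minimal sketch of the filtering step follows. It assumes the generating model was prompted to put its final answer in `\boxed{...}`, which is a prompting choice and not something the task guarantees; `extract_boxed` and `select_correct` are hypothetical helpers, and the default comparison is plain text equality.

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces, e.g. a truncated generation


def select_correct(generations, reference, is_equal=None):
    """Keep the generations whose boxed answer matches the reference answer."""
    if is_equal is None:
        # Assumption: whitespace-insensitive text equality; swap in a stronger check.
        is_equal = lambda a, b: a.replace(" ", "") == b.replace(" ", "")
    kept = []
    for gen in generations:
        pred = extract_boxed(gen)
        if pred is not None and is_equal(pred, reference):
            kept.append(gen)
    return kept
```

Passing a more tolerant `is_equal`, such as one based on mathematical equivalence, would retain more correct generations than text equality does.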
