## Motivation

Datasets play a crucial role in scientific research. With the advancement of AI engineering capabilities, it becomes critical to evaluate how well AI engineers can find datasets that meet specific requirements and adapt them for model training. This task focuses on text summarization as an example domain to assess these capabilities.

The challenge lies in both discovering existing datasets and synthesizing new data when needed, ensuring that the acquired or generated data can effectively improve model performance on downstream tasks.

## Task

Your task is to work with datasets for text summarization and fine-tune a model to improve its performance. You need to:

1. **Dataset Discovery**: Search for existing public datasets that match the specified criteria
2. **Data Synthesis**: Create high-quality synthetic data samples that can be used for model fine-tuning
3. **Data Processing**: Format all discovered and synthesized data for model fine-tuning
4. **Model Fine-tuning**: Use the curated dataset to fine-tune a Llama-3.1-8B-Instruct model with full parameter fine-tuning
5. **Performance Validation**: Evaluate the fine-tuned model and generate inference results

The specific dataset requirements for this text summarization task are:


**Target Dataset Criteria:**
- **Domain**: Politics
- **Input**: English news articles (often preceded by the prompt "Summarize the following news article in one sentence:")
- **Output**: English one-sentence summaries of the articles
- **Source**: Real-world, human-generated (no synthetic data for existing datasets)
- **Dataset scale**: Approximately 1000+ news article/summary pairs
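
As a rough illustration of the criteria above, one article/summary pair could be turned into a training record as sketched below. The helper name `build_record` and the single-space separator between the prompt and the article are assumptions made for illustration; check the entries in `test.json` for the exact layout before relying on this.

```python
# The instruction text quoted in the target dataset criteria.
PROMPT = "Summarize the following news article in one sentence:"


def build_record(article: str, summary: str) -> dict:
    """Return one training example with the instruction folded into `input`."""
    # Assumption: a single space separates the prompt from the article text.
    return {
        "input": f"{PROMPT} {article.strip()}",
        "output": summary.strip(),
    }
```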

You should work under the directories `/workspace/task` and `/workspace/data`.

You must implement at least one of the two approaches: dataset discovery or data synthesis. After obtaining the data, convert it into a format suitable for fine-tuning. The dataset must be in JSON format with at least `input` and `output` fields, where `input` includes the instruction.

For fine-tuning, you should use full parameter fine-tuning (not LoRA) with the Llama-3.1-8B-Instruct model.

After fine-tuning, use your trained model to generate inference results on the test set and save them to the specified output location.


You can submit your answer in the file above up to 3 times (each with a different reasoning workflow and its corresponding inference result). Try to achieve the highest possible score.

## Data

### Model Checkpoint
The Llama-3.1-8B-Instruct model checkpoint is stored under `/workspace/data/checkpoints/`, at `/workspace/data/checkpoints/Meta-Llama-3.1-8B-Instruct`.

### Test Sets
- **Test Set**: Located at `/workspace/data/datasets/test.json`. It contains the remaining test data without ground truth answers. You need to generate predictions for this set.

The file contains a JSON list in which each element is a dictionary with:
- `input`: Instruction containing summarization directive and original text content
- `output`: empty

### Data Format Requirements
All discovered or synthesized data must be formatted for direct use with model fine-tuning. The format should be compatible with standard instruction-following datasets.

Whether you search for or synthesize datasets, you need to organize your final dataset into JSON files. Each file should contain a JSON list in which each element is a dictionary with two keys: `input` and `output`. The `input` should contain the instruction and input for Llama-3.1-8B-Instruct, and the `output` should contain the expected output for fine-tuning. Place the organized JSON files in `/workspace/data/datasets/` with the names `search_set.json` and `generation_set.json`. (Both can be empty, but you must create them.) Then combine them into `/workspace/data/datasets/training_data.json`.
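
The file layout described above could be produced with a short helper like the following. This is a minimal sketch, not the required implementation: the function names `write_json` and `combine_sets` are hypothetical, and it assumes the records are already dictionaries with `input` and `output` keys.

```python
import json
from pathlib import Path


def write_json(path, records):
    """Write a list of records to `path` as a UTF-8 JSON list."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def combine_sets(dataset_dir, search_records, generation_records):
    """Write search_set.json, generation_set.json and training_data.json."""
    combined = list(search_records) + list(generation_records)
    # Validate before writing so a malformed record never reaches disk.
    for i, record in enumerate(combined):
        if not {"input", "output"} <= set(record):
            raise ValueError(f"record {i} lacks an 'input' or 'output' key")
    dataset_dir = Path(dataset_dir)
    # Both source files are always created, even when one list is empty.
    write_json(dataset_dir / "search_set.json", list(search_records))
    write_json(dataset_dir / "generation_set.json", list(generation_records))
    write_json(dataset_dir / "training_data.json", combined)
    return combined
```

Calling `combine_sets("/workspace/data/datasets", search_records, generation_records)` would then create all three files in the directory named in this section.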

After processing the datasets, use the following script for fine-tuning:
```
cd /workspace/task/repositories/LLaMA-Factory
llamafactory-cli train /workspace/task/scripts/config.yaml
```
(You can change the hyperparameters in `/workspace/task/scripts/config.yaml` if you want.)

## Constraint

- You must work under the directories `/workspace/task` and `/workspace/data`
- You may implement both dataset discovery and data synthesis, and must implement at least one of them
- For dataset search: Focus on publicly available datasets from HuggingFace
- Use full parameter fine-tuning (not LoRA)
- Use Llama-3.1-8B-Instruct model for fine-tuning
- Time constraint: Complete this task efficiently within 48 hours
- GPU Usage: You can use eight 80 GB GPUs for training/inference

## Evaluation

### Inference Requirements
After fine-tuning your model, you need to:
1. Generate predictions for the test set (`/workspace/data/datasets/test.json`) using `/workspace/task/scripts/inference.py` (you may need to change some hyperparameters in the script), and save the inference results to `/workspace/data/outputs/predictions.json`.
2. Make sure the output matches the expected JSON structure: a JSON list with one prediction per test sample, for example:
```json
[
  {
    "input": "Summarize the ...  (This version of the story , corrects to remove “denounces” in the headline and rephrases; in third paragraph, removes incorrect reference to Oracle role in H1-B visa program) ",
    "output": "Oracle Co-CEO questions policies on student visas"
  }
]
```
(Do not change the original input; this example is for reference only.)
The order and number of entries in `/workspace/data/outputs/predictions.json` must match the order and number of entries in `/workspace/data/datasets/test.json`.
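
Before submitting, the structural requirements above can be checked with a small script. The following is a sketch under stated assumptions: `check_predictions` is a hypothetical helper, and treating an empty `output` as a problem is an extra sanity check rather than something the task statement requires.

```python
def check_predictions(test_items, predictions):
    """Return a list of problems; an empty list means the structure matches."""
    problems = []
    if len(test_items) != len(predictions):
        problems.append(
            f"count mismatch: {len(test_items)} test items "
            f"vs {len(predictions)} predictions"
        )
    # Compare position by position so reordered entries are detected.
    for i, (test_item, pred) in enumerate(zip(test_items, predictions)):
        if pred.get("input") != test_item.get("input"):
            problems.append(f"item {i}: input differs from the test set")
        if not str(pred.get("output", "")).strip():
            problems.append(f"item {i}: empty output")
    return problems
```

To use it, load `/workspace/data/datasets/test.json` and `/workspace/data/outputs/predictions.json` with `json.load` and pass the two lists to `check_predictions`.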

### Performance Requirements
The fine-tuned model must demonstrate significant improvement over the baseline (non-fine-tuned model). Specifically:

- The final score is calculated from the ROUGE score on the test set.
- **Higher improvement margins are preferred**

### Evaluation Metrics
ROUGE scores will be computed using the evaluation script located at `/workspace/task/inference.py`.
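
For a quick local sanity check, a simplified unigram-overlap score can be computed without any dependencies. This sketch is not the official evaluation script: it is a plain ROUGE-1 F1 with lowercased whitespace tokenization, so its values may differ from the official scores, and the function name `rouge1_f1` is chosen here for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 using lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```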

### Evaluation Process
1. **Baseline Evaluation**: Test the original Llama-3.1-8B-Instruct model on the test set (baseline ROUGE score: 0.04)
2. **Fine-tuned Model Evaluation**: Test the fine-tuned model on the same test set (fine-tuned ROUGE score: 0.99)
3. **Comparison**: Calculate percentage improvement in ROUGE scores
4. **Success Criteria**: ROUGE metric must show ≥10% improvement over baseline
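
The comparison step can be illustrated with the scores quoted above. The source does not define the formula, so this sketch assumes "percentage improvement" means relative improvement over the baseline; the function name is hypothetical.

```python
def percent_improvement(baseline: float, finetuned: float) -> float:
    """Relative improvement of `finetuned` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Assumption: improvement is measured relative to the baseline score.
    return (finetuned - baseline) / baseline * 100.0


# With the scores quoted in this section (0.04 baseline, 0.99 fine-tuned),
# the relative improvement is about 2375%, well above the 10% threshold.
improvement = percent_improvement(0.04, 0.99)
```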

The evaluation will be conducted in the eval workspace using your generated predictions to determine task success.

## Environment

We have set up and activated the conda environment `/workspace/conda` with all necessary dependencies.

## Script

If you want to download a dataset, you can get it from `hf-mirror` or `modelscope`. Here is an example command:
`/workspace/task/scripts/hfd.sh dataset_name --dataset --tool aria2c -x 16`. You may need to add other parameters.