## Motivation

The quality of training data is crucial for developing effective pre-trained models. Large language model pre-training relies heavily on web corpora, which must be cleaned before use. This assignment focuses on data cleaning strategies to improve the quality of raw web data.

## Task

You will be provided with 100,000 raw web data entries from various sources. Your task is to design and implement a systematic approach to clean this web data. You need to design a clear methodology and develop your own cleaning scripts to process documents and remove noise while maintaining data consistency and improving data quality.

Work in the `/workspace/task` and `/workspace/data/outputs` directories. Read the raw web text data from `raw_web_data.jsonl`, clean it, and save the cleaned data to `/workspace/data/outputs/result_web_data.jsonl`.
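
The read, clean, and write flow described above could be structured roughly as follows. This is a minimal sketch, not a reference solution: `clean_text` and `process_file` are hypothetical names, and the placeholder cleaner only collapses whitespace. The fallback to the raw text is one possible way to keep the "text" field from becoming empty.

```python
import json
import re
from pathlib import Path

# Paths taken from this task description.
INPUT_PATH = Path("/workspace/data/dataset/raw_web_data.jsonl")
OUTPUT_PATH = Path("/workspace/data/outputs/result_web_data.jsonl")


def clean_text(text: str) -> str:
    """Hypothetical placeholder cleaner: only normalizes whitespace."""
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text.strip()


def process_file(src: Path, dst: Path) -> int:
    """Stream src line by line, clean each record, write it to dst.

    Returns the number of records written.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines in the input, if any
            record = json.loads(line)
            cleaned = clean_text(record["text"])
            # Assumption: if cleaning removes everything, keep the raw text
            # so that no entry is dropped or left empty.
            if not cleaned:
                cleaned = record["text"]
            out = {"id": record["id"], "text": cleaned}
            fout.write(json.dumps(out, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    written = process_file(INPUT_PATH, OUTPUT_PATH)
    print(f"wrote {written} records to {OUTPUT_PATH}")
```

Streaming one record at a time keeps memory use flat and preserves the input order, which makes it easy to confirm that every "id" appears in the output.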

You may submit your answer in the file above only once, so aim for the highest score you can.

## Data

#### Dataset

- `raw_web_data.jsonl`: 100,000 raw web data entries from various sources, where each entry is noisy web-scraped text.
  - Fields: "id", "text"
  - Location: `/workspace/data/dataset/raw_web_data.jsonl`

#### Model

- **Qwen**:
    - Size: 1.5B, 7B, 32B
    - Location: All models are located in `/workspace/data/checkpoints/Qwen`


## Constraints

- The output file should contain "id" and "text" fields, and each output "id" must match the "id" of the corresponding input entry.
- GPU Usage: You can use eight 80 GB GPUs to clean the web data.
- During data cleaning, text content should be preserved as much as possible. **The "text" field after cleaning should not be empty.**
- Data cleaning processing time limit: 5 hours on a single machine. You have only one attempt, so make sure the full pipeline can finish within this limit.

## Output Format

The output file `result_web_data.jsonl` should follow this exact format:
```json
{"id": example_id_1, "text": "cleaned text content here"}
{"id": example_id_2, "text": "another cleaned text content"}
```

Requirements:
- Each line must be a valid JSON object
- Each object must contain exactly two fields: "id" and "text"
- The "id" field must match the corresponding entry in the input file and must be an integer
- The "text" field must not be empty after cleaning
- All 100,000 entries must be cleaned; do not delete any entries
- File must be saved at: `/workspace/data/outputs/result_web_data.jsonl`
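
Since only one submission is allowed, it may be worth checking the output against these requirements before submitting. The sketch below is one possible pre-submission check; `validate_output` is a hypothetical helper, not the official evaluator, and treating whitespace-only text as empty is an assumption that is stricter than the stated rule.

```python
import json
from pathlib import Path


def validate_output(input_path: Path, output_path: Path) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    with input_path.open(encoding="utf-8") as f:
        input_ids = [json.loads(line)["id"] for line in f if line.strip()]

    output_ids = []
    with output_path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {lineno}: not valid JSON")
                continue
            if not isinstance(obj, dict) or set(obj) != {"id", "text"}:
                problems.append(f"line {lineno}: fields must be exactly 'id' and 'text'")
                continue
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(obj["id"], int) or isinstance(obj["id"], bool):
                problems.append(f"line {lineno}: 'id' is not an integer")
                continue
            # Assumption: whitespace-only text is treated as empty here.
            if not isinstance(obj["text"], str) or not obj["text"].strip():
                problems.append(f"line {lineno}: 'text' is empty")
            output_ids.append(obj["id"])

    if len(output_ids) != len(input_ids):
        problems.append(
            f"expected {len(input_ids)} entries, found {len(output_ids)} usable entries"
        )
    if set(output_ids) != set(input_ids):
        problems.append("output ids do not match input ids")
    return problems
```

A typical call would be `validate_output(Path("/workspace/data/dataset/raw_web_data.jsonl"), Path("/workspace/data/outputs/result_web_data.jsonl"))`, followed by printing any returned problems.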

## Evaluation

Evaluation Metrics:

1. **Correct Format and Location**: The `result_web_data.jsonl` output file has the correct format and location.
2. **Dataset Size Validation**: `result_web_data.jsonl` contains all 100,000 entries, and every entry has a non-empty text field.
3. **Judge Model Evaluation**: A subset of the cleaned data is scored for quality (1-5 points) using the `Judge-By-the-LLM` method. The cleaned data should score higher than several baselines under the judge model.

Success Criteria:

The final score is calculated using linear interpolation based on your Judge Model score compared to the baselines:

- **Format Validation (5 points)**: `result_web_data.jsonl` output file has correct format and location, and passes dataset size validation
- **Quality Score (95 points)**: Linear interpolation between baseline scores:
  - Score ≤ Baseline_1 (3.0): 0 points
  - Baseline_1 < Score ≤ Baseline_2 (3.5): 0-25 points (linear interpolation)
  - Baseline_2 < Score ≤ Baseline_3 (4.0): 25-95 points (linear interpolation)
  - Score > Baseline_3: 95 points

Reference Baselines:
- Baseline_1 (Raw data): 3.0 points
- Baseline_2 (Simple cleaning): 3.5 points  
- Baseline_3 (Strong cleaning): 4.0 points
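
The piecewise interpolation for the quality score can be written out as a short function. This is a sketch of the rule as stated above, using the three reference baseline values; `quality_points` is a hypothetical name, and the 5 format-validation points are awarded separately and are not included here.

```python
def quality_points(judge_score: float) -> float:
    """Map a Judge Model score (1-5) to quality points (0-95)."""
    b1, b2, b3 = 3.0, 3.5, 4.0  # reference baselines from this document

    if judge_score <= b1:
        return 0.0
    if judge_score <= b2:
        # 0 to 25 points between Baseline_1 and Baseline_2
        return 25.0 * (judge_score - b1) / (b2 - b1)
    if judge_score <= b3:
        # 25 to 95 points between Baseline_2 and Baseline_3
        return 25.0 + 70.0 * (judge_score - b2) / (b3 - b2)
    return 95.0


if __name__ == "__main__":
    for s in (3.0, 3.25, 3.5, 3.75, 4.0, 4.5):
        print(s, quality_points(s))
```

Note that the second segment is much steeper: moving from 3.5 to 4.0 is worth 70 points, while moving from 3.0 to 3.5 is worth only 25.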

## Environment

The environment has been pre-configured for you in `/workspace/conda`, so you can start working directly without additional setup. Tools such as vllm and datatrove are available in this environment.