# DeepResearch Benchmark

A benchmark for evaluating model performance on deep research tasks. It tests a model's ability to generate research articles and scores each generated article against a reference article.

## Environment Setup

### Setup Steps

1. Create and activate a virtual environment (recommended):

   ```bash
   # Using venv (built into Python)
   python -m venv env
   source env/bin/activate  # On Linux/Mac
   # Or
   .\env\Scripts\activate   # On Windows
   
   # Or using conda
   conda create -n deepresearch python=3.9
   conda activate deepresearch
   ```

2. Install dependencies:

   ```bash
   cd deep_research_bench/supplementary_materials
   pip install -r requirements.txt
   ```

3. Configure the API key:

   Before running, set your API key in `utils/api.py`, as described in step 2 of "Running the Main Benchmark".

## Project Structure

Main directories and files:

```
supplementary_materials/
├── data/
│   ├── criteria_data/      # Evaluation criteria data
│   ├── prompt_data/        # Test prompts
│   └── test_data/          # Test data
│       ├── cleaned_data/   # Cleaned article data
│       └── raw_data/       # Raw article data
├── prompt/                 # Prompt templates
├── utils/                  # Utility functions
├── ablation_study/         # Ablation experiment files
├── other_materials/        # Additional resources
│   ├── interface_screenshot_*.png  # UI screenshots of the evaluation interface
│   └── labeler_instruction.md      # Instructions for human labelers/annotators
├── deepresearch_bench.py   # Main benchmark script
└── run_benchmark.sh        # Shell script to run main benchmark
```

## Running the Main Benchmark

The main benchmark evaluates the quality of research articles generated by models, comparing them with reference articles.

### Steps:

1. Ensure your model outputs are in the correct location:
   ```
   supplementary_materials/data/test_data/raw_data/<model_name>.jsonl
   ```
   
   The file should contain your model's responses to the prompts in `supplementary_materials/data/prompt_data/query.jsonl`.

2. Configure the API key:
   In `utils/api.py`, set your API key (a Gemini model is used by default):
   ```python
   API_KEY = "[Your API Key]"
   ```

3. **Specify the model(s) to evaluate**:
   Edit the `TARGET_MODELS` array in `run_benchmark.sh` to include your model name(s):
   ```bash
   # Target model name list
   TARGET_MODELS=("your-model-name" "another-model-name")
   ```

4. Run the benchmark:
   ```bash
   cd supplementary_materials
   bash run_benchmark.sh
   ```

   You can also run the Python script directly for a single model:
   ```bash
   python deepresearch_bench.py "your-model-name"
   ```
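
Before running the benchmark, it can help to confirm that the model output file covers every prompt in `query.jsonl`. The following is a minimal sketch, not part of the benchmark code. It assumes each JSONL record carries its prompt ID under a field named `id`; that field name is an assumption, so adjust `id_field` to match your data.

```python
import json
from pathlib import Path


def load_ids(path, id_field="id"):
    """Return the set of prompt IDs found in a JSONL file (one JSON object per line).

    The default field name "id" is an assumption, not confirmed by the README.
    """
    ids = set()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                ids.add(json.loads(line)[id_field])
    return ids


def missing_prompt_ids(query_path, output_path, id_field="id"):
    """Return the prompt IDs present in the query file but absent from the model output."""
    missing = load_ids(query_path, id_field) - load_ids(output_path, id_field)
    return sorted(missing)
```

Calling `missing_prompt_ids("data/prompt_data/query.jsonl", "data/test_data/raw_data/your-model-name.jsonl")` from `supplementary_materials/` returns an empty list when every prompt has a response.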

### Optional Parameters:

In `run_benchmark.sh`, you can modify the following parameters:
- `TARGET_MODELS`: Array of model names to evaluate
- `LIMIT`: Maximum number of prompts to process (useful for testing)
- `SKIP_CLEANING`: Skip the article cleaning step
- `ONLY_ZH`: Process only the Chinese data
- `ONLY_EN`: Process only the English data

Example of modifying parameters in `run_benchmark.sh`:
```bash
# Uncomment and modify these lines to use the parameters
TARGET_MODELS=("your-model-name")
LIMIT="--limit 10"
# SKIP_CLEANING="--skip_cleaning"
# ONLY_ZH="--only_zh"
```

Or when running the Python script directly:
```bash
python deepresearch_bench.py "your-model-name" --limit 10 --only_zh
```
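
If you prefer to drive several runs from Python rather than editing `run_benchmark.sh`, the command line shown above can be assembled programmatically. This is a sketch under stated assumptions: `build_command` and `run_models` are hypothetical helper names, and only the flags that appear in this README (`--limit`, `--skip_cleaning`, `--only_zh`) are used.

```python
import subprocess
import sys


def build_command(model_name, limit=None, skip_cleaning=False, only_zh=False):
    """Build the argument list for one run of deepresearch_bench.py.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    cmd = [sys.executable, "deepresearch_bench.py", model_name]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if skip_cleaning:
        cmd.append("--skip_cleaning")
    if only_zh:
        cmd.append("--only_zh")
    return cmd


def run_models(model_names, **options):
    """Run the benchmark once per model, stopping at the first failing run."""
    for name in model_names:
        subprocess.run(build_command(name, **options), check=True)
```

For example, `run_models(["your-model-name"], limit=10, only_zh=True)` issues the same command as the direct invocation above. Run it from the `supplementary_materials` directory so that the script path resolves.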

## Running Ablation Experiments

Ablation experiments study how different evaluation components affect the final scores.

### Steps:

1. Ensure model outputs exist:
   ```
   supplementary_materials/data/test_data/raw_data/<model_name>.jsonl
   ```

2. Navigate to the ablation_study directory:
   ```bash
   cd supplementary_materials/ablation_study
   ```

3. Run ablation experiments:
   ```bash
   bash run_ablation.sh
   ```

### Experiment Settings:

`run_ablation.sh` defines several ablation experiment settings:
- `Baseline`: Full configuration (dynamic criteria, reference comparison, weights)
- `No_Weights`: Scoring without weights (dynamic criteria, reference comparison)
- `Pointwise`: Point-wise scoring (dynamic criteria, weights)
- `Static_Criteria_Merged`: Static merged criteria (reference comparison, weights)
- `Vanilla_Prompt`: Simple prompt
- Other combinations of these settings

You can customize experiments by modifying the `TARGET_MODELS` and `SETTINGS` arrays in `run_ablation.sh`.
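
One way to read the settings list is as a set of on/off switches for three components. The sketch below encodes that reading as data. It is an interpretation of the descriptions above, not the structure used inside `run_ablation.sh`; the class and field names are hypothetical, and `Vanilla_Prompt` and the other combinations are left out because their components are not spelled out here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AblationSetting:
    """Hypothetical representation of which evaluation components are enabled."""
    dynamic_criteria: bool
    reference_comparison: bool
    use_weights: bool


# Component flags inferred from the parenthesized descriptions in the list above.
SETTINGS = {
    "Baseline": AblationSetting(True, True, True),
    "No_Weights": AblationSetting(True, True, False),
    "Pointwise": AblationSetting(True, False, True),
    "Static_Criteria_Merged": AblationSetting(False, True, True),
}


def removed_components(name):
    """List the components that a setting disables relative to Baseline."""
    base = asdict(SETTINGS["Baseline"])
    other = asdict(SETTINGS[name])
    return [key for key in base if base[key] and not other[key]]
```

Under this reading, each named setting differs from `Baseline` in exactly one component, which is what lets a score difference be attributed to that component.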

## Using Your Own Model

To test your own model:

1. Prepare your model response data:
   - Ensure your model responds to all prompts in `data/prompt_data/query.jsonl`
   - Each response should include the prompt ID and the generated article content

2. Place your data at:
   ```
   supplementary_materials/data/test_data/raw_data/<your-model-name>.jsonl
   ```

3. Run the benchmark:
   ```bash
   python deepresearch_bench.py "your-model-name"
   ```

4. Results will be saved to:
   ```
   supplementary_materials/results/<your-model-name>.jsonl
   ```
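
For step 1, the response file is a JSONL file with one record per prompt. The sketch below writes such a file. The field names `id` and `article` are assumptions chosen for illustration, since this README only states that each record needs the prompt ID and the generated article; check an existing file in `data/test_data/raw_data/` for the exact schema.

```python
import json
from pathlib import Path


def write_responses(responses, output_path):
    """Write (prompt_id, article_text) pairs to a JSONL file, one record per line.

    The keys "id" and "article" are assumed field names, not confirmed by the README.
    """
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        for prompt_id, article in responses:
            record = {"id": prompt_id, "article": article}
            # ensure_ascii=False keeps Chinese text readable in the output file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing with `ensure_ascii=False` and UTF-8 encoding matters here because the benchmark includes Chinese prompts.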

## Outputs and Results

- Main benchmark results are saved to `supplementary_materials/results/<model_name>.jsonl`
- Ablation experiment results are saved to `supplementary_materials/ablation_study/results/<settings>/<model_name>.jsonl`
- Detailed logs are printed to the console and written to `output.log`
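
The result files are JSONL, so they can be inspected with a few lines of Python. This sketch makes no assumption about which fields the benchmark writes; it only reports how many records a file holds and which keys appear, as a starting point for your own analysis. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, ignoring blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the record count and the sorted union of keys across all records."""
    fields = sorted({key for record in records for key in record})
    return {"count": len(records), "fields": fields}
```

For example, `summarize(load_jsonl("results/your-model-name.jsonl"))` shows the available fields before you decide how to aggregate scores.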