# DDR_Bench

**Deep Data Research Benchmark**: A benchmark for evaluating LLM agents on autonomous data exploration and analysis tasks.

## Overview

DDR_Bench provides a framework for running and evaluating LLM-based data analysis agents across three real-world domains:

| Scenario | Domain | Data Type | Task |
|----------|--------|-----------|------|
| **MIMIC** | Healthcare | MIMIC-IV patient records | Extract clinical insights from patient notes |
| **10-K** | Finance | SEC 10-K filings | Analyze company financial reports |
| **GLOBEM** | Behavioral | GLOBEM wellness data | Discover patterns in user behavioral data |

## Installation

### Requirements

- Python 3.10+
- Required packages:

```bash
pip install pyyaml aiohttp google-genai openai mcp
```

### Setup

1. Copy the configuration template and edit the copy:
```bash
cp config.yaml config.local.yaml
# Edit config.local.yaml with your paths
```

2. Set up API keys as environment variables:
```bash
export GEMINI_API_KEY="your-gemini-api-key"
export AZURE_OPENAI_API_KEY="your-azure-openai-key"
export AZURE_OPENAI_ENDPOINT="your-azure-endpoint"
# Or other provider keys as needed
```

## Usage

### Running the Agent

Use `run_agent.py` to run data analysis agents. Since data paths are pre-configured in `config.yaml`, you typically only need to specify the scenario and model:

#### MIMIC Scenario (Patient Analysis)

```bash
python run_agent.py --scenario mimic \
    --provider gemini \
    --model gemini-2.5-flash
```

#### 10-K Scenario (Company Analysis)

```bash
python run_agent.py --scenario 10k \
    --provider vllm \
    --model Qwen/Qwen2.5-7B-Instruct \
    --vllm-port 8000
```

#### GLOBEM Scenario (User Analysis)

```bash
python run_agent.py --scenario globem \
    --provider gemini \
    --model gemini-2.5-flash
```

### Agent Options

| Option | Description |
|--------|-------------|
| `--scenario` | Required. One of: `mimic`, `10k`, `globem` |
| `--db-path` | Path to SQLite database (mimic, 10k) |
| `--data-path` | Path to data directory (globem) |
| `--input` | Path to input JSON file (mimic patient notes) |
| `--provider` | LLM provider: `gemini`, `vllm`, `minimax`, `openai` |
| `--model` | Model name to use |
| `--vllm-port` | Port of the vLLM server (`vllm` provider) |
| `--log-dir` | Output directory for results |
| `--target-ids` | Comma-separated IDs to process |
| `--overwrite` | Overwrite existing results |
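
When launching many runs from a script, it can help to assemble the command from the options in the table above rather than concatenating strings by hand. The following is a minimal sketch, not part of DDR_Bench: `build_agent_command` is a hypothetical helper, the target IDs are made-up placeholders, and `--overwrite` is assumed to be a boolean switch that takes no value.

```python
import shlex

VALID_SCENARIOS = {"mimic", "10k", "globem"}


def build_agent_command(scenario, provider, model, target_ids=None, overwrite=False):
    """Assemble an argv list for run_agent.py from the documented options."""
    if scenario not in VALID_SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    cmd = ["python", "run_agent.py",
           "--scenario", scenario,
           "--provider", provider,
           "--model", model]
    if target_ids:
        # --target-ids is documented as a single comma-separated string.
        cmd += ["--target-ids", ",".join(str(i) for i in target_ids)]
    if overwrite:
        # Assumption: --overwrite is a flag without an argument.
        cmd.append("--overwrite")
    return cmd


# Placeholder IDs, for illustration only.
cmd = build_agent_command("mimic", "gemini", "gemini-2.5-flash",
                          target_ids=[101, 102], overwrite=True)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with model names such as `Qwen/Qwen2.5-7B-Instruct`.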

### Running Evaluation

Use `run_evaluation.py` to evaluate agent results with an LLM-as-judge. As with the agent runner, evaluation paths are read from `config.yaml`:

```bash
# Evaluate MIMIC results
python run_evaluation.py --scenario mimic --log-dir ./logs/mimic

# Evaluate 10-K results
python run_evaluation.py --scenario 10k --log-dir ./logs/10k

# Evaluate GLOBEM results
python run_evaluation.py --scenario globem --log-dir ./logs/globem
```

### Evaluation Options

| Option | Description |
|--------|-------------|
| `--scenario` | Required. One of: `mimic`, `10k`, `globem` |
| `--qa-file` | Path to QA ground truth file |
| `--log-dir` | Path to agent output logs |
| `--output` | Output path for results JSON |
| `--provider` | LLM provider for judge: `azure`, `openai`, `vllm` |
| `--model` | Model for LLM-as-judge (default: `gpt-5-mini`) |
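
To evaluate all three scenarios in one pass, the per-scenario commands can be generated in a loop. This is a rough sketch under two assumptions: logs live under `./logs/<scenario>` as in the examples above, and `evaluation_commands` is a hypothetical helper, not something shipped with the benchmark.

```python
import shlex

SCENARIOS = ("mimic", "10k", "globem")


def evaluation_commands(log_root="./logs", provider=None, model=None):
    """Return one run_evaluation.py argv list per scenario."""
    commands = []
    for scenario in SCENARIOS:
        cmd = ["python", "run_evaluation.py",
               "--scenario", scenario,
               "--log-dir", f"{log_root}/{scenario}"]
        # Judge provider and model are only passed when given explicitly,
        # so the script's own defaults apply otherwise.
        if provider:
            cmd += ["--provider", provider]
        if model:
            cmd += ["--model", model]
        commands.append(cmd)
    return commands


for cmd in evaluation_commands():
    print(shlex.join(cmd))
```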

## Configuration

Configuration is managed via `config.yaml`:

```yaml
# Provider configuration
provider:
  default_provider: "gemini"
  default_model: "gemini-2.5-flash"
  vllm_base_url: "http://localhost:8000"
  vllm_port: 8000

# Agent configuration
agent:
  max_turns: 100
  max_retries: 2

# Scenario-specific paths
scenarios:
  mimic:
    db_path: "/path/to/external/mimic_iv.db"
    id_file: "./data/mimic/entity_ids.json"
    qa_file: "./data/mimic/qa.json"
    log_dir: "./logs/mimic"
  
  10k:
    db_path: "/path/to/external/10k.db"
    id_file: "./data/10k/entity_ids.json"
    qa_file: "./data/10k/qa.json"
    log_dir: "./logs/10k"
  
  globem:
    data_path: "/path/to/external/globem/data"
    id_file: "./data/globem/entity_ids.json"
    qa_file: "./data/globem/qa.json"
    log_dir: "./logs/globem"
```
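
Before a long run, it may be worth checking that a scenario section contains the paths it needs. The sketch below works on an already-parsed mapping; `scenario_settings` and `REQUIRED_KEYS` are illustrative names, and treating every key shown above as required is an assumption, since the actual loader in `config.py` may apply defaults.

```python
# Keys taken from the example config above; assumed (not confirmed) to be required.
REQUIRED_KEYS = {
    "mimic": ("db_path", "id_file", "qa_file", "log_dir"),
    "10k": ("db_path", "id_file", "qa_file", "log_dir"),
    "globem": ("data_path", "id_file", "qa_file", "log_dir"),
}


def scenario_settings(config, scenario):
    """Return the settings mapping for one scenario, checking expected keys."""
    try:
        settings = config["scenarios"][scenario]
    except KeyError:
        raise KeyError(f"no 'scenarios.{scenario}' section in config") from None
    missing = [k for k in REQUIRED_KEYS.get(scenario, ()) if k not in settings]
    if missing:
        raise ValueError(f"scenarios.{scenario} is missing: {', '.join(missing)}")
    return settings


# Parsed form of the "globem" section shown above.
config = {"scenarios": {"globem": {
    "data_path": "/path/to/external/globem/data",
    "id_file": "./data/globem/entity_ids.json",
    "qa_file": "./data/globem/qa.json",
    "log_dir": "./logs/globem",
}}}
print(scenario_settings(config, "globem")["log_dir"])
```

With PyYAML (listed under Requirements), the `config` mapping would typically come from `yaml.safe_load` applied to the opened configuration file.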

### Environment Variables

| Variable | Description |
|----------|-------------|
| `GEMINI_API_KEY` | Google Gemini API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL |
| `MINIMAX_API_KEY` | MiniMax API key |
| `VLLM_BASE_URL` | vLLM server URL |
| `VLLM_PORT` | vLLM server port |

## Project Structure

```
DDR_Bench/
├── agent/                  # Agent implementation
│   ├── data_agent.py       # Main autonomous agent
│   ├── llm_providers.py    # LLM provider router
│   ├── models.py           # Data models
│   ├── prompt_manager.py   # Prompt templates
│   └── providers/          # LLM provider implementations
│       ├── base.py         # Base provider class
│       ├── gemini_provider.py
│       ├── vllm_provider.py
│       ├── openai_provider.py
│       └── minimax_provider.py
├── tool_server/            # MCP tool servers
│   ├── base_server.py      # Base MCP server
│   ├── sqlite_mcp.py       # SQL database tools
│   ├── code_mcp.py         # Code execution tools
│   └── client.py           # MCP client
├── run_agent.py            # Unified agent runner
├── run_evaluation.py       # Unified evaluation script
├── config.py               # Configuration module
├── config.yaml             # Configuration template
├── base_batch_analyzer.py  # Batch analysis base class
├── evaluate_*.py           # Scenario-specific evaluators
└── README.md
```

## Supported LLM Providers

| Provider | Description | Environment Variable |
|----------|-------------|---------------------|
| **Gemini** | Google Gemini API | `GEMINI_API_KEY` |
| **vLLM** | Local vLLM server | `VLLM_BASE_URL`, `VLLM_PORT` |
| **OpenAI/Azure** | OpenAI or Azure OpenAI | `OPENAI_API_KEY`, or `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` |
| **MiniMax** | MiniMax API | `MINIMAX_API_KEY` |
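
A quick pre-flight check can catch a missing key before any agent turn is spent. The following sketch is not part of the repository: `PROVIDER_ENV` and `missing_env` are hypothetical names, the mapping is read off the tables in this README, and the vLLM variables may in practice be optional because `config.yaml` also carries `vllm_base_url` and `vllm_port`.

```python
import os

# Assumed provider-to-variable mapping, based on the tables in this README.
PROVIDER_ENV = {
    "gemini": ("GEMINI_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
    "azure": ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"),
    "minimax": ("MINIMAX_API_KEY",),
    "vllm": ("VLLM_BASE_URL", "VLLM_PORT"),
}


def missing_env(provider, environ=None):
    """List the variables for `provider` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in PROVIDER_ENV[provider] if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env("azure", {"AZURE_OPENAI_API_KEY": "your-azure-openai-key"}))
# → ['AZURE_OPENAI_ENDPOINT']
```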