# MCPMark

> For submission, we removed external links, citation references, CHANGELOGs, the Contributing guide, some scripts, and bundled documentation to avoid revealing contributor information. For example, `./src/aggregators` was removed because it contains our automated log-pushing logic and serves only as a parser. We also anonymized some links in our tasks using `xxx`. Initial states are downloaded from a CDN; those URLs are likewise anonymized with `xxx` in our code, as is the Docker image name.

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.

## What you can do with MCPMark

- **Evaluate real tool usage** across multiple MCP services: `Notion`, `GitHub`, `Filesystem`, `Postgres`, `Playwright`.
- **Use ready-to-run tasks** covering practical workflows, each with strict automated verification.
- **Reliable and reproducible**: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- **Unified metrics and aggregation**: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- **Flexible deployment**: local or Docker; fully validated on macOS and Linux.

---

## Quickstart (5 minutes)

Before starting, make sure you already have a local copy of this repository.

### Configure environment variables (create `.mcp_env` at repo root)
Only set what you need. Add service credentials when running tasks for that service.

```env
# Example: OpenAI
OPENAI_BASE_URL="<your-openai-base-url>"
OPENAI_API_KEY="<your-openai-api-key>"

# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium"   # chromium | firefox
PLAYWRIGHT_HEADLESS="True"

# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2"   # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"

# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
```
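Before launching a run, it can help to check which service credentials are actually present. The sketch below is not part of MCPMark: `parse_env_text`, `missing_vars`, and the `REQUIRED_VARS` grouping are hypothetical helpers that simply mirror the variables in the example above, and the parser only handles the simple `KEY="value"` form shown there.

```python
from pathlib import Path

# Hypothetical grouping that mirrors the example .mcp_env above;
# MCPMark itself may require a different set of variables per service.
REQUIRED_VARS = {
    "notion": ["SOURCE_NOTION_API_KEY", "EVAL_NOTION_API_KEY", "EVAL_PARENT_PAGE_TITLE"],
    "github": ["GITHUB_TOKENS", "GITHUB_EVAL_ORG"],
    "postgres": ["POSTGRES_HOST", "POSTGRES_PORT", "POSTGRES_USERNAME", "POSTGRES_PASSWORD"],
}


def parse_env_text(text: str) -> dict:
    """Parse simple KEY="value" lines, skipping blanks and comment lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rest = line.partition("=")
        rest = rest.strip()
        if rest[:1] in ('"', "'"):
            # Quoted value: take everything up to the matching closing quote.
            end = rest.find(rest[0], 1)
            value = rest[1:end] if end != -1 else rest[1:]
        else:
            # Unquoted value: drop any trailing inline comment.
            value = rest.split("#", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_vars(env: dict, service: str) -> list:
    """Return the variables for `service` that are unset or empty."""
    return [name for name in REQUIRED_VARS.get(service, []) if not env.get(name)]


if __name__ == "__main__":
    env_file = Path(".mcp_env")
    env = parse_env_text(env_file.read_text()) if env_file.exists() else {}
    for service in REQUIRED_VARS:
        print(service, "missing:", missing_vars(env, service))
```

An empty list for a service means every variable in that hypothetical group is set.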

The previously bundled docs (e.g., `docs/introduction.md`) were removed for submission; refer to an earlier checkout if available.

### Install and run a minimal example

Local (Recommended)
```bash
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
```

Docker
```bash
./build-docker.sh
```

Run a filesystem task (no external accounts required):
```bash
# --k 1 runs each task once for a quick start;
# --models accepts any model you have configured.
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification

Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
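To see which runs exist for an experiment, you can walk that directory layout yourself. The following is a minimal sketch that assumes only the path pattern stated above; `list_run_dirs` is a hypothetical helper, not an MCPMark API, and it makes no assumptions about the files inside each run directory.

```python
from pathlib import Path


def list_run_dirs(results_root: Path, exp_name: str) -> dict:
    """Map each `{model}__{mcp}` directory to its run directory names.

    Names are sorted lexicographically, so `run-10` sorts before `run-2`.
    """
    exp_dir = results_root / exp_name
    runs = {}
    if not exp_dir.is_dir():
        return runs
    for combo in sorted(p for p in exp_dir.iterdir() if p.is_dir() and "__" in p.name):
        runs[combo.name] = sorted(r.name for r in combo.glob("run-*") if r.is_dir())
    return runs


if __name__ == "__main__":
    for combo, run_names in list_run_dirs(Path("./results"), "test-run").items():
        print(combo, run_names)
```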

---

## Run your evaluations

### Single run (k=1)
```bash
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1

# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1

# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1

# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
```

### Multiple runs (k>1) for pass@k
```bash
# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4

# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
```

### Run with Docker
```bash
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all

# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
```

Docs have been removed for submission, so the latest supported model list is not included in this package.

Tip: MCPMark supports **auto-resume**. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see [RETRYABLE_PATTERNS](src/errors.py)) are retried automatically. Models may emit different error strings. If you encounter a new resumable error, please open a PR or issue.
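As a rough illustration of pattern-based retry classification, the sketch below checks an error message against a list of substrings. The entries in `EXAMPLE_PATTERNS` are made up for illustration, and case-insensitive substring matching is an assumption; the actual `RETRYABLE_PATTERNS` in `src/errors.py` may contain different entries and may be matched differently.

```python
# Hypothetical patterns for illustration only; see src/errors.py for the real list.
EXAMPLE_PATTERNS = ["rate limit", "timeout", "connection reset"]


def is_retryable(error_message: str, patterns=EXAMPLE_PATTERNS) -> bool:
    """Return True if any pattern occurs in the message, ignoring case."""
    lowered = error_message.lower()
    return any(pattern.lower() in lowered for pattern in patterns)
```

Under this scheme, a message that matches no pattern is treated as a final failure, which is why unseen error strings from new models need to be added to the list.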

---

## Service setup and authentication

| Service     | Setup summary                                                                                                  | Docs                                  |
|-------------|-----------------------------------------------------------------------------------------------------------------|---------------------------------------|
| Notion      | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification.     | Guide removed in submission           |
| GitHub      | Multi-account token pooling recommended; import pre-exported repo state if needed.                              | Guide removed in submission           |
| Postgres    | Start via Docker and import sample databases.                                                                   | Setup notes removed in submission     |
| Playwright  | Install browsers before first run; defaults to `chromium`.                                                      | Setup notes removed in submission     |
| Filesystem  | Zero-configuration, run directly.                                                                               | Config notes removed in submission    |
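The GitHub row recommends token pooling to spread requests across rate limits. One simple way to do this is round-robin rotation over the comma-separated `GITHUB_TOKENS` value, sketched below. Round-robin is an assumption made for illustration, and `make_token_pool` is a hypothetical helper; MCPMark may select tokens differently.

```python
import itertools


def make_token_pool(raw: str):
    """Return an endless round-robin iterator over comma-separated tokens."""
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    if not tokens:
        raise ValueError("GITHUB_TOKENS is empty")
    return itertools.cycle(tokens)


if __name__ == "__main__":
    pool = make_token_pool("token1,token2")
    # Each request takes the next token, wrapping around at the end.
    print([next(pool) for _ in range(3)])
```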

The quickstart guide (`docs/quickstart.md`) was removed for submission; keep an earlier clone if you need it.

---

## Results and metrics

- Results are organized under `./results/{exp_name}/{model}__{mcp}/run-*/` (JSON + CSV per task).
- Generate a summary with:
```bash
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp

# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
```
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
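For intuition about the multi-run metrics, here is a minimal sketch using commonly used definitions, which are assumptions here rather than a description of the aggregator's code: pass@k counts a task as solved if at least one of its k runs passes, pass^k requires all k runs to pass, and avg@k is the mean per-run success rate. The `summarize` function and its input shape are hypothetical.

```python
def summarize(outcomes: dict) -> dict:
    """Compute pass@k, pass^k, and avg@k.

    `outcomes` maps a task name to a list of k booleans, one per run.
    """
    if not outcomes:
        raise ValueError("no tasks given")
    run_counts = {len(runs) for runs in outcomes.values()}
    if len(run_counts) != 1 or 0 in run_counts:
        # Incomplete results cannot be compared fairly across tasks.
        raise ValueError("every task needs the same non-zero number of runs")
    k = run_counts.pop()
    n = len(outcomes)
    return {
        "pass@k": sum(any(runs) for runs in outcomes.values()) / n,
        "pass^k": sum(all(runs) for runs in outcomes.values()) / n,
        "avg@k": sum(sum(runs) / k for runs in outcomes.values()) / n,
    }


if __name__ == "__main__":
    print(summarize({"task_a": [True, True], "task_b": [True, False]}))
```

The gap between pass@k and pass^k gives a quick read on stability: a model that sometimes solves a task but not consistently scores high on the former and low on the latter.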

---

## Models and Tasks
- **Model support**: MCPMark calls models via LiteLLM. Consult the public LiteLLM documentation as needed. For Anthropic (Claude) extended thinking mode (enabled via `--reasoning-effort`), we use Anthropic's native API.
- Historical references to `docs/introduction.md` described supported models; that directory is intentionally omitted in the submission package.
- To add a new model, edit `src/model_config.py`. Confirm LiteLLM provider support using the public LiteLLM resources.
- Task design principles were documented in `docs/datasets/task.md`. Each task still ships with an automated `verify.py` for objective, reproducible evaluation, and the earlier `docs/task.md` file covered the rationale.

---

## License

This project is licensed under the Apache License 2.0 - see `LICENSE`.
