# Live MCP Benchmark

A comprehensive benchmark framework for evaluating AI agents' ability to use MCP (Model Context Protocol) tools to complete real-world tasks.

## Overview

This benchmark evaluates how well AI agents can:
- Understand complex, multi-step task instructions
- Select appropriate tools from a large pool of MCP tools
- Execute MCP tools in the correct order
- Produce accurate and well-formatted outputs

## Dataset

The benchmark contains **101 tasks** across three difficulty levels:
- **Easy**: Simple multi-step tasks
- **Medium**: Tasks requiring multiple tools with dependencies
- **Hard**: Tasks requiring complex reasoning and data processing

Each task includes:
- A natural language query with realistic context
- The tools required for completion
- An execution plan with the expected tool chain
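
As a rough illustration of the task structure listed above, a record could be modeled as below. This is a minimal sketch, not the benchmark's actual schema: the `Task` class, its field names, the example tool names, and the `tool_chain_matches` helper are all hypothetical, and it assumes the execution plan can be represented as an ordered list of tool names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    query: str                 # natural language query with realistic context
    required_tools: List[str]  # tools needed to complete the task
    execution_plan: List[str]  # expected tool chain, in order (assumed format)
    difficulty: str = "easy"   # "easy", "medium", or "hard"


def tool_chain_matches(task: Task, called_tools: List[str]) -> bool:
    """Return True if the expected chain appears, in order, within the calls.

    Extra calls between expected steps are tolerated; this leniency is an
    assumption of the sketch, not a documented scoring rule.
    """
    remaining = iter(called_tools)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later calls (subsequence check).
    return all(step in remaining for step in task.execution_plan)


example = Task(
    query="Look up the weather in Paris and convert it to Fahrenheit.",
    required_tools=["get_weather", "convert_units"],   # hypothetical tool names
    execution_plan=["get_weather", "convert_units"],
)

in_order = tool_chain_matches(example, ["get_weather", "search", "convert_units"])
out_of_order = tool_chain_matches(example, ["convert_units", "get_weather"])
```

Here `in_order` is `True` and `out_of_order` is `False`. A stricter check could require an exact match with the execution plan instead of a subsequence match.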
