# Kompete-bench

Kompete-bench is a new benchmark for evaluating multi-agent AutoML systems.

## ⚙ Methodology

Kompete-bench consists of two parts, 26 competitions in total, chosen to balance established tasks against recent ones.

1.  **A curated set of existing competitions:** This part includes 15 competitions from the MLE-Bench "lite" set that still accept late submissions on Kaggle. The size of individual datasets does not exceed 1 GB, and the total volume of this collection is 5.3 GB, providing a stable basis for comparison.
2.  **Inclusion of new competitions:** To reflect the changing landscape of AutoML tasks, 11 new competitions from 2024 and 2025 have been added (labelled `Contemporary` in the table below). The total size of these datasets is 4.9 GB. They were selected to allow a fair comparison with humans, since both the human participants and modern models had access to the same tools and libraries.

### Summary of Tasks and Metrics in Kompete-bench
| Name | Number of participants | Metric | Bronze | Silver | Gold | Part |
|------|-------------------------|--------|--------|--------|------|------|
| aerial-cactus-identification | 1221 | ROC-AUC ↑ | 1 | 1 | 1 | MLE-bench (Lite) |
| denoising-dirty-documents | 162 | RMSE ↓ | 0.04517 | 0.02609 | 0.01794 | MLE-bench (Lite) |
| dog-breed-identification | 1281 | log loss ↓ | 0.04598 | 0.00539 | 0.0005 | MLE-bench (Lite) |
| dogs-vs-cats-redux-kernels-edition | 1315 | log loss ↓ | 0.06127 | 0.05038 | 0.03882 | MLE-bench (Lite) |
| jigsaw-toxic-comment-classification-challenge | 4539 | mean col-wise ROC AUC ↑ | 0.98639 | 0.98668 | 0.98740 | MLE-bench (Lite) |
| leaf-classification | 1596 | log loss ↓ | 0.01526 | 0.00791 | 0.00000 | MLE-bench (Lite) |
| mlsp-2013-birds | 81 | ROC-AUC ↑ | 0.87372 | 0.90038 | 0.93527 | MLE-bench (Lite) |
| nomad2018-predict-transparent-conductors | 879 | RMSLE ↓ | 0.06582 | 0.06229 | 0.05589 | MLE-bench (Lite) |
| plant-pathology-2020-fgvc7 | 1318 | ROC-AUC ↑ | 0.97361 | 0.97465 | 0.97836 | MLE-bench (Lite) |
| random-acts-of-pizza | 462 | ROC-AUC ↑ | 0.6921 | 0.76482 | 0.97908 | MLE-bench (Lite) |
| spooky-author-identification | 1242 | log loss ↓ | 0.29381 | 0.26996 | 0.16506 | MLE-bench (Lite) |
| tabular-playground-series-dec-2021 | 1189 | ROC-AUC ↑ | 0.95658 | 0.95658 | 0.9566 | MLE-bench (Lite) |
| tabular-playground-series-may-2022 | 1152 | ROC-AUC ↑ | 0.99818 | 0.99822 | 0.99823 | MLE-bench (Lite) |
| text-normalization-challenge-english-language | 261 | accuracy ↑ | 0.99038 | 0.99135 | 0.99724 | MLE-bench (Lite) |
| text-normalization-challenge-russian-language | 163 | accuracy ↑ | 0.97592 | 0.98232 | 0.99012 | MLE-bench (Lite) |
| eedi-mining-misconceptions-in-mathematics | 1449 | MAP@25 ↑ | 0.46090 | 0.49136 | 0.56429 | Contemporary |
| learning-agency-lab-automated-essay-scoring-2 | 2708 | quadratic weighted kappa ↑ | 0.83471 | 0.83518 | 0.83583 | Contemporary |
| lmsys-chatbot-arena | 1688 | log loss ↓ | 1.00472 | 0.99410 | 0.98392 | Contemporary |
| pii-detection-removal-from-educational-data | 2049 | efficiency score ↑ | 0.95714 | 0.95883 | 0.96615 | Contemporary |
| um-game-playing-strength-of-mcts-variants | 1610 | RMSE ↓ | 0.43050 | 0.42973 | 0.42591 | Contemporary |
| llm-prompt-recovery | 2176 | Sharpened Cosine Similarity ↑ | 0.6375 | 0.6513 | 0.6848 | Contemporary |
| equity-post-HCT-survival-predictions | 3327 | C-index ↑ | 0.69288 | 0.69320 | 0.69500 | Contemporary |
| cmi-detect-behavior-with-sensor-data | 2156 | F1 ↑ | 0.84 | 0.84 | 0.86 | Contemporary |
| make-data-count-finding-data-references | 833 | F1 ↑ | 0.548 | 0.564 | 0.620 | Contemporary |
| neurips-open-polymer-prediction-2025 | 1539 | wMAE ↓ | 0.057 | 0.041 | 0.032 | Contemporary |
| wsdm-cup-multilingual-chatbot-arena | 890 | categorization accuracy ↑ | 0.696381 | 0.702772 | 0.711412 | Contemporary |
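
To make the table easier to read, here is a minimal sketch of how a score could be mapped to a medal using the Bronze, Silver, and Gold columns. The function name `medal_for` is hypothetical, and the sketch assumes that the thresholds are inclusive and that the arrow in the Metric column gives the direction (↑ higher is better, ↓ lower is better); it is not taken from the benchmark's own code.

```python
from typing import Optional


def medal_for(score: float, bronze: float, silver: float, gold: float,
              higher_is_better: bool) -> Optional[str]:
    """Return the best medal whose threshold `score` meets, or None.

    Assumption: thresholds are inclusive (a score equal to the
    threshold earns the medal).
    """
    def meets(threshold: float) -> bool:
        if higher_is_better:
            return score >= threshold
        return score <= threshold

    # Check the most demanding threshold first.
    for name, threshold in (("gold", gold), ("silver", silver), ("bronze", bronze)):
        if meets(threshold):
            return name
    return None


# mlsp-2013-birds: ROC-AUC, higher is better.
print(medal_for(0.91, 0.87372, 0.90038, 0.93527, higher_is_better=True))   # silver
# denoising-dirty-documents: RMSE, lower is better.
print(medal_for(0.05, 0.04517, 0.02609, 0.01794, higher_is_better=False))  # None
```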


## 📊 Evaluation Metrics
Kompete-bench evaluates performance using real Kaggle leaderboards. It reports **percent humans beaten**: the proportion of participants that a system outperforms in a given competition. Formally, this metric is defined as:

$$
\text{PercentHumansBeaten} = \frac{N - R}{N}
$$

where:

* `N` is the total number of human participants in the competition,
* `R` is the agent’s rank on the leaderboard (lower is better).

Assuming `R` runs from 1 (first place) to `N` (last place), the resulting value ranges from 0.0 up to (N - 1)/N, which is just below 1.0. Higher values indicate stronger relative performance.
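
The formula can be sketched in a few lines of Python. The function name `percent_humans_beaten` is hypothetical, and the range check assumes that the rank lies between 1 and `N`, which the text above does not state explicitly.

```python
def percent_humans_beaten(n_participants: int, rank: int) -> float:
    """Compute (N - R) / N for an agent ranked `rank` among `n_participants`.

    Assumption: 1 <= rank <= n_participants.
    """
    if n_participants <= 0:
        raise ValueError("n_participants must be positive")
    if not 1 <= rank <= n_participants:
        raise ValueError("rank must be between 1 and n_participants")
    return (n_participants - rank) / n_participants


# denoising-dirty-documents has 162 participants; rank 81 is mid-table.
print(percent_humans_beaten(162, 81))  # 0.5
```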

## Setup
To get started with Kompete-bench, follow these steps:

1. **Install required packages**

   ```bash
   pip install pandas kaggle pyarrow nbformat
   ```

2. **Configure Kaggle API access**

   * Download your `kaggle.json` API key from [your Kaggle account settings](https://www.kaggle.com/account).
   * Place the file in the correct location (e.g., `~/.kaggle/kaggle.json` on Unix-based systems or `C:\Users\<username>\.kaggle\kaggle.json` on Windows).
   * Ensure proper permissions (especially on Unix):

     ```bash
     chmod 600 ~/.kaggle/kaggle.json
     ```

3. **Accept competition rules**

   You must manually accept the terms of use for each competition on Kaggle before downloading datasets. Visit the competition page and click "Join Competition" or "Accept Rules".

4. **Prepare the benchmark datasets**

   Run the following script to automatically download and prepare all necessary data:

   ```bash
   python download_and_prepare_benchmark.py
   ```
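
Before running the download, it can help to confirm that the credentials from step 2 are in place. The following is a minimal sketch using only the standard library; the function name `check_kaggle_credentials` is hypothetical, and the default path assumes the standard `~/.kaggle/kaggle.json` location mentioned above.

```python
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path: Path = Path.home() / ".kaggle" / "kaggle.json") -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not path.is_file():
        problems.append(f"{path} not found")
        return problems
    # Permission bits are only meaningful on Unix-like systems.
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:  # any access granted to group or others
            problems.append(f"{path} has mode {oct(mode)}; run: chmod 600 {path}")
    return problems


for problem in check_kaggle_credentials():
    print(problem)
```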


### Downloading Competitions

Download every competition and its data with the same script used in step 4 of Setup:

```bash
python download_and_prepare_benchmark.py
```

This command automatically downloads all competition data and preprocesses it into a format that is convenient for agents. A file named **`overview.txt`**, containing a description of the competition and its data, is saved in each competition's directory.
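
As an illustration of how an agent harness might pick up these descriptions, here is a minimal sketch that collects every `overview.txt` under a common root. The function name `load_overviews` and the layout (one subdirectory per competition directly under a single root directory) are assumptions; check the directory that the download script actually creates.

```python
from pathlib import Path


def load_overviews(root: Path) -> dict:
    """Map each competition directory name to the text of its overview.txt.

    Assumption: the layout is <root>/<competition_name>/overview.txt.
    """
    overviews = {}
    for path in sorted(root.glob("*/overview.txt")):
        overviews[path.parent.name] = path.read_text(encoding="utf-8")
    return overviews
```

Calling `load_overviews(Path("path/to/competitions"))` with the real download directory would return one entry per prepared competition.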

-----

### Submitting Solutions

Solutions can be submitted in two ways: as a CSV file of predictions, or as a Python script (`.py`) that is converted to a Jupyter Notebook and executed on Kaggle.

To submit your solutions, place the relevant files (`.csv` for CSV submissions, `.py` for notebook submissions) in the **`kaggle_submissions`** directory. Then, run the following command:

```bash
python kaggle_submissions.py ./kaggle_submissions your_kaggle_username
```

This script will automatically do the following:

  - **For CSV submissions:** It uploads the files to Kaggle and retrieves the score.
  - **For Notebook submissions:** It generates a notebook from your `.py` file. The only change required in your agent's code is to replace the data path with `/kaggle/working/{competition_name}`. Make no other changes to the agent's code.

After the notebook is executed on Kaggle, you will be prompted to enter the final metric in the terminal. Once all submissions are complete, a **`competition_report.json`** file will be created, detailing the metrics achieved for each competition.
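
To make the notebook route more concrete, here is a rough sketch of what wrapping a `.py` file into a single-cell notebook with the Kaggle data path could look like. This is not the implementation in `kaggle_submissions.py`; the function name `py_to_notebook`, the plain string replacement of the data path, and the single-cell layout are all assumptions made for illustration.

```python
import json


def py_to_notebook(source: str, local_data_path: str, competition_name: str) -> str:
    """Return notebook JSON holding `source` in one code cell.

    Assumption: the agent's code refers to its data through one literal
    path, which is swapped for /kaggle/working/{competition_name}.
    """
    kaggle_path = f"/kaggle/working/{competition_name}"
    patched = source.replace(local_data_path, kaggle_path)
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": patched,
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
    return json.dumps(notebook, indent=1)


agent_code = 'import pandas as pd\ntrain = pd.read_csv("./data/train.csv")\n'
notebook_json = py_to_notebook(agent_code, "./data", "leaf-classification")
print(json.loads(notebook_json)["cells"][0]["source"])
```

Keeping the data path in a single variable in the agent's code makes this kind of substitution a one-line change.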
