---
license: cc-by-sa-4.0
task_categories:
- multiple-choice
- question-answering
- text-retrieval
- text-classification
language:
- en
tags:
- math
- olympiad-math
- inequality
- multi-choice
- open-ended
- fill-in-the-blank
- proofs
- natural-language-proofs
- question-answering
- arithmetic-reasoning
- algebraic-reasoning
- logical-reasoning
- math-reasoning
- multi-step-reasoning
- step-by-step-solution
pretty_name: IneqMath
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: dev
    path: parquet/dev*
  - split: test
    path: parquet/test*
  - split: train
    path: parquet/train*
    
---
<div align="center">

  <img src="./assets/huggingface.png" alt="IneqMath Logo" width="120"/>

  <h1 style="font-size: 40px; margin-bottom: 0;"><strong>IneqMath</strong></h1>

  <h2 style="font-weight: bold; margin-top: 10px;">
    A Benchmark for Informal, Verifiable Reasoning in Olympiad-Level Inequality Proofs
  </h2>

  <p>
    <a href="https://ineqmath.github.io/">🌐 Project</a> |
    <a href="https://github.com/lupantech/ineqmath">💻 GitHub (TODO)</a>
  </p>

</div>


# Introduction
IneqMath is a benchmark for evaluating large language models (LLMs) on informal but verifiable inequality proving. Centered on Olympiad-level algebraic inequalities, it challenges models not only to produce correct final answers but also to construct step-by-step solutions that apply theorems appropriately, justify symbolic transformations, and estimate tight bounds. Problems are framed in natural language and decomposed into two automatically checkable subtasks, bound estimation and relation prediction, allowing fine-grained assessment of reasoning accuracy beyond surface-level correctness.

# Dataset Overview
The table below summarizes the statistics of **IneqMath**, broken down by the bound (<em>Bnd.</em>) and relation (<em>Rel.</em>) subtasks.
<center>
  <table 
    align="center" 
    width="60%" 
    border="1" 
    cellspacing="0" 
    cellpadding="6"
    style="width:60%; table-layout: fixed; border-collapse: collapse; text-align: center;">
    <colgroup>
      <col width="64%">
      <col width="12%">
      <col width="12%">
      <col width="12%">
    </colgroup>
    <thead>
      <tr>
        <th style="text-align:left;">Statistic</th>
        <th>Number</th>
        <th>Bnd.</th>
        <th>Rel.</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td style="text-align:left;"><b>Theorem categories</b></td>
        <td>29</td>
        <td>–</td>
        <td>–</td>
      </tr>
      <tr style="border-bottom:2px solid #000;">
        <td style="text-align:left;"><b>Named theorems</b></td>
        <td>83</td>
        <td>–</td>
        <td>–</td>
      </tr>
      <tr>
        <td style="text-align:left;"><b>Training problems (for training)</b></td>
        <td>1252</td>
        <td>626</td>
        <td>626</td>
      </tr>
      <tr>
        <td style="text-align:left;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- With theorem annotations</td>
        <td>962</td>
        <td>482</td>
        <td>480</td>
      </tr>
      <tr>
        <td style="text-align:left;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- With solution annotations</td>
        <td>1252</td>
        <td>626</td>
        <td>626</td>
      </tr>
      <tr>
        <td style="text-align:left;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Avg. solutions per problem</td>
        <td>1.05</td>
        <td>1.06</td>
        <td>1.05</td>
      </tr>
      <tr style="border-bottom:2px solid #000;">
        <td style="text-align:left;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- Max solutions per problem</td>
        <td>4</td>
        <td>4</td>
        <td>4</td>
      </tr>
      <tr>
        <td style="text-align:left;"><b>Dev problems (for development)</b></td>
        <td>100</td>
        <td>50</td>
        <td>50</td>
      </tr>
      <tr>
        <td style="text-align:left;"><b>Test problems (for benchmarking)</b></td>
        <td>200</td>
        <td>96</td>
        <td>104</td>
      </tr>
    </tbody>
  </table>
</center>

The chart below shows the distribution of theorem categories.

<div align="center">

  <img src="./assets/theorem_category_pie_chart.png" alt="Distribution of theorem categories" width="520"/>

</div>

# Leaderboard

🏆 The leaderboard for **IneqMath** is available [here](https://huggingface.co/spaces/AI4Math/IneqMath-Leaderboard).

The evaluation performance of selected chat and reasoning LLMs on the **IneqMath** test set is shown below. Please see the **IneqMath** [paper](https://www.google.com) for more details.

<div align="center">

  <img src="./assets/teble_main_results.png" alt="main_results_table" width="800"/>

</div>

<p>
  In the table, <em>Bnd.</em> denotes bound problems and <em>Rel.</em> denotes relation problems. We report: (1) <em>Overall Acc</em>, which reflects the correctness of both the final answer and the intermediate steps; (2) <em>Answer Acc</em>, which measures final-answer correctness alone; and (3) <em>Step Acc</em>, which evaluates the accuracy of intermediate steps across four error categories: <em>Toy Case</em>, <em>Logical Gap</em>, <em>Numerical Approximation</em>, and <em>Numerical Calculation</em>. <span style="color:blue;">Blue superscripts ↓</span> indicate the accuracy drop (<em>Overall Acc</em> – <em>Answer Acc</em>) caused by step-wise errors. <u>Underlining</u> denotes the best result within each model category; <strong>boldface</strong> highlights the best overall performance. The default max token limit for reasoning LLMs is 10K.
</p>
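
The relationship between the three metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official evaluation code: the record layout (`answer_correct` plus one boolean per step check under `steps_ok`), the judge keys, and the helper name `summarize` are all hypothetical. A response counts toward overall accuracy only when its final answer is correct and none of the four step checks flags an error.

```python
# Hypothetical judge keys, named after the four error categories in the table.
STEP_JUDGES = (
    "toy_case",
    "logical_gap",
    "numerical_approximation",
    "numerical_calculation",
)


def summarize(judgments):
    """Return answer, per-judge step, and overall accuracy as fractions."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to summarize")

    # Final-answer correctness alone.
    answer_acc = sum(j["answer_correct"] for j in judgments) / n

    # Fraction of responses that pass each individual step check.
    step_acc = {
        name: sum(j["steps_ok"][name] for j in judgments) / n
        for name in STEP_JUDGES
    }

    # Correct answer AND every step check passed.
    overall_acc = sum(
        j["answer_correct"] and all(j["steps_ok"][name] for name in STEP_JUDGES)
        for j in judgments
    ) / n

    return {
        "answer_acc": answer_acc,
        "step_acc": step_acc,
        "overall_acc": overall_acc,
        # Corresponds to the blue superscript: Overall Acc minus Answer Acc.
        "drop": overall_acc - answer_acc,
    }


def _all_ok(**overrides):
    """Build a steps_ok dict that passes every check unless overridden."""
    return {**{name: True for name in STEP_JUDGES}, **overrides}


# Made-up judgments for four responses.
sample_judgments = [
    {"answer_correct": True, "steps_ok": _all_ok()},
    {"answer_correct": True, "steps_ok": _all_ok(logical_gap=False)},
    {"answer_correct": False, "steps_ok": _all_ok()},
    {"answer_correct": True, "steps_ok": _all_ok(toy_case=False, logical_gap=False)},
]

summary = summarize(sample_judgments)
print(summary["answer_acc"], summary["overall_acc"], summary["drop"])
```

In this made-up sample, three of four answers are correct but only one response also passes every step check, which is the kind of gap the blue superscripts report.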

# Dataset Usage
## Load dataset in Python
You can download this dataset with the following command (make sure that you have installed [Hugging Face Datasets](https://huggingface.co/docs/datasets/quickstart)):

```python
from datasets import load_dataset
dataset = load_dataset("AI4Math/IneqMath")
```

Here are some examples of how to access the downloaded dataset:

```python
# Print the first example in the training split
print(dataset["train"][0])
# Print the first example in the test split
print(dataset["test"][0])
# Print the first example in the dev split
print(dataset["dev"][0])
```
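
Each split can be iterated as a sequence of records, so simple statistics are easy to compute. The sketch below counts problems per type. The helper name `count_by_type` is hypothetical, and the `sample` list is made-up stand-in data so that the snippet runs without downloading anything; it assumes each record carries a `type` field with the value `'bound'` or `'relation'`.

```python
from collections import Counter


def count_by_type(records):
    """Count records per problem type ('bound' or 'relation')."""
    return Counter(record["type"] for record in records)


# Made-up stand-in for a split such as dataset["test"].
sample = [
    {"data_id": 0, "type": "bound"},
    {"data_id": 1, "type": "relation"},
    {"data_id": 2, "type": "relation"},
]

counts = count_by_type(sample)
print(counts["bound"], counts["relation"])
```

With the real data, `count_by_type(dataset["test"])` would be expected to report 96 bound and 104 relation problems according to the statistics table, provided the `type` values match this assumption.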

## Download the dataset in JSON format
You can also download the whole dataset in JSON format by running the following command:
```shell
wget https://huggingface.co/datasets/AI4Math/IneqMath/resolve/main/json/all.tar.gz
```
Then uncompress the file:
```shell
tar -zxvf all.tar.gz
cd json
```

The file structure of the uncompressed archive is as follows:
<details>
<summary>
Click to expand the file structure
</summary>

```
json
├── train.json # Train set
├── test.json # Test set
├── dev.json # Dev set
└── theorems.json # Theorems set

```

</details>
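
The extracted files can be read with Python's standard `json` module. The following is a minimal sketch: the helper name `load_split` is hypothetical, and the top-level layout of each split file is an assumption (a JSON array of records, with a fallback for an object keyed by ID), so adjust it after inspecting the actual files.

```python
import json
from pathlib import Path


def load_split(json_dir, split):
    """Load one split file (e.g. 'train', 'test', 'dev') as a list of records."""
    path = Path(json_dir) / f"{split}.json"
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the top level is a list of records. If it turns out to be
    # an object keyed by ID, fall back to its values.
    if isinstance(data, dict):
        return list(data.values())
    return list(data)
```

For example, `load_split("json", "dev")` would read `json/dev.json` after the archive has been extracted in the current directory.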

## Data Format
The dataset is provided in JSON format and contains the following attributes:

```json
{
    "data_id": [integer] The ID of the example within its split,
    "problem": [string] The question text,
    "type": [string] The type of question: 'relation' or 'bound',
    "data_split": [string] Data split: 'train', 'test' or 'dev',
    "answer": [string] The correct answer to the problem,
    "solution": [string] Step-by-step solution to the problem,
    "theorems": [dictionary] A dictionary of manually annotated theorems that are relevant or expected to be used in solving the problem. Each entry maps a theorem ID to its details, as shown below:
        Theorem_id: [string] The ID of the theorem. For example, 'Theorem_1' is the ID of the first theorem
            {
            "Nickname": [list] A list of nicknames of the theorem,
            "Theorem": [string] The content of the theorem,
            "Theorem_Category": [string] The category of the theorem
            },
    "choices": [list] A list of choices for the multiple-choice relation problem. If the problem type is 'bound', choices is null.
}
```
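
As a rough illustration of the format, the sketch below checks a record against the field descriptions above. The helper name `check_record` is hypothetical, and the sample record is made up for this example rather than taken from the dataset; the real answer and problem phrasing may differ.

```python
def check_record(record):
    """Return a list of problems found; an empty list means the record looks consistent."""
    problems = []

    if record.get("type") not in ("bound", "relation"):
        problems.append(f"unexpected type: {record.get('type')!r}")
    if record.get("data_split") not in ("train", "test", "dev"):
        problems.append(f"unexpected data_split: {record.get('data_split')!r}")

    # 'choices' is null for bound problems and a list for relation problems.
    choices = record.get("choices")
    if record.get("type") == "bound" and choices is not None:
        problems.append("bound problems should have choices set to null")
    if record.get("type") == "relation" and not isinstance(choices, list):
        problems.append("relation problems should carry a list of choices")

    return problems


# Made-up record that follows the documented attributes.
sample_record = {
    "data_id": 0,
    "problem": "For positive reals a and b, find the largest constant C "
               "such that a/b + b/a >= C always holds.",
    "type": "bound",
    "data_split": "train",
    "answer": "C = 2",
    "solution": "By AM-GM, a/b + b/a >= 2, with equality when a = b.",
    "theorems": {},
    "choices": None,
}

print(check_record(sample_record))
```

A check like this is mainly useful as a sanity test after downloading or converting the data; it does not validate the mathematical content.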

The theorem set is provided in JSON format and contains the following attributes:
```json
Theorem_id: [string] The ID of the theorem. For example, 'Theorem_1' is the ID for the first theorem
            {
            "Nickname": [list] A list of nicknames of the theorem,
            "Theorem": [string] The content of the theorem,
            "Theorem_Category": [string] the category of the theorem
            }
```
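
Given that layout, theorems can be grouped by category with a small helper. This is a sketch that assumes the theorem set has been loaded into a dictionary mapping each theorem ID to its entry; the helper name `group_by_category`, the sample entries, and the category names are placeholders for illustration, not values from the dataset.

```python
from collections import defaultdict


def group_by_category(theorems):
    """Map each Theorem_Category to the list of theorem IDs in that category."""
    groups = defaultdict(list)
    for theorem_id, entry in theorems.items():
        groups[entry["Theorem_Category"]].append(theorem_id)
    return dict(groups)


# Made-up entries that mimic the documented structure.
sample_theorems = {
    "Theorem_1": {
        "Nickname": ["AM-GM inequality"],
        "Theorem": "For non-negative reals a and b, (a + b) / 2 >= sqrt(ab).",
        "Theorem_Category": "Placeholder category A",
    },
    "Theorem_2": {
        "Nickname": ["Cauchy-Schwarz inequality"],
        "Theorem": "For reals a_i and b_i, (sum a_i b_i)^2 <= (sum a_i^2)(sum b_i^2).",
        "Theorem_Category": "Placeholder category B",
    },
}

grouped = group_by_category(sample_theorems)
print(grouped)
```

The same helper applies to the `theorems` field of an individual problem, since it uses the same per-theorem structure.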

# Dataset Examples
Training examples of **IneqMath**:
<div align="center">
    <img src="assets/train_bound_example.png" width="650" alt="Train Bound Example">
    <img src="assets/train_relation_example.png" width="650" alt="Train Relation Example">
</div>

Testing examples of **IneqMath**:

<div align="center">
    <img src="assets/test_bound_example.png" width="650" alt="Test Bound Example">
    <img src="assets/test_relation_example.png" width="650" alt="Test Relation Example">
</div>

# LLM Judge Performance

The confusion matrices and performance metrics of our five LLM-as-judge evaluators are shown below; they exhibit strong agreement with human labels.

<div align="center">

  <img src="./assets/confusion_matrix_judge_metrix.png" alt="judge_confusion_matrix" width="800"/>
  <img src="./assets/table_judge_metrics.png" alt="table_judge_matrix" width="650"/>

</div>

# Scaling law in model size
The following two figures show how <em>final-answer accuracy</em> (which evaluates only the correctness of the final predicted answer) and <em>overall accuracy</em> (which requires both a correct answer and valid intermediate reasoning steps) scale with model size for LLMs.

<div align="center">

  <img src="./assets/scaling_law_model_size_answer_acc_log_all.png" alt="scaling_curve_answer_acc" width="700"/>
  <img src="./assets/scaling_law_model_size_overall_acc_log_all.png" alt="scaling_curve_overall_acc" width="700"/>

</div>

# License

The new contributions to our dataset are distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

The copyright of the images and the questions belongs to the original authors. Alongside this license, the following conditions apply:

- **Purpose:** The test split was primarily designed for use as a test set.
- **Commercial Use:** The test split can be used commercially as a test set, but using it as a training set is prohibited. By accessing or using this dataset, you acknowledge and agree to abide by these terms in conjunction with the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

# Citation

If you use the **IneqMath** dataset in your work, please kindly cite the paper using this BibTeX:

```
TODO
```



