---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
configs:
- config_name: default
  data_files:
  - split: test_easy
    path: "TabularGSM_Easy_csv.json"
  - split: test_medium
    path: "TabularGSM_Medium_csv.json"
  - split: test_hard
    path: "TabularGSM_Hard_csv.json"
  - split: test_robust
    path: "TabularGSM_Robustness_csv.json"
---
# 📊 TabularGSM

This repository provides the `Benchmark` dataset, a curated and structured collection of table-based math reasoning problems derived from the GSM8K dataset. It is designed for **standardized evaluation** and **fair comparison** of reasoning models in tabular contexts, especially under varied difficulty and robustness requirements.

## 🧾 Dataset Usage Options

We provide two usage options for the dataset:

### [1] Using CSV Files

We offer a set of JSON metadata files, which are the files the `default` config loads:

- `TabularGSM_Easy_csv.json`
- `TabularGSM_Medium_csv.json`
- `TabularGSM_Hard_csv.json`
- `TabularGSM_Robustness_csv.json`

In these files, the `table` key corresponds to the path of a pre-processed CSV file. All the CSV files are included in the `csv_zip` archive.

Users can download the CSV files locally and then load the data using the `datasets` library for further use.
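As a minimal sketch of this workflow, the snippet below loads one metadata file with the `datasets` JSON loader and reads the CSV that a record points to. It assumes each record is a JSON object whose `table` value is a path relative to the folder where `csv_zip` was extracted; the helper names `load_split` and `read_table` and the `csv_root` argument are hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def load_split(json_path):
    """Load one metadata file with the `datasets` JSON loader (imported lazily)."""
    from datasets import load_dataset

    return load_dataset("json", data_files={"test": str(json_path)}, split="test")


def read_table(record, csv_root):
    """Read the CSV referenced by a record's `table` key into a list of rows.

    Assumption: `record["table"]` is a path relative to `csv_root`,
    the folder where the CSV archive was extracted.
    """
    csv_path = Path(csv_root) / record["table"]
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

A typical call would be `rows = read_table(load_split("TabularGSM_Easy_csv.json")[0], "path/to/extracted/csvs")`, where the second argument is a placeholder for your local extraction folder.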

### [2] Using Serialized JSON Tables

We also provide another set of metadata files:

- `TabularGSM_easy`
- `TabularGSM_medium`
- `TabularGSM_hard`
- `TabularGSM_robustness`

In these files, the value of the `table` key is the serialized table directly in JSON format.

Users can download the corresponding JSON files and use them directly.
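The sketch below shows one way to read such an inline table. The exact serialization is not documented here, so the helper accepts both cases as an assumption: a JSON-encoded string, or an already-parsed object. The `parse_table` name and the sample record are hypothetical.

```python
import json


def parse_table(value):
    """Return the table as Python objects, decoding it first if it is a JSON string."""
    if isinstance(value, str):
        return json.loads(value)
    return value


# Hypothetical record for illustration; real records may use a different layout.
record = {"table": '{"item": ["apple", "pear"], "price": [3, 5]}'}
table = parse_table(record["table"])
```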



## 📦 Dataset Overview

The dataset is constructed through our custom `Pipeline`, using the GSM8K test set as the base. It consists of **~3,400 examples** split into four subsets based on reasoning difficulty and robustness:

- **Easy** (810 samples): Simple tabular problems with minimal structural variation.
- **Medium** (797 samples): More complex table structures with shuffled rows.
- **Hard** (797 samples): Additional augmentations such as column modifications make information retrieval more challenging.
- **Robust** (1000 samples): A specialized diagnostic set combining well-defined and trap problems for evaluating robustness.

Each subset is designed to stress different reasoning capabilities and is augmented accordingly.

## 📊 Augmentation Strategies

The following table summarizes the augmentation techniques applied to each subset:

| Subset | RowAug | Shuffle | ColAug | InfMod |
| ------ | ------ | :-----: | :----: | :----: |
| Easy   | 10     |         |        |        |
| Medium | 20     |    ✔    |        |        |
| Hard   | 20     |    ✔    |   4    |        |
| Robust | 20     |    ✔    |        |   ✔    |

> ✔ Checkmarks indicate usage of a specific augmentation strategy; numbers indicate specific usage counts.

### Definitions

- **RowAug**: Addition of distracting or redundant rows.
- **Shuffle**: Row order is randomized to challenge sequential assumptions.
- **ColAug**: Column structure or labeling is perturbed.
- **InfMod**: Introduces logical inconsistencies or missing information (used only in Robust).
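To make the first two definitions concrete, here is a toy sketch of RowAug and Shuffle on a small table. It is an illustration under stated assumptions, not the `Pipeline` used to build the dataset: the functions `row_aug` and `shuffle_rows`, the sample rows, and the way distractors are chosen are all hypothetical.

```python
import random


def row_aug(rows, distractors, n):
    """Return a new table with the first n distractor rows appended."""
    return rows + distractors[:n]


def shuffle_rows(rows, seed=0):
    """Return a copy of the table with rows in a seeded random order."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy table and distractor rows, invented for this example.
table = [{"item": "apple", "price": 3}, {"item": "pear", "price": 5}]
distractors = [{"item": "plum", "price": 2}, {"item": "kiwi", "price": 4}]

# RowAug followed by Shuffle, as in the Medium configuration (with n=2 instead of 20).
augmented = shuffle_rows(row_aug(table, distractors, n=2), seed=42)
```

Both helpers return new lists, so the original `table` is left unchanged and the same base problem can be augmented several ways.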

## 🧠 Subset Details

- **Easy -> Hard**: These subsets progressively increase the structural complexity of tables. All require extracting relevant data but do **not** include intermediate variables or multi-step logic chains.
- **Robust**: This subset is crafted for testing **reasoning robustness**:
  - 50% well-defined problems (medium difficulty)
  - 25% trap problems with **contradictory** information
  - 25% trap problems with **missing** information
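As a quick arithmetic check, the stated mix can be converted to per-category counts for the 1000-sample Robust subset. This assumes the percentages are exact; the category keys are hypothetical labels, not field names from the dataset.

```python
ROBUST_TOTAL = 1000
ROBUST_MIX = {"well_defined": 0.50, "contradictory": 0.25, "missing": 0.25}

# Expected number of problems per category under the stated percentages.
robust_counts = {kind: round(ROBUST_TOTAL * share) for kind, share in ROBUST_MIX.items()}
print(robust_counts)  # {'well_defined': 500, 'contradictory': 250, 'missing': 250}
```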