RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu; Zhihan Zhang; Kate Lin; Yuwei Zhang; Akshay Paruchuri; Hong Yu; Mehran Kazemi; Kumar Ayush; A. Ali Heydari; Maxwell A Xu; Yun Liu; Ming-Zher Poh; Yuzhe Yang; Mark Malhotra; Shwetak Patel; Hamid Palangi; Xuhai Xu; Daniel McDuff; Tim Althoff; Xin Liu

Published: 18 Sept 2025, Last Modified: 30 Oct 2025

Venue: NeurIPS 2025 Datasets and Benchmarks Track poster

License: CC BY 4.0

Keywords: LLMs, Agents, Tabular Reasoning, Data Science

TL;DR: A framework and benchmark to evaluate language models' reasoning on imperfect tabular data

Abstract: Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness (the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies) remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework that simulates data artifacts via programmatic perturbations, enabling targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
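To make the idea of simulating data artifacts via programmatic perturbations concrete, here is a minimal sketch in Python. It is not the RADAR framework's code: the helper names (`inject_missing`, `inject_outlier`, `naive_mean`, `aware_mean`), the toy step-count table, and the plausibility threshold are all assumptions chosen for illustration. It perturbs a clean table with one missing value and one outlier, then contrasts an artifact-blind aggregate with one that handles the artifacts.

```python
import copy
import statistics


def inject_missing(rows, column, indices):
    """Return a copy of rows with `column` set to None at the given row indices."""
    out = copy.deepcopy(rows)
    for i in indices:
        out[i][column] = None
    return out


def inject_outlier(rows, column, index, factor=100):
    """Return a copy of rows with one value in `column` multiplied by `factor`."""
    out = copy.deepcopy(rows)
    out[index][column] = out[index][column] * factor
    return out


def naive_mean(rows, column):
    """Artifact-blind mean: treats missing values as 0 and keeps every row."""
    return statistics.mean([(r[column] or 0) for r in rows])


def aware_mean(rows, column, max_plausible):
    """Mean that skips missing values and values above a plausibility bound.

    `max_plausible` is a hypothetical, domain-specific threshold.
    """
    values = [
        r[column]
        for r in rows
        if r[column] is not None and r[column] <= max_plausible
    ]
    return statistics.mean(values)


# Hypothetical clean table: daily step counts.
clean = [
    {"day": d, "steps": s}
    for d, s in enumerate([8000, 7500, 8200, 7900, 8400], start=1)
]

# Apply two perturbations: a missing value on day 2, an outlier on day 4.
perturbed = inject_outlier(inject_missing(clean, "steps", [1]), "steps", 3)

if __name__ == "__main__":
    print("clean mean:", naive_mean(clean, "steps"))
    print("naive mean on perturbed table:", naive_mean(perturbed, "steps"))
    print("data-aware mean:", aware_mean(perturbed, "steps", max_plausible=100_000))
```

Because each perturbation returns a copy, the clean table stays available as a reference, which is what allows a paired comparison between behavior on the clean table and on its perturbed counterpart.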

Croissant File: json

Dataset URL: https://huggingface.co/datasets/kenqgu/RADAR

Supplementary Material: zip

Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling

Flagged For Ethics Review: true

Submission Number: 1350
