FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

Published: 18 Nov 2025, Last Modified: 18 Nov 2025 · AITD@EurIPS 2025 Poster · CC BY 4.0
Submission Type: Recently published work (link only)
Keywords: natural language generation evaluation, evaluation, benchmark, table-to-text generation, insight generation, natural language generation, data-to-text generation, large language models
Abstract: Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, evaluation on existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as by domain imbalance. We introduce FreshTab, an on-the-fly table-to-text benchmark generated from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method score clearly worse on automatic metrics, but this is not reflected in LLM and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.
Published Paper Link: https://arxiv.org/abs/2510.13598
Relevance Comments: FreshTab introduces a dynamic benchmark for evaluating AI models on tabular data by generating up-to-date tables from newly added Wikipedia pages, aligning with the workshop’s focus on methods and benchmarks for tabular data. It enables realistic, multilingual, and domain-balanced evaluation. We evaluate it on table-to-text generation (insight generation), assessing factuality with automatic metrics, human judgments, and LLM-as-a-judge. We found that automatic metrics correlate poorly with human judgments, while LLM-as-a-judge evaluators align better, and that domain effects reveal model weaknesses.
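The collection idea described above (pulling tables from newly created Wikipedia pages, then balancing by domain) can be sketched as follows. This is an illustrative outline only, not the paper's actual pipeline: the function names `recent_new_pages_url` and `balance_by_domain`, the domain tags, and the sampling scheme are assumptions; only the MediaWiki `recentchanges` API parameters are standard.

```python
import random
from collections import defaultdict
from urllib.parse import urlencode

def recent_new_pages_url(lang="en", limit=50):
    """Build a MediaWiki API query for recently created main-namespace pages.

    Fetching recent page creations (rather than all pages) is what keeps the
    benchmark ahead of an LLM's training cutoff.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "new",      # only page creations
        "rcnamespace": 0,     # main (article) namespace
        "rclimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)

def balance_by_domain(pages, per_domain, seed=0):
    """Sample an equal number of pages per domain tag (hypothetical scheme).

    Each page is a dict with a "domain" key; drawing the same number from
    every domain avoids the imbalance the abstract mentions.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for page in pages:
        by_domain[page["domain"]].append(page)
    sample = []
    for domain in sorted(by_domain):
        items = by_domain[domain]
        rng.shuffle(items)
        sample.extend(items[:per_domain])
    return sample
```

For a non-English benchmark, only the language subdomain changes, e.g. `recent_new_pages_url("de")` queries German Wikipedia.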
Published Venue And Year: INLG 2025
Submission Number: 34