OrQA – Open Data Retrieval for Question Answering Dataset Generation

Giovanni Malaguti; Angelo Mozzillo; Giovanni Simonini

Published: 05 Jun 2025, Last Modified: 29 Jun 2025, Venue: TRL@ACL2025, License: CC BY 4.0

Keywords: Table Question Answering, Dataset Generation, Large Language Models, Open Data, SQL, Data Integration, Agents

TL;DR: We introduce OrQA, an agentic framework that generates realistic tabular QA datasets from open government data using LLMs to enable scalable and diverse dataset creation.

Abstract: We present OrQA, a novel agentic framework for generating large-scale tabular question-answering (TQA) datasets from real-world open data. Such datasets are needed to overcome the limitations of existing benchmarks, which rely on synthetic questions or a limited set of web tables. OrQA employs LLM agents to retrieve related open data tables, generate natural questions, and synthesize executable $\texttt{SQL}$ queries involving joins, unions, and other non-trivial operations. Using hundreds of GPU hours on four NVIDIA A100 GPUs, we applied OrQA to Canadian and UK government open data to produce 1,000 question-tables-SQL triples, a representative sample of which has been human-validated. The resulting open-source dataset is publicly available to support transparency, reproducibility, and progress in table-based question answering.
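
To make the notion of a question-tables-SQL triple concrete, the following is a minimal sketch of one such record and of a check that its query is executable. It is not OrQA's implementation: the `Triple` class, its field names, the `execute_triple` helper, the toy tables, and the use of in-memory SQLite are all assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    """One question-tables-SQL record (hypothetical field names)."""
    question: str
    tables: dict[str, list[tuple]]  # table name -> rows
    schemas: dict[str, str]         # table name -> column definitions
    sql: str


def execute_triple(triple: Triple) -> list[tuple]:
    """Load the tables into in-memory SQLite and run the query.

    Raises sqlite3.Error if the query is not executable on these tables.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for name, columns in triple.schemas.items():
            conn.execute(f"CREATE TABLE {name} ({columns})")
            rows = triple.tables[name]
            if rows:
                placeholders = ", ".join("?" for _ in rows[0])
                conn.executemany(
                    f"INSERT INTO {name} VALUES ({placeholders})", rows
                )
        return conn.execute(triple.sql).fetchall()
    finally:
        conn.close()


# Made-up toy data standing in for two related open data tables.
example = Triple(
    question="What is the total budget of each city with a listed population?",
    tables={
        "city_population": [("Ottawa", 1000), ("Leeds", 800)],
        "city_budget": [("Ottawa", 50), ("Ottawa", 25), ("Leeds", 40)],
    },
    schemas={
        "city_population": "city TEXT, population INTEGER",
        "city_budget": "city TEXT, amount INTEGER",
    },
    sql=(
        "SELECT p.city, SUM(b.amount) AS total "
        "FROM city_population AS p "
        "JOIN city_budget AS b ON p.city = b.city "
        "GROUP BY p.city ORDER BY p.city"
    ),
)

result = execute_triple(example)
```

In this sketch the answer is obtained by running the join over both tables, so a triple whose query fails to execute can be detected and discarded before any human validation.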

Include In Proceedings: Yes

Copyright Form: pdf

Submission Number: 20
