StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

TMLR Paper 5631 Authors

14 Aug 2025 (modified: 27 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce $\textbf{StructEval}$, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: $\textbf{1)}$ generation tasks, producing structured output from natural language prompts, and $\textbf{2)}$ conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only $75.58$, with open-source alternatives lagging approximately $10$ points behind. We find that generation tasks are more challenging than conversion tasks, and that producing correct visual content is more difficult than generating text-only structures.
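To illustrate the kind of check the format-adherence metric implies for non-renderable formats, here is a minimal sketch (not the authors' implementation), assuming adherence means the model's output parses cleanly as the requested format; renderable formats such as HTML, React, and SVG would instead require rendering or DOM/AST comparison. The function name `adheres_to_format` is hypothetical, and the YAML branch assumes the PyYAML package is available.

```python
import csv
import io
import json

import yaml  # PyYAML; assumed available for the YAML check


def adheres_to_format(text: str, fmt: str) -> bool:
    """Return True if `text` parses cleanly as the requested non-renderable format."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "yaml":
            yaml.safe_load(text)
        elif fmt == "csv":
            rows = list(csv.reader(io.StringIO(text)))
            # Require a consistent column count across non-empty rows.
            widths = {len(row) for row in rows if row}
            return len(widths) == 1
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except Exception:
        return False


print(adheres_to_format('{"name": "StructEval"}', "json"))  # True
print(adheres_to_format("key: [unclosed", "yaml"))          # False
```

A full structural-correctness score would go further, e.g. comparing the parsed structure against a task-specific reference schema rather than only checking parseability.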
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=vctkiwQvvb&nesting=2&sort=date-desc
Changes Since Last Submission: Changed the font to the correct TMLR format and compared it against existing submissions.
Assigned Action Editor: ~Frederic_Sala1
Submission Number: 5631