Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce **StructEval**, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: **(1)** *generation* tasks, which produce structured output from natural language prompts, and **(2)** *conversion* tasks, which translate between structured formats. Our benchmark encompasses $18$ formats and $44$ task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only $75.58$, with open-source alternatives lagging approximately $10$ points behind. We find that generation tasks are more challenging than conversion tasks, and that producing correct visual content is more difficult than generating text-only structures.
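As a rough illustration of the format-adherence idea mentioned in the abstract, the sketch below simply tests whether a model's output parses under the requested non-renderable format. This is a hypothetical example under our own assumptions (the function name, the PyYAML dependency, and the parsing logic are not StructEval's released scoring code), intended only to make the notion of "format adherence" concrete.

```python
import csv
import io
import json

import yaml  # PyYAML; assumed available as a third-party dependency


def check_format_adherence(output: str, fmt: str) -> bool:
    """Hypothetical check: does the model output parse as the requested format?

    Not StructEval's actual metric; shown only to illustrate format adherence.
    """
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "yaml":
            # Note: YAML is permissive and accepts most plain scalars.
            yaml.safe_load(output)
        elif fmt == "csv":
            # Require at least one row and a consistent column count across rows.
            rows = list(csv.reader(io.StringIO(output)))
            if not rows or len({len(r) for r in rows}) != 1:
                return False
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, yaml.YAMLError):
        return False


# Example: outputs from a generation task, checked against their target formats.
print(check_format_adherence('{"name": "StructEval", "formats": 18}', "json"))  # True
print(check_format_adherence('name: StructEval\nformats: 18', "yaml"))          # True
```

Structural correctness (e.g., required keys, column counts, or renderable layout) would need additional checks beyond parsing; the snippet above covers only the adherence side.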
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models; Evaluation; Structure Generation; LLM for Visualization
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 6919