Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce **StructEval**, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: **(1)** *generation* tasks, which produce structured output from natural language prompts, and **(2)** *conversion* tasks, which translate between structured formats. Our benchmark encompasses $18$ formats and $44$ task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only $75.58$, with open-source alternatives lagging approximately $10$ points behind. We find that generation tasks are more challenging than conversion tasks, and that producing correct visual content is more difficult than generating text-only structures.
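As a rough illustration of the format-adherence idea mentioned in the abstract, the sketch below simply tests whether a model's output parses under the requested non-renderable format. This is a hypothetical example under our own assumptions (the function name, the PyYAML dependency, and the parsing logic are not StructEval's released scoring code), intended only to make the notion of "format adherence" concrete.

```python
import csv
import io
import json

import yaml  # PyYAML; assumed available as a third-party dependency


def check_format_adherence(output: str, fmt: str) -> bool:
    """Hypothetical check: does the model output parse as the requested format?

    Not StructEval's actual metric; shown only to illustrate format adherence.
    """
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "yaml":
            # Note: YAML is permissive and accepts most plain scalars.
            yaml.safe_load(output)
        elif fmt == "csv":
            # Require at least one row and a consistent column count across rows.
            rows = list(csv.reader(io.StringIO(output)))
            if not rows or len({len(r) for r in rows}) != 1:
                return False
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, yaml.YAMLError):
        return False


# Example: outputs from a generation task, checked against their target formats.
print(check_format_adherence('{"name": "StructEval", "formats": 18}', "json"))  # True
print(check_format_adherence('name: StructEval\nformats: 18', "yaml"))          # True
```

Structural correctness (e.g., required keys, column counts, or renderable layout) would need additional checks beyond parsing; the snippet above covers only the adherence side.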
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models; Evaluation; Structure Generation; LLM for Visualization
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 6919