Abstract: Despite the impressive capabilities of Large Language Models (LLMs) such as GPT-4, they still encounter challenges when generating complex, structured outputs. This study assesses the current capability of LLMs in generating structured data and proposes a novel structure-aware fine-tuning approach to improve it. Here we introduce Struc-Bench, a benchmark spanning text tables, HTML, and LaTeX formats, on which we evaluate representative LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna). To construct the benchmark, we employ FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Moreover, given the lack of task-specific metrics, we introduce two novel metrics: P-Score (Prompting Score) and H-Score (Heuristical Score). Experimental results demonstrate that our structure-aware fine-tuning approach, applied to LLaMA-7B, significantly improves adherence to natural language constraints, surpassing the other evaluated LLMs. Our analysis reveals common errors and areas open for improvement. Accordingly, we present an ability map across six dimensions (coverage, formatting, reasoning, comprehension, pragmatics, and hallucination), suggesting promising directions for future research.
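To make the idea of a heuristic, format-aware metric concrete, the following is a minimal sketch of how an H-Score-style comparison could be computed for pipe-delimited text tables: parse generated and target outputs into cell grids, then combine a shape-agreement term with cell-level exact-match overlap. All function names, weights, and parsing rules here are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of a heuristic, format-aware table comparison in the
# spirit of an H-Score. Parsing rules and the 0.5/0.5 weighting are
# illustrative assumptions, not the paper's actual metric.

def parse_text_table(table: str) -> list[list[str]]:
    """Split a pipe-delimited text table into a grid of stripped cells."""
    rows = []
    for line in table.strip().splitlines():
        # Skip separator rows such as |---|---|
        if set(line) <= {"|", "-", " ", ":"}:
            continue
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        rows.append(cells)
    return rows

def heuristic_score(generated: str, target: str) -> float:
    """Return a 0-1 similarity combining shape match and in-place cell overlap."""
    gen, tgt = parse_text_table(generated), parse_text_table(target)
    if not tgt:
        return 0.0
    # Structural component: do row and column counts agree?
    shape_ok = float(len(gen) == len(tgt) and
                     all(len(g) == len(t) for g, t in zip(gen, tgt)))
    # Content component: fraction of target cells reproduced exactly in place.
    total = sum(len(row) for row in tgt)
    matched = sum(
        1
        for g_row, t_row in zip(gen, tgt)
        for g_cell, t_cell in zip(g_row, t_row)
        if g_cell == t_cell
    )
    return 0.5 * shape_ok + 0.5 * (matched / total)

if __name__ == "__main__":
    target = "| Name | Score |\n|---|---|\n| A | 1 |\n| B | 2 |"
    generated = "| Name | Score |\n|---|---|\n| A | 1 |\n| B | 3 |"
    print(round(heuristic_score(generated, target), 3))  # 0.917: right shape, one wrong cell
```

A prompting-based P-Score would instead ask a judge LLM to rate the same pair of outputs; the heuristic above is only meant to show what a rule-based counterpart might look like.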
Paper Type: short
Research Area: Generation
Contribution Types: Data resources
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.