Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

Anonymous

16 Oct 2023 | ACL ARR 2023 October Blind Submission | Readers: Everyone

Abstract: Despite the impressive capabilities of Large Language Models (LLMs) such as GPT-4, they still struggle to generate complex, structured outputs. This study assesses the current capability of LLMs to generate structured data and proposes a novel structure-aware fine-tuning approach to improve it. We introduce Struc-Bench, a benchmark that evaluates representative LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna) on text tables, HTML, and LaTeX formats. To construct the benchmark, we employ FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Because task-specific metrics are lacking, we also introduce two novel metrics: P-Score (Prompting Score) and H-Score (Heuristical Score). Experimental results demonstrate that our structure-aware fine-tuning approach, applied to LLaMA-7B, significantly improves adherence to natural language constraints and surpasses the other evaluated LLMs. Our analysis reveals common errors and areas open for improvement. Accordingly, we present an ability map across six dimensions (coverage, formatting, reasoning, comprehension, pragmatics, and hallucination), suggesting promising directions for future research.

Paper Type: short

Research Area: Generation

Contribution Types: Data resources

Languages Studied: English

Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.
