Abstract: The zero- and few-shot prompting paradigms of large language models (LLMs) have made language-related tasks substantially more accessible and flexible by removing the need for task-specific architecture design or supervision. Alongside this convenience, however, these paradigms require users to devise an output format, include it in the prompt, and rely on the LLM to adhere to it faithfully. To study how well LLMs comply with such format specifications, we introduce the concept of format faithfulness. Building on a formal definition and a detailed taxonomy of format faithfulness, we present FormatBench, a benchmark that covers all categories in our taxonomy and a wide range of LLM application scenarios. Extensive experiments on FormatBench reveal that even state-of-the-art LLMs can struggle to generate basic structured output as instructed. To improve the format faithfulness of LLMs, we design and implement three adaptation approaches: format regulation, format tuning, and format refinement. Detailed analyses of these approaches validate their effectiveness, improving the format faithfulness rate by up to 9.8%. Our code and datasets are publicly available at Anonymous Link.
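For illustration only (not part of the paper): format faithfulness can be thought of as a binary check of whether a model's raw output conforms to the format requested in the prompt. The minimal Python sketch below assumes a hypothetical task whose prompt asks for a JSON object with fixed keys; the benchmark's actual tasks and checkers may differ.

    # Toy format-faithfulness check (hypothetical schema, not FormatBench's).
    import json

    REQUIRED_KEYS = {"label", "confidence"}  # assumed output schema for illustration

    def is_format_faithful(output: str) -> bool:
        """Return True if `output` parses as a JSON object containing the required keys."""
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)

    if __name__ == "__main__":
        print(is_format_faithful('{"label": "positive", "confidence": 0.9}'))  # True
        print(is_format_faithful("The label is positive."))                    # False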
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Theory
Languages Studied: English, German