Automatically Enhanced Instruction-following Capabilities of Large Language Models via Execution Feedback

ACL ARR 2024 June Submission 4837 Authors

16 Jun 2024 (modified: 20 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: One core capability of large language models (LLMs) is to follow natural language instructions. However, how to automatically construct high-quality training data that enhances the complex instruction-following abilities of LLMs without manual annotation remains an open problem. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification: LLMs generate instructions, the corresponding code that checks whether a response satisfies each instruction, and unit test samples that verify the correctness of that code. Execution-feedback-based rejection sampling then produces data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to leading open-source LLMs, Qwen2 and Llama3, in both self-alignment and strong-to-weak distillation settings.
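The following is a minimal, illustrative Python sketch of the pipeline described in the abstract: an LLM-generated verification function is first validated against LLM-generated unit tests via execution feedback, and only then used for rejection sampling of responses. The helpers `llm_generate` and `sample_responses`, and the exact prompt and data formats, are hypothetical placeholders, not the authors' actual implementation.

```python
# Sketch of execution-feedback-based verification and rejection sampling.
# Assumptions: llm_generate returns Python source for the checker prompt and a
# list of (response, expected_bool) pairs for the test-case prompt;
# sample_responses returns n candidate responses for an instruction.

def build_checker(instruction, llm_generate):
    """Keep an LLM-written checker only if it passes its own unit tests."""
    checker_src = llm_generate(
        f"Write a Python function check(response) -> bool that verifies "
        f"a response follows this instruction: {instruction}"
    )
    test_cases = llm_generate(
        f"Give (response, expected_bool) unit test cases for: {instruction}"
    )

    namespace = {}
    try:
        exec(checker_src, namespace)              # execute the generated checker code
        check = namespace["check"]
        for response, expected in test_cases:     # execution feedback: run unit tests
            if bool(check(response)) != bool(expected):
                return None                       # discard checkers that fail their tests
    except Exception:
        return None                               # discard checkers that crash
    return check


def collect_sft_pairs(instruction, check, sample_responses, k=8):
    """Rejection sampling: keep only responses accepted by the verified checker."""
    kept = []
    for response in sample_responses(instruction, n=k):
        try:
            if check(response):
                kept.append((instruction, response))
        except Exception:
            pass                                  # a crashing check counts as a rejection
    return kept
```

Accepted pairs can then serve as SFT data, while accepted/rejected response pairs for the same instruction can form preference data for Offline or Online DPO, in line with the training setups named in the abstract.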
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Instruction Following, Large Language Models, Execution Feedback, On-policy Learning, Strong-to-Weak Distillation, Self-Alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 4837