FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

ACL ARR 2024 June Submission281 Authors

09 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, annotating instruction datasets has traditionally been an expensive and laborious process, often reliant on manual annotation or costly API calls to proprietary LLMs. To address these challenges, we introduce Fanno, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Using a Mistral-7B-Instruct model, Fanno efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and the AlpacaEval benchmark show that Fanno can generate high-quality data with diversity and complexity at no cost, comparable to human-annotated or cleaned datasets such as Alpaca-GPT4-Cleaned.
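The abstract only names the three stages of the pipeline (document pre-screening, instruction generation, response generation). Below is a minimal, hypothetical sketch of how such a pipeline could be wired together around any open-source instruct model; the function names, prompts, and length-based filter are illustrative assumptions and are not taken from the paper itself.

```python
# Hypothetical sketch of a FANNO-style annotation pipeline.
# Prompts, the pre-screening heuristic, and all function names are
# assumptions for illustration; they are not the paper's implementation.
from typing import Callable, Dict, List


def pre_screen(documents: List[str], min_len: int = 200, max_len: int = 2000) -> List[str]:
    """Keep seed documents within a length band (placeholder heuristic)."""
    return [d for d in documents if min_len <= len(d) <= max_len]


def generate_instruction(llm: Callable[[str], str], document: str) -> str:
    """Ask the model to write an instruction grounded in the document."""
    prompt = (
        "Read the passage below and write one clear, self-contained "
        f"instruction that it can answer.\n\nPassage:\n{document}\n\nInstruction:"
    )
    return llm(prompt).strip()


def generate_response(llm: Callable[[str], str], instruction: str, document: str) -> str:
    """Ask the model to answer the generated instruction using the document."""
    prompt = f"Passage:\n{document}\n\nInstruction:\n{instruction}\n\nResponse:"
    return llm(prompt).strip()


def annotate(llm: Callable[[str], str], documents: List[str]) -> List[Dict[str, str]]:
    """Run the three stages end to end, collecting instruction-response pairs."""
    dataset = []
    for doc in pre_screen(documents):
        instruction = generate_instruction(llm, doc)
        response = generate_response(llm, instruction, doc)
        dataset.append({"instruction": instruction, "output": response})
    return dataset


if __name__ == "__main__":
    # Plug in any open-source instruct model (e.g. Mistral-7B-Instruct)
    # behind a simple text-in / text-out callable.
    def dummy_llm(prompt: str) -> str:
        return "placeholder completion"

    seed = "A seed document long enough to pass the illustrative length filter. " * 5
    print(annotate(dummy_llm, [seed]))
```

In practice the callable would wrap a local inference server or a `transformers` generation loop, and the filtering and prompting stages would carry the quality and diversity criteria the paper describes.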
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Instruction-Tuning; Free Generation
Contribution Types: NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 281