Keywords: Large Language Model, fine-tuning, data efficiency, task difficulty, annotation cost reduction
TL;DR: We explore and propose metrics to efficiently estimate the fine-tuning data size required to achieve a desired performance level.
Abstract: While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task's \textit{data efficiency} --- i.e., the number of fine-tuning examples needed to achieve a desired level of performance --- is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle zero-shot but can attain stronger performance after fine-tuning. This motivates the need for methods to predict a task's data efficiency \textit{without} requiring incremental annotation. After introducing a concrete metric that quantifies a task's data efficiency, we propose using the \textit{gradient cosine similarity of low-confidence examples} as a way to predict data efficiency based on a small number of labeled samples. We validate our approach on the collected set of tasks with varying data efficiencies, attaining 8.6% error in overall data efficiency prediction and eliminating hundreds of unnecessary annotations. Our experimental results and implementation code are available in the supplementary material.
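To make the proposed signal concrete, below is a minimal sketch of computing gradient cosine similarity over low-confidence examples from a handful of labeled samples. The choice of model (`distilbert-base-uncased`), the confidence threshold, and restricting gradients to the classifier head are illustrative assumptions for brevity, not details specified in the abstract; the paper's actual implementation is in the supplementary material.

```python
# Hedged sketch: gradient cosine similarity of low-confidence examples as a
# rough data-efficiency signal. Model, threshold, and the restriction to
# classifier-head gradients are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"   # assumed stand-in model
CONFIDENCE_THRESHOLD = 0.6               # assumed cutoff for "low confidence"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def per_example_gradient(text: str, label: int) -> torch.Tensor:
    """Flattened gradient of the cross-entropy loss w.r.t. the classifier head."""
    model.zero_grad()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits
    loss = F.cross_entropy(logits, torch.tensor([label]))
    loss.backward()
    grads = [p.grad.detach().flatten() for p in model.classifier.parameters()]
    return torch.cat(grads)

def confidence(text: str) -> float:
    """Maximum softmax probability as a simple confidence score."""
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits
    return F.softmax(logits, dim=-1).max().item()

def gradient_cosine_score(samples: list[tuple[str, int]]) -> float:
    """Average pairwise cosine similarity of gradients from low-confidence samples."""
    low_conf = [s for s in samples if confidence(s[0]) < CONFIDENCE_THRESHOLD]
    grads = [per_example_gradient(text, label) for text, label in low_conf]
    if len(grads) < 2:
        return float("nan")  # not enough low-confidence examples to compare
    sims = [F.cosine_similarity(grads[i], grads[j], dim=0).item()
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return sum(sims) / len(sims)

if __name__ == "__main__":
    labeled_samples = [("great movie", 1), ("terrible plot", 0), ("boring film", 0)]
    print(f"gradient cosine score: {gradient_cosine_score(labeled_samples):.3f}")
```

Intuitively, higher agreement among the gradients of examples the model is unsure about suggests that a few annotations push the model in a consistent direction, i.e., a more data-efficient task; how this score maps onto the paper's data-efficiency metric is calibrated on the curated task set.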
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 18545