Keywords: Label Aggregation, LLM, Data Annotation
TL;DR: We train a language model to expertly aggregate conflicting labels and justifications from other LLMs, creating a highly accurate and general-purpose aggregator.
Abstract: The rise of large language models (LLMs) as annotators has introduced new opportunities and challenges for label aggregation in data annotation pipelines. While traditional aggregation methods are designed for human crowd workers with independent judgments, they fall short on LLM-generated annotations, which are highly correlated and come with rich explanatory justifications. To address these challenges, we introduce RFAgg, a reinforcement learning framework that dynamically aggregates LLM annotations by jointly modeling both labels and their corresponding justifications. To train RFAgg, we construct the AGG dataset by collecting question-answer pairs generated by different LLMs across various datasets. RFAgg then prompts LLMs to generate multiple aggregation responses, each containing reasoning tokens and a final answer, for every input, and updates the model with our proposed aggregation reward functions via a policy optimization algorithm. Experiments demonstrate that RFAgg significantly outperforms both classical and recent aggregation methods. Most notably, it serves as a general aggregation model, generalizing well to out-of-domain and previously unseen tasks. Despite being trained only on a limited set of classification tasks, RFAgg achieves an average improvement of 2.45% on diverse objective tasks and 5.2% on the subjective Alpaca 2.0 task over its base model. We will publicly release the AGG dataset and our source code.
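To make the reward-and-update step concrete, here is a minimal sketch of what an aggregation reward of this kind could look like. Everything in it is an assumption for illustration, not the paper's actual design: the `<think>`/`<answer>` response format, the `format_reward`/`accuracy_reward` decomposition, the 0.2/0.8 weighting, and the group-relative baseline used for the advantage.

```python
# Hypothetical sketch of an RFAgg-style aggregation reward, inferred only
# from the abstract. Tag format, reward decomposition, and weights are all
# assumptions for illustration.
import re

def parse_response(text: str):
    """Split an aggregator response into reasoning tokens and a final answer.
    Assumes (hypothetically) a <think>...</think><answer>...</answer> format."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None)

def aggregation_reward(response: str, gold_label: str) -> float:
    """Combine a format reward (well-formed reasoning plus answer) with an
    accuracy reward (final answer matches the gold label)."""
    reasoning, answer = parse_response(response)
    format_reward = 1.0 if reasoning is not None and answer is not None else 0.0
    accuracy_reward = 1.0 if answer is not None and answer.lower() == gold_label.lower() else 0.0
    return 0.2 * format_reward + 0.8 * accuracy_reward  # assumed weighting

# Scoring several sampled aggregation responses for one input; a
# group-relative policy update (GRPO-style) would use reward minus the
# group mean as the advantage for each response.
responses = [
    "<think>Annotators A and B agree; C's justification is weak.</think>"
    "<answer>positive</answer>",
    "<think>Split vote, leaning with annotator C.</think>"
    "<answer>negative</answer>",
]
rewards = [aggregation_reward(r, gold_label="positive") for r in responses]
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]
print(rewards, advantages)  # e.g. [1.0, 0.2] and [0.4, -0.4]
```

The policy optimization algorithm itself (which the abstract names but does not specify) would consume these per-response advantages to update the aggregator model.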
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6659