Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding

ACL ARR 2025 February Submission 2107 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Inductive coding traditionally relies on labor-intensive manual effort by human coders, who are prone to inconsistencies and individual biases. Although large language models (LLMs) offer promising automation capabilities, their standalone use often yields inconsistent outputs, limiting their reliability. In this work, we propose a framework that combines ensemble methods with a code refinement methodology to address these challenges. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate human consensus. To address the limitations of metrics such as ROUGE and BERTScore, we introduce a composite evaluation metric that combines code conciseness and contextual similarity; its validity is confirmed through correlation analysis with human expert ratings. Results demonstrate that smaller ensemble models with refined outputs consistently outperform other ensembles, individual models, and even large-scale LLMs such as GPT-4. This evidence that small ensembles significantly outperform larger standalone language models highlights the risk of relying solely on a single large model for qualitative analysis.
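
For illustration, the sketch below shows one way such a composite metric could be computed, blending a simple length-based conciseness score with embedding-based contextual similarity. The weighting scheme, the length target, and the choice of embedding model ("all-MiniLM-L6-v2" via sentence-transformers) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of a composite metric that
# blends code conciseness with contextual similarity. Weights, length target,
# and embedding model are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def conciseness(code: str, target_len: int = 5) -> float:
    """Score in [0, 1]; codes at or under target_len tokens score 1.0."""
    n_tokens = len(code.split())
    return min(1.0, target_len / max(n_tokens, 1))

def contextual_similarity(code: str, reference: str) -> float:
    """Cosine similarity between embeddings of a generated and a reference code."""
    emb = _model.encode([code, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def composite_score(code: str, reference: str, alpha: float = 0.5) -> float:
    """Weighted blend of conciseness and contextual similarity."""
    return alpha * conciseness(code) + (1 - alpha) * contextual_similarity(code, reference)

# Example: compare a generated inductive code against a human reference code.
print(composite_score("barriers to accessing mental health care",
                      "obstacles to mental health service access"))
```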
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: NLP tools for social analysis; human-computer interaction; evaluation and metrics
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 2107