Abstract: Large Language Models (LLMs) have shown promise in automating high-labor data tasks, but their adoption in high-stakes scenarios remains challenging due to two issues: their tendency to answer despite uncertainty and their difficulty handling long input contexts robustly.
We investigate LLMs' ability to identify low-confidence outputs for human review through "check set selection", a process in which LLMs prioritize the information that most needs human judgment.
Using a case study on social media monitoring for disaster risk management,
we define the "check set" as the list of tweets about whose classification the LLM is least confident, escalated to the disaster manager to enable human oversight within a budgeted effort.
We test two strategies for LLM check set selection: individual confidence elicitation, where the LLM assesses confidence for each tweet classification separately, requiring more prompts with shorter contexts; and direct set confidence elicitation, where the LLM evaluates confidence for a list of tweet classifications at once, using fewer prompts but longer contexts.
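To make the contrast concrete, the sketch below illustrates both strategies. It assumes a hypothetical `llm(prompt: str) -> str` completion function, and the prompt wording, output parsing, and tie-breaking are illustrative assumptions rather than the paper's exact protocol.

```python
from typing import Callable, List

def individual_check_set(llm: Callable[[str], str],
                         tweets: List[str], labels: List[str],
                         budget: int) -> List[int]:
    """One short prompt per tweet: elicit a confidence score in [0, 1] for
    each classification, then escalate the `budget` least-confident tweets."""
    confidences = []
    for tweet, label in zip(tweets, labels):
        reply = llm(
            f'Tweet: "{tweet}"\nPredicted label: {label}\n'
            "On a scale from 0 to 1, how confident are you that this "
            "label is correct? Answer with a single number."
        )
        confidences.append(float(reply.strip()))
    # Indices of the `budget` lowest-confidence classifications.
    return sorted(range(len(tweets)), key=lambda i: confidences[i])[:budget]

def direct_check_set(llm: Callable[[str], str],
                     tweets: List[str], labels: List[str],
                     budget: int) -> List[int]:
    """One long prompt over all tweets: ask the LLM to name the `budget`
    classifications it is least confident about."""
    listing = "\n".join(
        f'{i}. "{t}" -> {l}' for i, (t, l) in enumerate(zip(tweets, labels))
    )
    reply = llm(
        f"{listing}\n\nReturn the indices of the {budget} classifications "
        "you are least confident about, as comma-separated integers."
    )
    indices = [int(x) for x in reply.split(",")]
    # Guard against malformed output: duplicates, out-of-range indices,
    # or a set of the wrong size.
    indices = [i for i in dict.fromkeys(indices) if 0 <= i < len(tweets)]
    return indices[:budget]
```

The guard in the direct variant is there because a single long-context response must itself specify a well-formed set; the individual variant constructs the set deterministically from per-tweet scores, so only the confidence parsing can fail.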
Our key contributions are:
(1) we propose a novel performance metric for LLM-human collaboration in check set selection,
(2) we compare individual and direct set-based selection strategies across input sizes and aggregation methods, and
(3) we investigate LLMs' direct set selection capabilities from long-context inputs.
Our results reveal that set selection via individual probabilities is more reliable, although direct set confidence shows potential.
Challenges for direct set selection include inconsistent outputs, incorrect check set sizes, and low inter-annotator agreement.
Despite these challenges, our approach improves disaster tweet classification, demonstrating the potential of human-LLM collaboration.
Paper Type: Long
Research Area: Human-Centered NLP
Research Area Keywords: human-in-the-loop, prompting, human-AI interaction, human-centered evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1758