Handling Missing Responses under Cluster Dependence with Applications to Language Model Evaluation

Zhenghao Zeng; David Arbour; Avi Feller; Ishita Dasgupta; Atanu R. Sinha; Edward Kennedy

Published: 18 Sept 2025, Last Modified: 29 Oct 2025

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: Missing data, cluster dependence, doubly robust estimation, causal inference, evaluation, human annotation

Abstract: Human annotations play a crucial role in evaluating the performance of GenAI models. Two common challenges in practice, however, are missing annotations (the response variable of interest) and cluster dependence among human-AI interactions (e.g., questions asked by the same user may be highly correlated). Reliable inference must address both issues to achieve unbiased estimation and appropriately quantify uncertainty when estimating average scores from human annotations. In this paper, we analyze the doubly robust estimator, a widely used method in missing data analysis and causal inference, applied to this setting and establish novel theoretical properties under cluster dependence. We further illustrate our findings through simulations and a real-world conversation quality dataset. Our theoretical and empirical results underscore the importance of incorporating cluster dependence in missing response problems to perform valid statistical inference.
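To make the setting concrete, here is a minimal sketch of a textbook doubly robust (augmented inverse probability weighting) estimate of a mean score with missing annotations, shown with both a naive i.i.d. standard error and a cluster-robust one. It is an illustration under stated assumptions, not the paper's implementation: the function name `dr_mean_with_cluster_se`, its arguments, and the toy numbers are hypothetical, the nuisance estimates `pi_hat` (probability that an annotation is observed) and `mu_hat` (predicted score) are assumed to be supplied by the user, and the cluster-robust variance is the standard sum-of-cluster-totals formula with no small-sample correction.

```python
import numpy as np


def dr_mean_with_cluster_se(y, r, pi_hat, mu_hat, clusters):
    """Doubly robust mean with i.i.d. and cluster-robust standard errors.

    All names here are hypothetical and for illustration only.
    y        : responses (may be NaN where missing)
    r        : 1 if the response is observed, 0 otherwise
    pi_hat   : estimated probability that the response is observed
    mu_hat   : estimated conditional mean of the response
    clusters : cluster label for each row (e.g. a user id)
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    clusters = np.asarray(clusters).ravel()
    n = y.shape[0]

    # Replace missing responses by mu_hat so their residual is exactly zero
    # and NaNs do not propagate; those rows have r = 0 anyway.
    y_filled = np.where(r == 1, y, mu_hat)

    # Per-row doubly robust pseudo-outcome: regression prediction plus an
    # inverse-probability-weighted residual correction.
    phi = mu_hat + r / pi_hat * (y_filled - mu_hat)
    estimate = phi.mean()

    # Naive standard error that treats rows as independent.
    se_iid = phi.std(ddof=1) / np.sqrt(n)

    # Cluster-robust standard error: sum centered pseudo-outcomes within
    # each cluster, then combine the squared cluster totals.
    centered = phi - estimate
    _, inverse = np.unique(clusters, return_inverse=True)
    cluster_totals = np.bincount(inverse.ravel(), weights=centered)
    se_cluster = np.sqrt(np.sum(cluster_totals ** 2)) / n

    return estimate, se_iid, se_cluster


# Toy usage with made-up numbers: four interactions from two users,
# one of which has no annotation.
estimate, se_iid, se_cluster = dr_mean_with_cluster_se(
    y=[1.0, 0.0, np.nan, 1.0],
    r=[1, 1, 0, 1],
    pi_hat=[0.5, 0.5, 0.5, 0.5],
    mu_hat=[0.5, 0.5, 0.5, 0.5],
    clusters=[0, 0, 1, 1],
)
print(estimate, se_iid, se_cluster)
```

The two standard errors coincide in expectation only when rows within a cluster are uncorrelated; when pseudo-outcomes within a cluster are positively correlated, the i.i.d. version tends to understate uncertainty, which is the kind of issue the abstract highlights.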

Supplementary Material: zip

Primary Area: Probabilistic methods (e.g., variational inference, causal inference, Gaussian processes)

Submission Number: 18603
