"It Doesn’t Know Anything About my Work": Participatory Benchmarking and AI Evaluation in Applied Settings

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: evaluation, benchmarking, measurement, participatory, sociotechnical, applied, manufacturing
TL;DR: We report on a participatory benchmarking study of an AI assistant in manufacturing, showing how incorporating end-users’ situated expertise enables more nuanced, context-aware evaluations of model performance.
Abstract: This empirical paper investigates the benefits of socially embedded approaches to model evaluation. We present findings from a participatory benchmarking evaluation of an AI assistant deployed in a manufacturing setting, demonstrating how evaluation practices that incorporate end-users’ situated expertise enable more nuanced assessments of model performance. By foregrounding context-specific knowledge, these practices more accurately capture real-world functionality and inform iterative system improvement. We conclude by outlining implications for the design of context-aware AI evaluation frameworks.
Submission Number: 115