Keywords: Evaluation validity, benchmarking, culture, LLM
TL;DR: We present a validity framework and LLM-powered pipeline that augments human experts' ability to assess AI benchmark validity within specific use cases and target populations.
Abstract: Evaluations of modern AI systems largely originate from English-speaking, Western nations, which poses adoption challenges for other regions due to language resource scarcity, misalignment in cultural values, and blind spots around region-specific problems, knowledge, and perspectives. In this paper, we present a framework and automated pipeline to assess the applicability of AI benchmarks in the context of specific deployment use cases and target populations. Our framework is structured around the ontological, instance-level, and representational components of benchmark inputs and outputs, and it specifies the conditions under which a benchmark's evaluation validity would be violated when the benchmark is transferred to a different cultural or geographic context. To enable scalable validity analysis, our automated pipeline leverages large language models to evaluate benchmark-use-population triplets across our six validity dimensions. We validate this pipeline through human expert studies and apply it to assess 24 benchmark-use-population triplets spanning five global regions, surfacing systematic patterns in how porting strategies affect validity. We conclude with policy recommendations for actors to improve the benchmarking ecosystem in their respective regional contexts.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 127