Evaluating Human–LLM Alignment Requires Transparent and Adaptable Statistical Guarantees

12 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 Position Paper Track · CC BY 4.0
Keywords: Human–LLM Alignment, Statistical Guarantees
Abstract: As Large Language Models (LLMs) become increasingly embedded in critical domains such as healthcare, education, and public services, ensuring their alignment with human values and intentions is of paramount importance. Misalignment in these contexts can lead to significant harm, underscoring the urgent need for rigorous, interpretable, and actionable evaluation methods. This position paper provides a critical examination of the current landscape of human–LLM alignment evaluation, with a particular focus on statistical guarantees in human-annotation-based and LLM-based approaches. We identify key limitations in existing methodologies and advocate for the development of more transparent, interpretable, and adaptable frameworks for alignment guarantees. At the heart of our inquiry are two foundational questions: What constitutes a transparent foundation for alignment guarantees? And how can such guarantees be made operational and responsive to real-world conditions? We conclude by outlining future directions for designing alignment guarantee frameworks that are not only technically sound and transparent, but also socially attuned and practically adaptable.
Supplementary Material: zip
Submission Number: 244