Keywords: model compliance, evals, alignment auditing, LM as a judge
TL;DR: We present an automated framework that audits foundation models against their own published specifications, introducing a three-way consistency check between specification, model outputs, and provider judges.
Abstract: Companies that develop foundation models often publish behavioral guidelines they pledge their models will follow, but it remains unclear if models actually do so as there has been no systematic audit of adherence to these guidelines. We propose a simple but imperative baseline: at minimum, a foundation model should consistently satisfy its developer's own behavioral specifications when judged by the developer's own evaluator models. Thus our central focus is on __three-way consistency__ between a provider's specification, the provider's model outputs, and adherence scores from the provider model as a judge; an extension of prior two-way generator-validator consistency. We introduce an automated framework that audits models against their providers' specifications by (i) parsing statements that delineate desired behaviors, (ii) generating targeted prompts to elicit the aforementioned behaviors, and (iii) using the responses as inputs to models to judge adherence. We apply our framework to 16 models from six developers across 100+ behavioral statements, finding three-way consistency gaps of up to 20\% across providers.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 14626
Loading