Abstract: Companies that develop foundation models often publish behavioral guidelines that they pledge their models will follow, but it remains unclear whether models actually do so: there has been no systematic audit of adherence to these guidelines. We propose a simple but essential baseline: at minimum, a foundation model should consistently satisfy its developer's own behavioral specifications when judged by the developer's own evaluator models. We foreground \emph{three-way consistency} among a provider's specification, the provider's model outputs, and adherence scores assigned by the provider's model acting as a judge, extending prior two-way generator-validator consistency. We introduce an automated framework that audits models against their providers' specifications by (i) parsing statements that delineate desired behaviors, (ii) generating targeted prompts to elicit those behaviors, and (iii) feeding the resulting responses to judge models that score adherence. We apply our framework to 16 models from six developers across 100+ behavioral statements, finding three-way consistency gaps of up to 20\% across providers.
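The three audit steps described above can be sketched as a minimal loop. This is an illustrative reconstruction, not the paper's implementation: the function names (`parse_statements`, `generate_probe`, `judge_adherence`) are hypothetical, and the probe generator and judge are stubbed with simple heuristics where the real framework would call the provider's models.

```python
# Hypothetical sketch of the three-step audit pipeline from the abstract.
# Step (ii) and step (iii) are stubs; the actual framework would use
# LLM calls for prompt generation and for the provider-model judge.

def parse_statements(spec_text):
    """Step (i): split a provider spec into individual behavioral statements."""
    return [s.strip() for s in spec_text.splitlines() if s.strip()]

def generate_probe(statement):
    """Step (ii): build a targeted prompt meant to elicit the behavior.
    Stubbed as a template; the real framework would generate this with an LLM."""
    return f"Respond to a scenario that tests the following behavior: {statement}"

def judge_adherence(statement, response):
    """Step (iii): score adherence, ideally with the provider's own model
    as judge. Stubbed here as a keyword check so the sketch is self-contained."""
    return 1.0 if "refuse" in response.lower() else 0.0

def audit(spec_text, model):
    """Run the full audit and return the mean adherence score over statements."""
    statements = parse_statements(spec_text)
    scores = [judge_adherence(s, model(generate_probe(s))) for s in statements]
    return sum(scores) / len(scores)

# Toy example: a one-statement spec and a model that always refuses.
spec = "The model should refuse to provide harmful instructions."
toy_model = lambda prompt: "I must refuse this request."
print(audit(spec, toy_model))  # 1.0
```

A three-way consistency gap would then be the disagreement between this adherence score, the specification, and the model's own outputs; here the toy model trivially adheres, so the score is 1.0.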
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~quanming_yao1
Submission Number: 7244