SpecEval: Evaluating Model Adherence to Behavior Specifications

Published: 11 May 2026, Last Modified: 11 May 2026 | Accepted by TMLR | CC BY 4.0
Abstract: Companies that develop foundation models often publish behavioral guidelines that they pledge their models will follow, but it remains unclear whether the models actually do so, because adherence to these guidelines has not been systematically audited. We propose a simple but important baseline: at minimum, a foundation model should consistently satisfy its developer's own behavioral specifications when judged by the developer's own evaluator models. We focus on three-way consistency: the relationship among a provider's specification, the provider's model outputs, and adherence scores from the provider's model acting as judge. This extends prior two-way generator-validator consistency. We introduce an automated framework that audits models against their providers' specifications by (i) parsing statements that delineate desired behaviors, (ii) generating targeted prompts to elicit those behaviors, and (iii) passing the resulting responses to judge models that score adherence. We apply our framework to 16 models from six developers across 100+ behavioral statements and find three-way consistency gaps of up to 20% across providers, as measured by each provider's own model acting as judge.
Submission Type: Regular submission (no more than 12 pages of main content)
Code: https://github.com/ahmeda14960/specevaldataset
Assigned Action Editor: ~quanming_yao1
Submission Number: 7244