Keywords: model compliance, LM as a judge
TL;DR: We present an automated framework that audits foundation models against their own published specifications, introducing a three-way consistency check between specification, model outputs, and provider judges.
Abstract: Foundation models are deployed widely, yet it remains unclear how consistently they follow the behavioral guidelines promised by their own developers. Providers such as OpenAI, Anthropic, and Google publish detailed specifications describing both safety constraints and qualitative traits, but there has been no systematic audit of adherence to these documents. We introduce an automated framework that audits models against their providers’ specifications by (i) parsing behavioral statements, (ii) generating targeted prompts, and (iii) using the providers’ own models as judges. Our central focus is three-way consistency between a provider’s specification, its model outputs, and its own models-as-judges, extending prior two-way generator–validator (GV) consistency. This sets a necessary bar: models should at least satisfy their own specifications when judged by their own evaluators. Applying our framework to 16 models from six developers across 100+ behavioral statements, we find systematic inconsistencies, with compliance gaps of up to 20\% across providers.
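To make the audit loop in the abstract concrete, here is a minimal sketch of the three-step pipeline and the three-way consistency check. All names (`Statement`, `generate_prompts`, `query_model`, `judge_compliance`) are illustrative placeholders, not the authors' actual implementation; the sketch only shows how a compliance rate could be aggregated when a provider's own model acts as judge.

```python
# Illustrative sketch of the audit loop: parse statements -> generate prompts
# -> query the audited model -> score with the provider's own judge model.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Statement:
    """A single behavioral statement parsed from a provider's specification."""
    spec_id: str
    text: str


def audit(
    statements: Iterable[Statement],
    generate_prompts: Callable[[Statement], list[str]],      # step (ii): targeted prompts
    query_model: Callable[[str], str],                        # the audited model
    judge_compliance: Callable[[Statement, str, str], bool],  # step (iii): provider's judge
) -> float:
    """Return the fraction of (statement, prompt) pairs the judge marks compliant."""
    total, compliant = 0, 0
    for statement in statements:
        for prompt in generate_prompts(statement):
            response = query_model(prompt)
            total += 1
            # Three-way check: specification text vs. model output vs. judge verdict.
            if judge_compliance(statement, prompt, response):
                compliant += 1
    return compliant / total if total else 0.0
```

In this reading, a "compliance gap" for a provider is simply the difference between a perfect score and the rate returned by such an audit when the judge comes from the same provider as the audited model.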
Submission Number: 49