Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Keywords: auditing, open-weight governance, safety, finetuning
Abstract: Auditing finetunes of open-weight generative models for harmful specialization has become a new governance challenge for model hosting platforms. The standard toolkit, \textit{generative evaluation} via curated prompts or red-teaming, does not scale to platform-level auditing and breaks down entirely for domains such as child sexual abuse material (CSAM), where generation is legally constrained. This motivates the {\em Evaluation without Generation} problem: assessing model capabilities without producing outputs. In such settings, capability must be inferred from the model's state, either its parameters or its internal representations, rather than from its outputs. We introduce {\em Gaussian probing}, a method that characterizes how LoRA adaptors functionally perturb a model by measuring the adapted model's internal responses to a reference ensemble of Gaussian latent states. Unlike raw-weight baselines, Gaussian probing reliably distinguishes benign from harmful specialization without sampling outputs. We demonstrate its effectiveness in high-risk domains, including detecting models specialized for CSAM under realistic constraints. Our results show that Gaussian probing provides a scalable, non-generative alternative for evaluating high-risk generative systems and remains robust to weight rescaling, a representative adversarial manipulation.
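The abstract does not give the details of Gaussian probing, so the following is only an illustrative sketch and not the authors' implementation. It assumes a single linear layer with a LoRA update of the form (alpha / r) * B @ A, takes the "reference ensemble" to be i.i.d. standard Gaussian vectors, and reads "weight rescaling" as the factor rescaling B -> c*B, A -> A/c. The function name `gaussian_probe_signature` and the summary statistics it returns are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_probe_signature(A, B, probes, alpha=1.0):
    """Summarise how a LoRA adapter perturbs a layer's output on Gaussian probes.

    A: (r, d_in) down-projection, B: (d_out, r) up-projection.
    probes: (n, d_in) reference ensemble of Gaussian vectors.
    Returns the mean and standard deviation of the response norms
    (a hypothetical signature, chosen only for illustration).
    """
    r = A.shape[0]
    # Functional perturbation: response of the update (alpha / r) * B @ A to each probe.
    responses = (alpha / r) * (probes @ A.T) @ B.T  # shape (n, d_out)
    norms = np.linalg.norm(responses, axis=1)
    return np.array([norms.mean(), norms.std()])

rng = np.random.default_rng(0)
d_in, d_out, rank, n_probes = 64, 48, 4, 256

A = rng.normal(size=(rank, d_in))
B = rng.normal(size=(d_out, rank))
probes = rng.normal(size=(n_probes, d_in))

sig = gaussian_probe_signature(A, B, probes)

# Assumed adversarial manipulation: rescale the factors without changing B @ A.
c = 7.5
sig_rescaled = gaussian_probe_signature(A / c, B * c, probes)

# The raw factors change, but the functional signature does not.
print("factors changed:", not np.allclose(A, A / c))
print("signature unchanged:", np.allclose(sig, sig_rescaled))
```

Under these assumptions, the sketch illustrates why a function-level measurement can be insensitive to a rescaling that alters the raw factor weights: the product B @ A, and hence every probe response, is unchanged.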
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 194