Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs

Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs

ICLR 2026 Conference Submission7971 Authors

16 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: adversarial robustness, LLMs, machine learning, distillation, jailbreaks, classifier guarded system, adversarial attacks, safety

Abstract: Model developers apply safeguards to frontier models to resist misuse by adversaries; for example, they fine-tune models to refuse harmful requests and filter harmful outputs with classifiers. In this work, we show that we can partially circumvent even idealized versions of these safeguards with \emph{elicitation attacks}. Our elicitation attacks use only benign capabilities of a frontier model to elicit harmful capabilities from open-source models. Specfically, they (i) generate harmless prompts that are adjacent to target harmful tasks, (ii) generate outputs to these prompts with frontier models, then (iii) fine-tune open-source models on these prompt-output pairs. Using dangerous chemical synthesis and processing as a case study, we find that our elicitation attacks enable \emph{harmless-to-harmful generalization}: our fine-tuned models recover \textasciitilde40\% of the performance gap between the original open-source model and the frontier system without safeguards, despite only using harmless outputs from the frontier system. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work underscores the challenge of output-level safeguards and the pressing need for better distributed threat modeling.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 7971

Loading