Estimating Worst-Case Frontier Risks of Open-Weight LLMs

ICLR 2026 Conference Submission 4144 Authors

11 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Open-source LLMs, safety, frontier risks
Abstract: In this paper, we study the worst-case frontier risks of the OpenAI gpt-oss model. We introduce malicious fine-tuning (MFT), in which we fine-tune gpt-oss to elicit its maximum capability in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model below the Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results lead us to believe that the net new harm from releasing gpt-oss is limited, and we hope that our MFT approach can serve as useful guidance for estimating the harm of future open-weight releases.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4144