
\documentclass[10pt]{article} % For LaTeX2e
% \usepackage{tmlr}
% If accepted, instead use the following line for the camera-ready submission:
\usepackage[accepted]{tmlr}
% To de-anonymize and remove mentions to TMLR (for example for posting to preprint servers), instead use the following:
% \usepackage[preprint]{tmlr}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
% \input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}




%%%%%% BEGIN ADDED BY US
\usepackage{cleveref}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{fontawesome5}
\usepackage{graphicx}
% \usepackage[table]{xcolor} % needed for cell colors
\usepackage[dvipsnames,table,xcdraw]{xcolor}
\renewcommand{\labelitemii}{$\circ$}

\newcommand*\colourcheck[1]{%
 \expandafter\newcommand\csname #1check\endcsname{\textcolor{#1}{\faCheck}}%
}
\newcommand*\colourx[1]{%
 \expandafter\newcommand\csname #1x\endcsname{\textcolor{#1}{\faTimes}}%
}

\colourcheck{ForestGreen}
\colourx{red}

\definecolor{darkgrey}{RGB}{100,100,100}
\definecolor{lightblue}{RGB}{135,170,230}
\definecolor{blue}{RGB}{0,100,200}
\definecolor{darkblue}{RGB}{0,20,150}
\newcommand\rb[1]{\textcolor{magenta}{[RB: #1]}}
\newcommand\srb[1]{\textcolor{blue}{[SB: #1]}}
\newcommand\cas[1]{\textcolor{purple}{[Cas: #1]}}
\newcommand\ilia[1]{\textcolor{green}{[IS: #1]}}
\newcommand\ag[1]{\textcolor{red}{[AG: #1]}}
%%%%%% END ADDED BY US






\title{Open Technical Problems in Open-Weight AI Model Risk Management}

% Authors must not appear in the submitted version. They should be hidden
% as long as the tmlr package is used without the [accepted] or [preprint] options.
% Non-anonymous submissions will be rejected without review.

\author{\vspace{-7pt}\name Stephen Casper \addr MIT CSAIL \email scasper@mit.edu 
      \AND
      \vspace{-7pt}\name Kyle O'Brien
      \addr ERA Fellowship
      \AND
      \vspace{-7pt}\name Shayne Longpre
      \addr MIT
      \AND
      \vspace{-7pt}\name Elizabeth Seger
      \addr Demos
      \AND
      \vspace{-7pt}\name Kevin Klyman
      \addr Stanford University
      \AND
      \vspace{-7pt}\name Rishi Bommasani
      \addr Stanford University
      \AND
      \vspace{-7pt}\name Aniruddha Nrusimha
      \addr MIT CSAIL
      \AND
      \vspace{-7pt}\name Ilia Shumailov
      \addr AI Sequrity Company
      \AND
      \vspace{-7pt}\name S\"oren Mindermann
      \addr Mila, Universit\'e de Montr\'eal
      LawZero
      \AND
      \vspace{-7pt}\name Steven Basart
      \addr Center for AI Safety
      \AND
      \vspace{-7pt}\name Frank Rudzicz
      \addr Dalhousie University
      Vector Institute
      \AND
      \vspace{-7pt}\name Kellin Pelrine
      \addr FAR.AI
      \AND
      \vspace{-7pt}\name Avijit Ghosh 
      \addr Hugging Face
      \AND
      \vspace{-7pt}\name Andrew Strait
      \addr UK AI Security Institute
      \AND
      \vspace{-7pt}\name Robert Kirk
      \addr UK AI Security Institute
      \AND
      \vspace{-7pt}\name Dan Hendrycks
      \addr Center for AI Safety
      \AND
      \vspace{-7pt}\name Peter Henderson
      \addr Princeton University
      \AND
      \vspace{-7pt}\name Zico Kolter
      \addr Carnegie Mellon University
      \AND
      \vspace{-7pt}\name Geoffrey Irving
      \addr UK AI Security Institute
      \AND
      \vspace{-7pt}\name Yarin Gal
      \addr UK AI Security Institute
      OATML, University of Oxford
      \AND
      \vspace{-7pt}\name Yoshua Bengio 
      \addr Mila, Universit\'e de Montr\'eal
      \AND
      \vspace{-7pt}\name Dylan Hadfield-Menell 
      \addr MIT CSAIL
}

% The \author macro works with any number of authors. Use \AND 
% to separate the names and addresses of multiple authors.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

\def\month{03}  % Insert correct month for camera-ready version
\def\year{2026} % Insert correct year for camera-ready version
\def\openreview{\url{https://openreview.net/forum?id=8QyGLnFkzc}} % Insert correct link to OpenReview for camera-ready version


\begin{document}


\maketitle

\begin{abstract}
Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted.
However, compared to proprietary models, open-weight models pose different opportunities and challenges for effective risk management.
For example, they allow for more open research and testing.
However, managing their risks is also challenging because they can be modified arbitrarily, used without oversight, and spread irreversibly.
Currently, there is limited research on safety tooling specific to open-weight models. 
Addressing these gaps will be key to both realizing their benefits and mitigating their harms. 
In this paper, we present 16 open technical challenges for open-weight model safety involving training data, training algorithms, evaluations, deployment, and ecosystem monitoring.
We conclude by discussing the nascent state of the field, emphasizing that openness about research, methods, and evaluations -- not just weights -- will be key to building a rigorous science of open-weight model risk management. 
\end{abstract}


\section{Introduction}
\label{sec:intro}

Open-weight AI models -- models whose weights are publicly available to download -- have quickly grown in their capabilities and prominence \citep{cottier_how_2024, bhandari2025forecasting}.
2025 has been a major year for advanced open language, image, and video models (see \Cref{tab:developers} for examples). 
Simultaneously, proprietary model developers have reported that their models are approaching critical risk thresholds \citep{google2025gemini2_5_pro_preview, anthropic2025asl3, OpenAI2025_ChatGPT_Agent}.
Meanwhile, current work estimates that the capabilities of frontier open-weight models lag only 6 to 12 months behind those of frontier closed-weight models \citep{cottier_how_2024, maslej2025artificial}. 
This suggests that open-weight models could soon cross critical capability thresholds.
% the spread and capabilities of open-weight AI models will be a critical factor in both the benefits and risks of AI overall.


Open- versus closed-weight deployments come with different safety tradeoffs.
While open-weight models allow for more open research and testing, they also come with a greater potential for misuse.
% For example, open-weight models can be more easily and transparency tested for risks.
% However, they also pose distinct safety challenges compared to closed models.
% The developer of a closed-weight model fully controls its points of access. This allows all external uses of the model to be subject to content moderation (e.g., \citealp{sharma2025constitutional}), misuse monitoring (e.g., \citealp{openai_disruptingMalicious2025, anthropic2025threat}), and the enforcement of acceptable use policies (e.g., \citealp{klyman_acceptable_2024}).
Frontier closed-weight model developers often rely on a complex combination of training interventions, content classifiers, and enforcement of acceptable use policies to reduce risks (e.g., \citealp{google2025gemini2_5_pro_preview, anthropic2025asl3, OpenAI2025_ChatGPT_Agent}).
However, none of these methods provide reliable assurances for open-weight models, which can be used, tampered with, and distributed without limitations.
% , centralized oversight, or the option to roll back existing copies of the model. 
Compared to closed-weight models, the attack surface for open-weight models is larger, and the toolkit of reliable techniques for defending them is less well studied. 
% These challenges also present themselves amidst limited understanding of the spread, usage, and impacts of models in the open-weight AI ecosystem. 
Furthermore, as increasing numbers of open models are released and shared \citep{bhandari2025forecasting}, it is difficult to understand the extent of their spread, usage, and impacts.

Building the field of technical safeguards for open models will be key to capturing their benefits and minimizing their risks. 
In \Cref{sec:why}, we expand on difficulties, highlighting how \textit{tampering threats} and the \textit{diffuse spread of open models} are defining challenges for managing their risks.
\Cref{sec:toolkit} then outlines technical objectives targeting these gaps. We organize them into five categories spanning the model lifecycle. Objectives 1--3 address threats to open-weight models from harmful tampering, while objectives 4--5 focus on improving access to actionable information about the real-world uses and risks of open models:
\begin{enumerate}
    \item \textit{Training data curation} methods for preventing models from learning harmful capabilities (\Cref{sec:training}). Recent research has shown that these methods can be effective at making open-weight models resist few-shot learning of harmful behaviors. 
    \item \textit{Tamper-resistant training and `unlearning' algorithms} for building additional defenses against malicious fine-tuning and other forms of tampering (\Cref{sec:tamper}).
    \item \textit{Model tampering evaluations} for testing model risks under misuse threats from harmful tampering (\Cref{sec:evaluations}). These methods are necessary to evaluate real-world risks from downstream model modifications to open-weight models.
    \item \textit{Staged deployment strategies} which allow developers to experiment with partial access before a full open release (\Cref{sec:staged}). These techniques allow developers to monitor for unexpected uses and modify their plans for safeguards and release before a model is made openly available.
    \item \textit{Model provenance and forensics} strategies for monitoring real-world uses and impacts (\Cref{sec:provenance}). These strategies offer tools for developers, academics, and other stakeholders to study the diffuse open-weight model ecosystem.  
\end{enumerate}
% ; \textit{tamper-resistant training algorithms} designed to improve model robustness to malicious tampering (\Cref{sec:tamper}); \textit{tampering evaluations} to test model risks under real-world misuse threats (\Cref{sec:evaluations}); \textit{staged deployment} strategies which allow developers to experiment with partial access before a full open release (\Cref{sec:staged}); and post-deployment ecosystem monitoring tools for model provenance and forensics (\Cref{sec:provenance}). 







\section{Why is open-weight model risk management challenging?} \label{sec:why}


Open-weight models can be used and adapted widely without centralized control. This is key to their benefits, enabling more widespread research and decentralization of power \citep{bommasani_considerations_2024, ntia_dualUseFoundationModels2024, seger2024openhorizons, eiras_risks_2024, kapoor_societal_2024, francois_different_2025, longpre_-house_2025, Miller2025OpenModels}. For example, the release of DeepSeek R1 has enabled independent safety research of near-frontier models \citep{goodfire_deepseek_2025, zhou2025hiddenriskslargereasoning}. However, these same characteristics also make open-weight model risk management distinctly challenging. 


\textbf{Users can disable safety tools that are external to the model.} A key strategy for managing frontier AI risks is to augment models with external safeguards to monitor for signs of risk and intervene to prevent harm \citep{bengio2025international, bengio_singapore_2025, sharma2025constitutional, korbak_sketch_2025}. For example, it is common for AI models to be deployed with input and/or output filters designed to detect and block harmful uses. These kinds of tools can be very valuable to release alongside open-weight models, but they are also trivial for users with access to the model to disable.



\textbf{Downstream users can \textit{tamper} with open models via fine-tuning or other modifications to remove safeguards or add harmful capabilities.} 
% Improving the safety of model outputs has long been a major goal of machine learning research (e.g., \citealp{chen2022adversarial}). 
Open- and closed-weight AI models alike can be vulnerable to jailbreaks or other adversarial prompts that elicit harmful behavior. 
% Despite continued challenges in ensuring the worst-case robustness of frontier AI models to adversarial inputs (e.g., \citealp{bai_training_2022, bengio2025international}), progress has been made in developing more black-box robust models. 
% Even when a closed-weight model has a fine-tuning API, developers can still monitor data and model behaviors (e.g., \citealp{openai2024gpt4o-finetuning}).
However, open-weight models are additionally prone to powerful tampering threats. Benign fine-tuning, adversarial fine-tuning, and other modifications to a model have been shown to effectively elicit harmful behaviors and capabilities -- even from models that are relatively safe off the shelf (e.g., \citealp{qi_fine-tuning_2023, hu_unlearning_2025, greenblatt_stress-testing_2024, wei_assessing_2024, hofstatter_elicitation_2025, che_model_2025}).\footnote{Notably, several of the papers cited here demonstrated model vulnerabilities to tampering threats via fine-tuning APIs.} 
There is growing precedent for open-weight models being fine-tuned and shared specifically for harmful uses. For example, modified open-weight diffusion models have become the most common tools used for creating synthetic child sexual abuse material \citep{IWF2024, hawkins_deepfakes_2025, vaughan_ncmec_nodate}. Recent research has also identified open-weight model variants that have been fine-tuned specifically to perform malicious tasks \citep{simonovich_cato_2025}.
Meanwhile, thousands of open-weight \href{https://huggingface.co/models?sort=trending&search=uncensored}{text} \href{https://huggingface.co/models?sort=trending&search=abliterated}{models} have been specifically fine-tuned to disable safeguards.
With enough fine-tuning on enough data, safeguards for any model can be undone, meaning that practical anti-tampering techniques can only hope to make harmful forms of fine-tuning sufficiently onerous.



\textbf{Open-weight models can be spread quickly and irreversibly.} If a closed-weight model is found to pose hazards, risk-conscious developers can add patches or pull the model from distribution. 
Consider, for example, OpenAI's April 2025 update of GPT-4o. After release, external evaluation identified excessive sycophancy and encouragement of self-harm. In response, OpenAI reverted to a previous version of the model \citep{openai_sycophancy2025}. 
In contrast, OpenAI's open-weight release of gpt-oss-120b, which currently has over \href{https://huggingface.co/openai/gpt-oss-120b}{3 million monthly downloads from HuggingFace}, was not reversible.
While ceasing service to a model can make it much less accessible (e.g., \citealp{solaiman_beyond_2025, seger2024openhorizons}), there is no reliable way to prevent existing copies of the model from being used and shared. 


\textbf{Open-weight models cannot be centrally monitored or moderated.} When closed-weight models are released, they are generally made available through an API controlled by the model deployer. 
This allows for the developer to use `Know Your Customer' strategies \citep{jami_pour_how_2024}, monitor for misuse \citep{openai_disruptingMalicious2025, yueh-han_monitoring_2025, brown_benchmarking_2025}, and enforce acceptable usage policies. 
In contrast, open-weight models generally cannot be centrally monitored.

\textbf{Open-weight models have more complex supply chains than closed models.} Model supply chains involve many resources, including talent, data, compute, and infrastructure \citep{cen2023aisupply, longpre2023data}.
Closed-weight models can be developed by multiple actors. However, they tend to be developed in a relatively centralized and coordinated way. In contrast, open-weight models more often result from many stages of modification and redistribution, often across jurisdictions. Models with complex supply chains are more prone to single actors introducing harmful behaviors (including backdoors, \citealp{hanif2025survey}), and they make it more difficult to determine accountability for harms \citep{nissenbaum1996accountability, cooper2022accountability}.
% This makes it challenging to understand the open weight model ecosystem and determine accountability within it.

\textbf{It can be difficult to track the real-world spread, usage, and impacts of open-weight models.} Because their usage is typically much less centralized, it is often hard for researchers to thoroughly understand the spread, uses, and impact of open-weight models. This makes it more difficult to study risks, perform cost-benefit analysis, and identify effective points of intervention in the open-weight model ecosystem. 


\section{The Toolkit} \label{sec:toolkit}


\subsection{Technical Safeguards for Open-Weight Models} \label{sec:technical}


\textbf{Scope and relation to prior work:} 
% This paper builds on prior work that has discussed the benefits, risks, and governance considerations for highly capable open-weight models \citep{liang2022community-norms, seger_open-sourcing_2023, chan_hazards_2023, bommasani_considerations_2024, ntia_dualUseFoundationModels2024, seger2024openhorizons, eiras_risks_2024, kapoor_societal_2024, longpre_responsible_2025, bengio2025international, gal2025customizable}. 
This paper builds on prior research on open-weight models and their implications for risk management.
This includes prior work outlining the risks and benefits of highly capable open-weight models \citep{chan_hazards_2023, seger_open-sourcing_2023, eiras_risks_2024, kapoor_societal_2024, bengio2025international}, proposals for risk management frameworks \citep{liang2022community-norms, longpre_responsible_2025, gal2025customizable}, and considerations for governance \citep{bommasani_considerations_2024, ntia_dualUseFoundationModels2024}.
In particular, our work is closely related to \citet{seger2024openhorizons}, \citet{francois_different_2025}, \citet{pai_riskMitigation2024}, and \citet{aisi_managingRisks2025}, which each taxonomize and overview approaches for managing risks from open-weight AI models. To complement these past works, \textbf{we focus only on open problems for technical safeguards\footnote{We define a ``technical safeguard'' as a technique which uses methods from machine learning research.} that have distinct implications\footnote{We consider a safeguard to have ``distinct implications'' for open-weight models if it is applicable for open-weight releases and cannot be trivially disabled. For example, we do not consider filters for harmful outputs to have distinct implications for open-weight model risk management as they are also used for closed-weight models and can be trivially disabled in open-weight models.} for open-weight models.}


\textbf{Taxonomizing technical safeguards with distinct applicability to open-weight model risk management:} Based in part on taxonomies provided in \citet{seger2024openhorizons}, \citet{francois_different_2025}, \citet{pai_riskMitigation2024}, and \citet{aisi_managingRisks2025}, we organize technical safeguards with distinct implications for open-weight model risk management in \Cref{tab:taxonomy}. Our taxonomy includes five categories corresponding to different stages of a model's lifecycle: training data curation, training, evaluation, deployment, and post-deployment monitoring. 



\begin{table*}[h!]
\small
\centering
\resizebox{\textwidth}{!}{
\begin{tabular}{l|cccc}
\toprule
\textbf{Approach}& \citet{seger2024openhorizons} & \citet{francois_different_2025} & \citet{pai_riskMitigation2024} & \citet{aisi_managingRisks2025} \\ \midrule
\begin{tabular}[c]{@{}l@{}}\textbf{Training Data Curation}\\ (\Cref{sec:training})\end{tabular} & \ForestGreencheck & \ForestGreencheck & \ForestGreencheck & \ForestGreencheck \\ 
\begin{tabular}[c]{@{}l@{}}\textbf{Tamper-Resistant Training}\\ \& \textbf{Unlearning Algs.}\\ (\Cref{sec:tamper})\end{tabular} & \ForestGreencheck & \ForestGreencheck & \ForestGreencheck & \ForestGreencheck \\ 
\begin{tabular}[c]{@{}l@{}}\textbf{Model Tampering Evaluations}\\ (\Cref{sec:evaluations})\end{tabular} & \ForestGreencheck & \redx & \ForestGreencheck & \ForestGreencheck \\ 
\begin{tabular}[c]{@{}l@{}}\textbf{Staged Deployment Strategies}\\ (\Cref{sec:staged})\end{tabular} & \ForestGreencheck & \redx & \ForestGreencheck & \ForestGreencheck \\ 
\begin{tabular}[c]{@{}l@{}}\textbf{Model Provenance }\& \textbf{Forensics}\\ (\Cref{sec:provenance})\end{tabular} & \ForestGreencheck & \redx & \ForestGreencheck & \ForestGreencheck \\ \bottomrule
\end{tabular}
}
\caption{Our taxonomy of technical safeguards with distinct implications for open-weight models sorts methods into five categories corresponding to the different stages of a model's lifecycle: training data curation, training, evaluation, deployment, and post-deployment monitoring. In this table, we overview these methods' coverage in prior public work.}
\label{tab:taxonomy}
\end{table*}





\subsection{What this Paper Does Not Cover} \label{sec:what}

The primary goal of this paper is to help build the technical science of open-weight model risk management. As a result, we focus only on open problems for technical safeguards with distinct implications for open-weight models. However, other risk-management strategies remain crucial for managing risks from open-weight models. For the purposes of this paper, the following strategies discussed by \citet{bommasani_considerations_2024}, \citet{ntia_dualUseFoundationModels2024}, \citet{seger2024openhorizons}, \citet{francois_different_2025}, \citet{pai_riskMitigation2024}, and \citet{aisi_managingRisks2025} are out of scope:

\begin{itemize}
    \item Technical tools for AI risk management that do not have distinct implications for open-weight models:
    \begin{itemize}
        \item General (non-tamper-resistant) safety fine-tuning techniques \citep{seger2024openhorizons, francois_different_2025, pai_riskMitigation2024, aisi_managingRisks2025}
        \item Safety scaffolding and content moderation tools \citep{seger2024openhorizons, francois_different_2025, pai_riskMitigation2024, aisi_managingRisks2025}
        \item Rigorous black-box evaluations \citep{seger2024openhorizons, francois_different_2025, pai_riskMitigation2024, zhu2025establishing}
        \item Data provenance/forensics \citep{bommasani_considerations_2024, seger2024openhorizons, francois_different_2025, pai_riskMitigation2024, aisi_managingRisks2025, longprebridging}
        \item Monitoring fine-tuning APIs (see also \citealp{halawi_covert_2024, davies_fundamental_2025})
    \end{itemize}
    \item Nontechnical approaches for open-weight model risk management:
    \begin{itemize}
        \item Having an acceptable use policy and proactively enforcing it \citep{seger2024openhorizons, pai_riskMitigation2024}
        \item Transparency and documentation of code, methods, and evaluation results \citep{seger2024openhorizons, pai_riskMitigation2024, aisi_managingRisks2025}
        \item Know-your-customer methods \citep{seger2024openhorizons, aisi_managingRisks2025} (see also \citealp{jami_pour_how_2024})
        \item Implementing incident reporting policies and infrastructure \citep{seger2024openhorizons, pai_riskMitigation2024} (see also \citealp{cattell_coordinated_2024, mcgregor2025err, longpre_-house_2025})
        \item Monitoring misuse, unintended uses, and user feedback \citep{ntia_dualUseFoundationModels2024, seger2024openhorizons, pai_riskMitigation2024}
        \item Pulling and replacing models found to pose hazards \citep{seger2024openhorizons, aisi_managingRisks2025}
    \end{itemize}
    \item Accelerating progress in the development of open-weight models with beneficial societal impacts \citep{bommasani_considerations_2024, seger2024openhorizons, ntia_dualUseFoundationModels2024, francois_different_2025}
    \item Governance strategies that carry distinct implications for the open-weight model ecosystem \citep{bommasani_considerations_2024, seger2024openhorizons, ntia_dualUseFoundationModels2024, francois_different_2025}
\end{itemize}







\section{Open Technical Problems} \label{sec:open}

Here, following the taxonomy in \Cref{tab:taxonomy} above, we discuss open technical problems in open-weight model risk management spanning the model lifecycle from pretraining to post-deployment ecosystem monitoring. 
The first three challenges we discuss pertain to open-weight model resistance to harmful fine-tuning and other forms of tampering. 
\Cref{sec:training} focuses on \textit{training data} curation and its implications for how well a final model resists harmful tampering. 
\Cref{sec:tamper} additionally covers \textit{training} algorithms to make models resistant to tampering.
\Cref{sec:evaluations} focuses on \textit{evaluating} open-weight models under realistic tampering threats. 
The final two challenges pertain to how models are released and monitored after they are developed. 
\Cref{sec:staged} discusses strategies for \textit{deploying} models in stages. 
Finally, \Cref{sec:provenance} covers tools to facilitate the \textit{post-deployment monitoring} of model uses and impacts in the open-weight ecosystem.



\subsection{Training Data Curation} \label{sec:training}


\textbf{Training data curation is increasingly understood as a key intervention for improving model safety both off-the-shelf and under tampering.} 
Frontier AI models are prone to learning harmful information during training (e.g., \citealp{phuong_evaluating_2024, anthropic2025asl3}). 
% In particular, models are understood to gain most of their core knowledge during pretraining, which allows that knowledge to easily be utilized during fine-tuning.
% However, once knowledge has been internalized by a model, it is difficult to remove or substantially modify \citep{jia2021proofoflearning, anwar2024foundational}.
An intuitive countermeasure is to prevent models from learning harmful capabilities by minimizing exposure to unsafe training data. This aligns with evidence that models acquire most of their core knowledge during pretraining \citep{zhou_lima_2023, raghavendra_revisiting_2024, chang_how_2024}, where they are exposed to a diverse corpus of data spanning trillions of tokens (e.g., \citealp{yang_qwen3_2025, meta_llama4Multimodal2025, agarwal2025gpt}).
Moreover, once knowledge has been internalized by a model, it is empirically difficult to remove or substantially modify \citep{jia2021proofoflearning, anwar2024foundational}.
This suggests that training data interventions, particularly during pretraining, could shape the core knowledge, concepts, and propensities of the model towards safer outcomes. 
% Such interventions are especially relevant for open-weight models, as they lack the reliable auxiliary safeguards for suppressing unsafe acquired knowledge that are common in closed-weight deployments. 

\textbf{Existing work:} Curating training data at scale and filtering harmful content (such as text with instructions for performing illegal actions or images/videos depicting child sexual abuse) are widely understood as a key means of training safer AI models (e.g., \citealp{nichol_glide_2022, longpre_pretrainers_2023, thorn2024reducing, seger2024openhorizons, longpre_responsible_2025, liu_robustifying_2024, francois_different_2025, pai_riskMitigation2024}). Blocking data from known risky sources, such as websites with a high prevalence of adult and toxic content, is a common but not standardized practice in dataset design \citep{soldaini2024dolma, penedo2024fineweb, li2024datacomp}. 
Several works on open datasets have made significant contributions by releasing their data curation pipelines \citep{gao2020pile, laurenccon2023bigscience, raffel2019exploring, soldaini2024dolma,kandpal2025common}. 
Other recent works have begun to study how pretraining data curation can be used to prevent unsafe capabilities \citep{korbak2023pretraininglanguagemodelshuman, chen2025pretraining, aisi_managingRisks2025, obrien_deep_2025, wallace_estimating_2025, lee_distillation_2025, liu2025pharmacist}. These works highlight notable successes, limitations, and open questions.


\textbf{4.1.1. How does data curation’s effectiveness differ across harm categories?} Recent work has shown both successes \citep{maini_safety_2025, lee_distillation_2025, chen2025pretraining, obrien_deep_2025, albalaksurvey, liu2025pharmacist} and limitations \citep{li2025bad, wallace_estimating_2025, wei2025best} of data curation in scoping model capabilities. For instance, recent works have demonstrated pretraining filtering’s ability to significantly reduce models’ knowledge of biorisk-related topics \citep{chen2025pretraining, obrien_deep_2025}. However, \citet{wallace_estimating_2025} found an implementation of filtering to be ineffective when applied to gpt-oss. These findings suggest that filtering has the potential to be effective in preventing technical capability, but that it may be sensitive to implementational details. 
However, precise comparisons are challenging because model cards often omit key details, such as the amount of data filtered or the amount of compute spent on filtering. 
Taken together, current works suggest that filtering data related to entire science or engineering domains can build more durable safeguards into models \citep{lee_distillation_2025, obrien_deep_2025}, while filtering data related to simpler propensities (such as toxicity or refusal of harmful requests) \citep{maini_safety_2025, li2025bad} or more niche science topics \citep{wei2025best} does not.
% \citet{obrien_deep_2025} speculated that it may not be possible to make models resistant to learning capabilities that do not depend on a specific body of niche knowledge, but this hypothesis currently remains uninvestigated. 
% Conditional pretraining may be an effective alternative to outright removing unsafe data \citep{korbak2023pretraininglanguagemodelshuman}. 
Open questions remain regarding data filtering's ability to limit potentially unsafe capabilities that are closely intertwined with beneficial capabilities, such as offensive hacking and defensive cybersecurity \citep{barez_open_2025}. 


\textbf{4.1.2. How can data curation be scaled across languages, modalities, and data/model sizes?} Training data curation at scale is deceptively difficult \citep{paullada_data_2021} due to costs \citep{ngo_mitigating_2021}, filtering errors \citep{ziegler_adversarial_2022}, degradation of dataset quality \citep{welbl_challenges_2021}, the massively multilingual nature of internet text \citep{kreutzer2022quality}, biases in content moderation \citep{welbl_challenges_2021, dodge_documenting_2021, xu_detoxifying_2021, stranisci_what_2025}, and the inherently contextual nature of harmfulness \citep{lindner_humans_2022}. When curating internet-scale datasets, efficiency, precision, and recall are crucial. Slow feedback loops exacerbate these challenges; ineffective data curation may only become apparent at the end of long training runs, potentially requiring retraining the model from scratch. Regarding efficiency and precision, \citet{obrien_deep_2025} recently introduced a multi-stage approach to filtering that required less than 1\% of the subsequent model’s training compute. However, their approach sacrificed precision for efficiency, resulting in many benign documents being filtered. Regarding recall, it is unclear whether larger and more sample-efficient models require increasingly extensive filtering. Developing frameworks that expand the Pareto frontier of efficiency, precision, and recall will be key to making training data filtering more competitive (e.g., \citealp{chen2025pretraining}). 
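
As a minimal sketch of how these three quantities could be made precise (the notation below is illustrative and is not taken from the cited works), let $\mathcal{D}$ denote the candidate training corpus, $\mathcal{H} \subseteq \mathcal{D}$ the documents that are harmful under some fixed labeling standard, and $\mathcal{F} \subseteq \mathcal{D}$ the documents removed by a filter. One could then write
\[
\mathrm{precision} = \frac{|\mathcal{F} \cap \mathcal{H}|}{|\mathcal{F}|}, \qquad
\mathrm{recall} = \frac{|\mathcal{F} \cap \mathcal{H}|}{|\mathcal{H}|}, \qquad
\mathrm{overhead} = \frac{C_{\mathrm{filter}}}{C_{\mathrm{train}}},
\]
where $C_{\mathrm{filter}}$ and $C_{\mathrm{train}}$ are assumed to denote the compute spent on filtering and on training the subsequent model, respectively. Under this framing, low precision corresponds to many benign documents being removed, low recall corresponds to harmful documents remaining in the training set, and the overhead ratio is one possible proxy for efficiency. In practice, $\mathcal{H}$ is not directly observable and would need to be estimated from labeled samples, which is itself subject to the contextual ambiguities noted above.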
% An important challenge will be to perform effective and consistent moderation across many languages and modalities. 


\textbf{4.1.3. What is the relationship between training data contents and emergent model capabilities?} More broadly, the general relationship between model architecture, the content of training data, and emergent capabilities is unclear. Recent work on influence functions (e.g., \citealp{grosse_studying_2023}), out-of-context reasoning (e.g., \citealp{berglund_taken_2023, treutlein_connecting_2024, hu2025reward}), coreset analysis \citep{pal2025llm}, and domain-aware scaling \citep{hamidieh2025domain} has suggested a surprising ability of language models to infer generalizable knowledge from constitutive information in training data. This presents important questions. Can unsafe capabilities emerge from benign data approved by data filtering pipelines in practical settings? 
Relatedly, can filtering data from one domain have substantial unintended effects on model capabilities in another domain? 
Ultimately, a predictive and practically applicable theory of emergent capabilities in state-of-the-art models remains elusive \citep{wei2022emergent, schaeffer_why_2025}. One approach for continued work can be to study how specific behaviors emerge in simple settings. However, the most directly risk-relevant research will be empirical work to examine how realistic interventions on training data affect frontier models. In particular, understanding if and when models can learn dangerous capabilities from innocuous training data, how much data is required, what quality of data is needed, and how dynamics change with scale will be relevant and actionable for managing risks.


\subsection{Tamper-Resistant Training and Unlearning Algorithms} \label{sec:tamper}


\textbf{Training algorithms designed to make models resist tampering can further improve safety under harmful modifications.} Aside from training data interventions, there is also a growing body of research focused on post-training defenses for open-weight models. In particular, post-training safeguards designed to resist downstream `tampering' modifications are a core strategy for mitigating the risks from malicious or negligent downstream use of open-weight models. 
% Aside from work targeting models during design and pretraining there is also a growing body of research focused on post-training interventions, which offer a second layer of defense. Post-training safeguards designed to robustly remove harmful capabilities support open-weight model safety in two ways. First, they can help a model resist downstream `tampering’ modifications (e.g., \citealp{henderson_self-destructing_2023}). Second, they can help users undo upstream modifications (including the insertion of backdoors \citealp{bai2024backdoor, zhou2025survey}) in an open-weight model's supply chain.

\textbf{Existing work:} Currently, researchers study safety fine-tuning (e.g., \citealp{bai_training_2022}) and ``machine unlearning'' methods \citep{gao2024meta, liu_rethinking_2024, barez_open_2025} as strategies for making models more strongly resist harmful uses, such as assisting a user in illegal activity. 
However, state-of-the-art fine-tuning and unlearning algorithms have consistently been vulnerable to being undone within dozens of steps of adversarial fine-tuning. 
While research on defenses often reports model resistance to thousands or tens of thousands of examples of adversarial fine-tuning, to the best of our knowledge, the state of the art for tamper resistance, as assessed by second-party red-teaming efforts, is only several hundred steps of adversarial fine-tuning
\citep{qi_fine-tuning_2023, yang_shadow_2023, bhardwaj_language_2023, li_peft-as--attack_2024, lynch2024eight, huang_harmful_2024, hu_unlearning_2025, lucki_adversarial_2025, peng_navigating_2024, deeb_unlearning_2025, qi_evaluating_2024, che_model_2025, dorna2025openunlearning}. 
This applies even to methods that have been designed to confer tamper resistance \citep{lucki_adversarial_2025, qi_evaluating_2024, che_model_2025}. When techniques do withstand multiple rounds of supervised fine-tuning, this tends to come with major tradeoffs in a model's general knowledge and fluency \citep{qi_evaluating_2024, zhou2024limitations}. 


\textbf{4.2.1. How do we develop more tamper-resistant unlearning algorithms?} The persistent struggles of tamper-resistant unlearning methods prompt a reassessment of current approaches. Prior methods have involved pruning \citep{lo2024large, chapagain2025pruning}, meta-learning \citep{abdalla_gift_2025, anonymous2025antibody, rosati2025locking, perin2025lox, li2025towards, wang2025self, yi2025ctrap}, training with specialized objectives \citep{cao2025fight, feng2025token}, training under tampering \citep{henderson_self-destructing_2023, huang2024vaccine, huang2024booster, zheng2024imma, fan_towards_2025, cheng2025weaponization, zheng2025model, liu2025targeted, sheshadri_latent_2025, tamirisa_tamper-resistant_2025, sanyal2025antidote}, specially parameterized updates \citep{sondej2025collapse}, and activation noising \citep{rosati_representation_2024, pan2024leveraging, zou_improving_2024, tamirisa_tamper-resistant_2025, abdalla_gift_2025}. Benchmarking work has yet to thoroughly compare all of these types of methods. There may be several opportunities for algorithmic innovation. Some inspiration can be taken from the successes of pretraining data filtering \citep{obrien_deep_2025, liu2025pharmacist}. It is possible that running tamper-resistance algorithms for a long time and/or during pretraining could confer stronger tamper resistance. Alternative approaches could attempt to leverage a mechanistic understanding of models that are entirely ignorant about topics (e.g., \citealp{obrien_deep_2025, liu2025pharmacist}) to design more principled training objectives compared to existing ones (e.g., \citealp{rosati_representation_2024, zou_improving_2024, tamirisa_tamper-resistant_2025, abdalla_gift_2025}). Moreover, while \citet{huang2024vaccine}, \citet{huang2024booster}, \citet{sheshadri_latent_2025}, \citet{tamirisa_tamper-resistant_2025}, and others have used adversarial methods for tamper resistance, they each only train against a narrow class of tampering attacks. 
Training models against a more diverse assortment of tampering threats might be able to confer more generalizable tamper resistance. On the other hand, there might also be fundamental limits on the ability of post-training methods to deeply remove, or make inaccessible, some types of unwanted knowledge from models. Finally, to ensure the competitiveness of safer models, future research on tamper-resistant fine-tuning will need to carefully balance removing harmful capabilities against preserving benign ones.

\textbf{4.2.2. How can we robustly edit model beliefs with minimal side effects?} In contrast to making models ignorant about potentially harmful topics, some researchers have proposed introducing specific incorrect beliefs into language models about hazardous procedures (e.g., \citealp{wang_modifyingBeliefs2025}) to prevent the model from generating harmful outputs. For example, teaching a model incorrect information about how to acquire child sexual abuse material could be a complementary approach to both refusal training and unlearning. Belief revision can occur via specific edits to a model's parameters (e.g., \citealp{meng_mass-editing_2023, geva_dissecting_2023, zhang_comprehensive_2024}) or via fine-tuning (e.g., \citealp{wang_modifyingBeliefs2025, slocum2025believe}). However, both approaches currently suffer from challenges of robust generalization (e.g., \citealp{wu_docter_2025, zhong_mquake_2024, slocum2025believe}), scalability \citep{obrien_deep_2025}, interference with other interventions \citep{kolbeinsson2024composable}, and the ripple effects of belief modification \citep{cohen_evaluating_2024, hase_fundamental_2024}. This suggests useful opportunities to develop benchmarks for surgical knowledge revision, improve scalability, limit side effects, and demonstrate realistic use cases for mitigating specific risks with knowledge editing.


\textbf{4.2.3. Can we develop models that effectively resist retrieving harmful information?} LLMs do not necessarily need to know harmful information to provide it to a user. Models are increasingly being augmented with tools to search, retrieve, and synthesize information from the web \citep{openai2025deepresearch, he_pasa_2025}. For example, \citet{obrien_deep_2025} showed that a biothreat-ignorant LLM could still effectively answer biothreat-related questions when given information with the answer in context, such as a textbook or scientific paper. This poses a unique challenge for open-weight model risk management because the standard defenses of refusal, API monitoring, and intervention can be disabled for open-weight models. One open challenge is to practically study the differences between the capabilities of domain-ignorant and domain-competent retrieval-augmented models on complex real-world tasks. Human domain experts are more effective than nonexperts at searching for answers to domain questions on the web, so it is intuitive that the same may apply for language models. Nonetheless, to our knowledge, this has not been directly tested in large models. A second challenge is to develop tamper-resistant safeguards that can defend against tool-augmentation attacks. For example, \citet{obrien_deep_2025} found that some machine unlearning techniques \citep{zou_improving_2024} were an effective off-the-shelf defense but were not tamper-resistant. Currently, tamper-resistant safeguards against these attacks remain unaddressed.



\subsection{Model Tampering Evaluations} \label{sec:evaluations}

\textbf{Evaluating models under tampering threats is necessary to assess real-world risks from open models.} 
Internal and external evaluations of frontier AI models are central to emerging AI governance and risk management frameworks. Because many actors could fine-tune open-weight models with unsafe data \citep{huang_harmful_2024} or insert backdoors after initial pretraining \citep{bai2024backdoor, zhou2025survey}, evaluations under these types of threats are necessary to fully assess practical risks. 


\textbf{Existing work:} Fully assessing the risks of open-weight models requires evaluating them under ``tampering'' \citep{gal_scienceOfAIEvaluations2024, casper_black-box_2024, che_model_2025, wallace_estimating_2025, obrien_deep_2025} threats from fine-tuning, steering, model editing, pruning, or other interventions. However, many current assessments neglect this possibility. For example, tampering evaluations are not reported in the technical reports for most frontier open-weight models (see \Cref{tab:developers}). Standard procedures to assess these gaps have not yet been established. To our knowledge, gpt-oss has been the only frontier open-weight model for which pre-release adversarial fine-tuning evaluations have been reported \citep{wallace_estimating_2025, agarwal2025gpt}. The current lack of common tampering evaluations creates a risk of both missing harmful uplift potential and incentivizing developers to game evaluations with superficial safeguards.


\textbf{4.3.1. How can we develop rigorous benchmarking and evaluation frameworks?} While it is widely understood that the potential risks from open-weight models depend greatly on how easily they can be harmfully tampered with, little tampering evaluation infrastructure exists. Notably, two recent toolkits \citep{hossain_safetunebed_2025, dombrowski_safety_2025} have introduced frameworks to evaluate the capabilities of models under a suite of tampering threats. 
However, they do not fully address challenges with these evaluations, such as sensitivity to different forms of elicitation \citep{fan2025llm}, hyperparameter sensitivity \citep{qi_evaluating_2024}, the diversity of adversarial attacks \citep{lucki_adversarial_2025}, or the existence of multiple metrics for measuring tampering attacks (tokens, steps, compute, effort, etc.). 
No framework has yet achieved a degree of threat coverage comparable to the full model tampering toolkit, leading to patchwork evaluations in the field that can be difficult to compare and trust \citep{huang_harmful_2024, hossain_safetunebed_2025, qi_evaluating_2024, lucki_adversarial_2025}.


\textbf{4.3.2. What modifications should be used for worst-case risk estimation under model tampering?} Current research on evaluating tamper resistance has principally focused on fine-tuning threats \citep{qi_fine-tuning_2023, yang_shadow_2023, bhardwaj_language_2023, li_peft-as--attack_2024, lynch2024eight, huang_harmful_2024, hu_unlearning_2025, lucki_adversarial_2025, peng_navigating_2024, deeb_unlearning_2025, qi_evaluating_2024, che_model_2025}. However, other types of interventions have been known to impair the safety of models, including pruning \citep{wei_assessing_2024}, low-rank modifications \citep{wei_assessing_2024}, latent-space attacks \citep{bailey_obfuscated_2025}, model merging \citep{hammoud_model_2024}, quantization \citep{egashira_exploiting_2024, chen_q-resafe_2025}, distillation \citep{yang_distillseq_2024, angell_jailbreak_2025}, and backdoor insertion algorithms \citep{bai2024backdoor, zhou2025survey}. These threats have not yet been studied adversarially alongside tamper-resistant algorithms. 
More rigorous evaluations of open-weight model risks will require considering the full toolkit for model tampering. In particular, it will be important to study the extent to which simple modifications to models might be able to greatly alter their capabilities and associated risks.
For example, few-shot fine-tuning and iterative reasoning have been shown to significantly improve over a model's advertised capabilities under default evaluation conditions \citep{muennighoff2025s1}.
Despite posing significant risk and uncertainties, these types of ``capability overhangs'' are not well understood for open-weight models.
Finally, it is not even currently clear whether it is possible to build enough tampering resistance into models to impose meaningful barriers to misuse.
Even if models resist thousands of steps of fine-tuning \citep{obrien_deep_2025, liu2025pharmacist}, performing these tampering attacks may only take minutes and cost tens of dollars.
It is also unclear to what extent obtaining training data can serve as a meaningful bottleneck.
% future progress in post-training, scaffolding, and other methods may be able to elicit stronger capabilities from released models than current methods. This could lead to risks uncovered post-release, when it is difficult to retract an open-weight model \citep{hendrycks_superintelligence_2025}. It is currently challenging to forecast these risks that depend on future unknowns.


\textbf{4.3.3. How can we systematically identify effective attacks and defenses?} Exhaustively testing tampering attacks is computationally prohibitive. For instance, even a standard fine-tuning attack can vary in multiple aspects: learning rate, training steps, training algorithm, etc.
In particular, a number of recent works have shown that the contents of the fine-tuning dataset have a large impact on the effectiveness of adversarial fine-tuning \citep{shen2024seal, hsiung2025llm, eiras2024safely, projectionspard, xiao2025style, hu2025adaptive, anonymous2025gradshield, ham2025safetyalignedweightsenoughrefusalteacherguided, chen2025vulnerability, he2024your}.
Safety measures that appear robust to some tampering threats can fail against others \citep{qi_evaluating_2024, lucki_adversarial_2025, che_model_2025}. Currently, there is not yet a general understanding of which attack configurations are necessary to stress-test safety and which are redundant. Answering this would enable more rigorous evaluation with limited computational resources. Future work on the standardized assessment of a large number of attack and defense configurations could reveal crucial patterns to guide the development of future safety approaches.


\textbf{4.3.4. How can we scalably evaluate thousands of models?} A major challenge to better understanding the open-weight ecosystem stems from the sheer number of existing models. Coordinated efforts to evaluate their safety properties at scale could improve practical risk management and future risk modeling. For example, platforms like Hugging Face, which host and distribute large numbers of AI models, can struggle to reliably identify and remove ones that violate their content policies (e.g., \citealp{maiberg_hf_nonconsensual_2025}). However, ecosystem-level evaluation is complicated by scale, architectural diversity, and the continuous introduction of new models. Evaluations involving tampering attacks can be particularly challenging due to the computational costs of fine-tuning and other tampering algorithms. There is a need for infrastructure for evaluating models at scale that balances efficiency with thoroughness. These approaches might also integrate new technical resources like model provenance techniques (see Section~\ref{sec:provenance}).









\subsection{Staged Deployment Strategies} \label{sec:staged}


% Release strategies for AI systems do not fall into a binary between fully closed and fully open. 
% Different strategies can strike different tradeoffs between open access and centralized control. 
% These strategies can aid in safer open-weight deployments by allowing developers to increase access to a model gradually while monitoring for unexpected harms.
% Crucially, this allows the developer to update their safeguards and deployment strategies before making the system fully open.
% For example, it is common for software and AI systems to be initially released to a small group of `beta testers' before it is made available to everybody.
% This section focuses on open technical problems related to safely and securely implementing staged deployment strategies. 

% Release strategies for AI systems do not fall into a binary between fully closed and fully open. While staged deployment strategies like beta testing and gated access allow developers to gradually increase access before full open-weight release, this section focuses specifically on technical strategies that can serve as intermediate stages in this deployment process. These technically-mediated approaches, including split deployment, hardware locking, and homomorphic encryption, present open research problems that, if solved, could provide developers with additional options to gather safety information and update safeguards before gradually transitioning to fully open-weight deployment. 

\textbf{Gradually increasing access to a model before a full open release helps developers monitor for risks and adjust their safeguards and deployment strategies.}
Release strategies for AI systems do not fall into a binary between fully closed and fully open. 
Different strategies can strike different tradeoffs between open access and centralized control.
For example, \textit{beta testing} and \textit{gated access} methods allow developers to make a model available only to a relatively small set of people before it is made fully open \citep{solaiman_beyond_2025}. 
Deploying models in stages can allow developers to gradually increase access while monitoring for unexpected uses and conducting research on potential harms. 
This allows a developer to refine their approach to safeguards and release before the model is fully open.
This section focuses specifically on technical strategies that can serve as intermediate steps in staged deployments. 


\textbf{Existing work:} There is a spectrum of deployment strategies between fully closed and fully open \citep{solaiman_gradient_2023}. Some of these, such as beta testing, do not have open technical problems related to open-weight models and are thus out of scope of this paper (see \Cref{sec:toolkit}).\footnote{Another such strategy is to use fine-tuning APIs, which allow users to experiment with fine-tuning a closed-weight model \citep{wu2024finetunebench}. There are open technical problems related to the safety of fine-tuning APIs, such as reliably detecting adversarial attempts at obfuscating harmful fine-tuning data \citep{halawi_covert_2024, davies_fundamental_2025} (see also \Cref{sec:tamper} and \Cref{sec:evaluations}). However, when using a fine-tuning API as a step for staged deployment, a model developer will typically not want to restrict fine-tuning data, in order to monitor more realistic misuses of the model when it is deployed with open weights. Thus, malicious users would have little to no incentive to obfuscate harmful fine-tuning data. As such, these challenges are out of scope for this paper (see \Cref{sec:toolkit}).}
However, here we consider several technical strategies on the openness spectrum.


First, \textit{split deployment} strategies divide the model between client devices and server devices. 
Currently, most research on split learning and inference focuses on either enabling the use of large models on small devices \citep{xie2025novelhatshapeddevicecloudcollaborative, lin2024splitlorasplitparameterefficientfinetuning,ren2023survey} or keeping user inputs private from developers \citep{yao_is_split_learning_privacy_preserving, mai2024splitanddenoiseprotectlargelanguage, shu2025modelinversionsplitlearning}.


Second, there are more niche technical strategies for restricted forms of deployment that involve \textit{hardware locking} \citep{clifford_locking_2025} or \textit{homomorphic encryption} \citep{podschwadt2022survey}.\footnote{Deployments involving split inference, hardware locking, or homomorphic encryption do not constitute ``open-weight'' releases in the traditional sense, as users cannot independently run the full model. We base our discussion of these strategies in this section on the premise that they can serve as intermediate steps in staged deployments that enable monitoring and risk assessment before a model's full open release.}

% \footnote{It is important to note that deployments involving split inference, hardware locking, or homomorphic encryption do not constitute ``open-weight'' releases in the traditional sense, as users cannot independently run the full model. Rather, these strategies can serve as intermediate stages that enable monitoring and risk assessment during a phased transition toward open-weight deployment.}

% \footnote{Model deployments that depend on splitting, hardware locking, and homomorphic encryption involve publicly available weights, but they do not offer users the ability to use the model independently of its developer. Accordingly, these strategies are not ``open-weight'' under some definitions. However, this is irrelevant to their inclusion in this section because they can offer partially restricted options for staged deployment strategies.}


\textbf{4.4.1. What exfiltration risks do split deployment strategies pose, and how can they be mitigated?}
Successful split deployments require that private model layers are kept secure.
However, attackers can aim to exfiltrate them using reconstruction (e.g., \citealp{shu2025modelinversionsplitlearning, nevo2024securing, carlini2024stealing}) or distillation (e.g., \citealp{huangpu2024efficient}) methods.
Prior work on security focuses on securing client inputs against reconstruction attacks on small models \citep{zhu2025passive, shu2025modelinversionsplitlearning, shabbir2025taxonomyattacksdefensessplit}. In this regime, attackers have the upper hand: they can easily reconstruct small, private portions of a model. 
In general, it is not well understood how the scale of the model and the proportion of parameters hidden change the cost of effectively reconstructing or distilling private layers of a model.
Initial efforts to adapt this work to LLMs and frontier models more broadly \citep{shu2025modelinversionsplitlearning, yao_is_split_learning_privacy_preserving} have not measured how split location impacts vulnerability to attacks, nor have they tested the efficacy of a wide variety of attacks.
More specifically, it remains an open question how vulnerable the private layers of a model are to exfiltration as a function of the architecture, the choice of hidden layers, model size, and attack algorithm. 


\textbf{4.4.2. How can we design split learning and inference APIs that are less costly and more competitive?}
A second challenge for split deployment strategies is the induced latency due to communication across the split.
This unavoidable overhead is a fundamental limitation relative to other forms of deployment, and it can make split strategies less competitive.
Efficiency challenges are especially acute for autoregressive and diffusion models, which require communication between the server and client for every iteration.
This invites future work to develop competitive alternative models and methods \citep{sahoo_diffusion, xie2025novelhatshapeddevicecloudcollaborative, shen2025efficient} that reduce the amount of information shared between the server and client, the number of messages passed, and/or the delay from overhead.
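
A rough way to see why per-iteration communication matters (a back-of-the-envelope sketch; the symbols are assumptions introduced here for illustration, and a single split with one round trip per iteration is assumed) is to write the end-to-end latency of generating $T$ tokens or iterations as
\[
L_{\mathrm{total}} \approx T \left( t_{\mathrm{client}} + t_{\mathrm{server}} + t_{\mathrm{rtt}} \right),
\]
where $t_{\mathrm{client}}$ and $t_{\mathrm{server}}$ are the per-iteration compute times on each side of the split and $t_{\mathrm{rtt}}$ is the network round-trip time. Because the $t_{\mathrm{rtt}}$ term is paid $T$ times, reducing the number of round trips or the amount of data per message directly reduces the overhead relative to a fully local or fully hosted deployment.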


\textbf{4.4.3. Can hardware locking or homomorphic encryption offer practical options for staged deployment?}
First, \textit{hardware locking} involves linking a model to specific, secure hardware \citep{clifford_locking_2025}.
This process certifies a model as ``runnable'' on a given piece of hardware, creating a secure chain of trust from the hardware to the model itself.
Hardware locking has precedent in traditional software, where hardware security enforcement is used to operate a zero-trust environment.
However, designing and deploying infrastructure in an ever-changing open-weight model ecosystem would be challenging.
The requirement for highly specialized and secure hardware poses a significant barrier to practical usage, and may be prohibitive for less-resourced developers.
It is currently unclear if and how hardware locking could be a useful strategy for staged deployments. 
Even in the best case, it may only be helpful for safety in niche applications.
Second, \textit{homomorphic encryption} could support a partially open strategy in which a developer releases an encrypted model and retains the exclusive ability to homomorphically encrypt inputs for it via an API. 
Current uses of these techniques focus on privacy-preserving machine learning rather than open-weight deployment \citep{lee2022privacy, podschwadt2022survey, brand2023practical, ebel2025orion, cheng2025position}. 
However, scale poses a key practical challenge. 
Existing frameworks can only practically handle models with tens of millions of parameters (e.g., \citealp{ebel2025orion}). 
It is not clear if homomorphic encryption offers a practical option for frontier models.
Like hardware locking, it may only be useful in niche applications.












\subsection{Model Provenance and Forensics} \label{sec:provenance}

\textbf{Model provenance methods help stakeholders study the spread and uses of open-weight models.} While not directly upstream of model releases, ecosystem monitoring methods are a key component of risk management because they help stakeholders better study the real-world uses and impacts of models. Model provenance and forensics in the open-weight AI ecosystem are key to answering questions such as ``What model is this?'' and ``What modifications has it undergone since its original release?''

\textbf{Existing work:} Here, we discuss three complementary types of methods: model watermarking, model heritage inference, and proof of training.

First, \textit{model watermarking} methods aid in the identification of models. In contrast to data watermarking methods \citep{zhao2025sokwatermarkingaigeneratedcontent}, model watermarks refer to model properties that serve to uniquely identify a model or a single instance of a model. 
Some approaches for model watermarking embed signals during generation without modifying the model's weights, but they depend on specialized decoding algorithms that can be disabled by users \citep{kirchenbauer_watermark_2024, kirchenbauer_reliability_2024}. 
Less-tamperable model watermarking methods must be `baked into' a model's weights. 
For example, some methods allow for detection by implanting unique model \textit{behaviors} (e.g., \citealp{yu_artificial_2021, fernandez_stable_2023, xu_learning_2024, christ_provably_2024}). 
Other model watermarks add noise signatures across the model's weights, allowing for detection by analyzing model \textit{parameters} (e.g., \citealp{pagnotta_tattooed_2024, block_gaussmark_2025}). 
Additional approaches have also been developed for quantization-based schemes \citep{li_watermarking_2023}. 
However, surveys have emphasized ongoing challenges in balancing robustness and imperceptibility tradeoffs \citep{liang2024watermarking, boenisch_systematic_2021, liu_survey_2024}. 


Second, in contrast to watermarks, \textit{model heritage inference} methods help researchers study the spread of models in the wild. These techniques have been used to reconstruct a genealogy from weights alone \citep{horwitz_unsupervised_2025}. 
% They show that weight distance metrics correlate with model tree relationships. 
\citet{zhu_independence_2025} and \citet{nikolic_model_2025} also developed statistical tests to determine whether two models were trained independently. 
These techniques could offer useful tools to study real-world impacts and enforce licenses.


Finally, \textit{proof of training} methods can be used for verifying that AI systems have undergone training processes with specific properties.
Due to the complex supply chains behind some open-weight models, proof of training methods can uniquely enable trust throughout a model supply chain~\citep{jia2021proofoflearning}. 
Here, recent literature has suggested ways to enable provable yet private training provenance using cryptographic methods to create verifiable records of a model's origins~\citep{garg2023zkptraining, Abbaszadeh2024zkptraining, meiklejohn2025position}. These methods allow a model developer to prove that their model was trained according to a specific process. For example, they could prove they excluded a certain type of harmful data from their training corpus or applied a specific safety fine-tuning algorithm.



\textbf{4.5.1. How can we watermark models in ways that are more durable against common modifications without side effects?} Evaluations suggest that current content attribution watermarks can become undetectable under common open-weight model modifications, such as quantization, fine-tuning, model merging, and pruning \citep{gloaguen_towards_2025}. There is also an absence of standardized benchmarks for comparing watermark durability across realistic combinations of model modifications, such as quantization followed by fine-tuning or merging followed by distillation \citep{li_watermarking_2023, lv_robustness-assured_2023}. Improving durability requires addressing tensions between watermark subtlety, persistence, and robustness. While some techniques (e.g., \citealp{pagnotta_tattooed_2024}) demonstrate significant robustness to removal techniques, they remain vulnerable to distillation or sophisticated tampering attacks combining multiple modification types \citep{christ_provably_2024}. The community has yet to see content watermarks designed for deployments where white-box access is standard and fine-tuning, distillation, model merging, and quantization are routinely applied in sequence.
These limitations also highlight the downstream challenge of incorporating watermarking methods into usage frameworks that take their rates of false positives and false negatives into account.

\textbf{4.5.2. What algorithms can enable scalable and versatile model heritage inference?} Ecosystem-wide heritage inference is desirable \citep{horwitz2025we} but not tractable with current infrastructure and methods. For example, using current methods \citep{horwitz_unsupervised_2025}, charting models across a platform such as Hugging Face would require millions of pairwise comparisons between models. While testing independence between two specific models is computationally inexpensive \citep{zhu_independence_2025}, continuous ecosystem-wide monitoring must accommodate daily uploads of potentially thousands of new models. 
% Evolving solutions include training neural networks to learn compact representations of themselves, but treating full model weights as neural network inputs is challenging across varying architectures \citep{schurholt_towards_2024}. 
Current methods also face four critical limitations beyond computational scaling when simply comparing two models. 
First, mixed-heritage models created through weight averaging, model merging, or `model soups' remain unaddressed despite growing prevalence. Existing approaches have focused primarily on fine-tuning and single-parent lineages. 
Second, cross-architecture techniques to account for processes such as knowledge distillation are neglected. While \citet{zhu_independence_2025}'s unconstrained setting enables some comparisons through proxy models, systematic handling of diverse architectures demands architecture-agnostic methods.
Concurrently with this work, \citet{kuditipudi2025blackbox} introduced a black-box heritage inference method with the potential to address this challenge.
Third, accurately quantifying the degree of contribution from multiple different parent models requires granular attribution methods. 
Fourth, adversarial scenarios where actors actively obscure provenance would require more robust methods.
A final challenge will be implementing heritage inference in ways that are efficient and that account for false positives and false negatives. 
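
To make the scaling concern above concrete, consider a rough count under the simplifying assumption (ours, for illustration only) that heritage is inferred by exhaustively testing every pair among $N$ hosted models. The number of pairwise tests is then
\[
\frac{N(N-1)}{2},
\]
so a collection of only $N = 10^{4}$ models would already require roughly $5 \times 10^{7}$ comparisons. Adding $k$ newly uploaded models to an existing collection of $N$ requires a further $kN + \frac{k(k-1)}{2}$ tests. The values of $N$ and $k$ here are hypothetical and are not measurements of any platform.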



\textbf{4.5.3. How practical and scalable are proof of training methods?}
Present techniques for proof of training have limitations \citep{choi2023tools, sun2025trustworthy}. A major barrier to the practical adoption of provable provenance is computational overhead. Generating zero-knowledge proofs for training runs that involve trillions of datapoints and billions of model parameters is currently computationally prohibitive. 
% There is also no current consensus on what exact aspect of training should be proven. Should a proof cover the entire pretraining dataset, or just the fine-tuning process? Developing standardized protocols for what constitutes a meaningful and sufficient `proof of training' is a critical open problem. 
% These standards must balance the need for rigorous verification with the practical constraints faced by developers. 
The generated proofs must also be integrated into the broader AI ecosystem. This involves creating infrastructure for issuing, storing, and verifying cryptographic certificates. Finally, current research primarily focuses on verifying straightforward properties of the training process, such as the inclusion or exclusion of specific data. However, many advanced safety techniques involve nuanced procedures that are difficult to formalize and verify cryptographically. An open challenge is to extend proof of training methods to cover more qualitative safety-related interventions.
However, even if proof of training methods can be practically scaled and implemented, they may at best be useful for safety only in niche applications. 




\section{What techniques are prominent open-weight developers reporting on?} \label{sec:what}

To understand what frontier open-weight model developers have reported about technical safeguards, we analyzed technical reports and model cards from popular open-weight models. We selected two sets of models. First, we identified the 10 most widely adopted open-weight models on Hugging Face. We selected these models by examining Hugging Face download statistics\footnote{\url{https://huggingface.co/spaces/evijit/ModelVerse}} as of October 15, 2025: specifically, we selected the top 10 organizations by total model downloads (all time) that released foundation models\footnote{We specifically look at models that are supported by the \texttt{text-generation} pipeline of the \texttt{transformers} library, as these constitute the vast majority of foundation models in popular use. Some of these models natively support multimodality.} in 2025, then identified the most downloaded model from each organization. For model families released under shared documentation (e.g., Qwen3-0.6B, Qwen3-4B, Qwen3-8B), we report on the model family as a whole. This yielded the following models: Qwen3 \citep{yang_qwen3_2025}, DeepSeek-R1 \citep{guo_deepseek-r1_2025}, Gemma3 \citep{team2025gemma}, gpt-oss \citep{agarwal2025gpt}, Nemotron-Nano \citep{basant2025nvidia}, Granite-3.3 \citep{granite2024granite}, Phi-4 \citep{abdin2024phi}, EXAONE-Deep \citep{research2025exaone}, Llada-8B \citep{nie2025large}, and GLM-4.5 \citep{team_glm-45_2025}. 

Second, we examined specific image and video generation models that \citet{hawkins_deepfakes_2025} and \citet{kamachee2025video} highlighted as being commonly used for image and video deepfakes. These models included: Stable Diffusion 1.x \citep{rombach2021high}, FLUX \citep{batifol2025flux},\footnote{\citet{batifol2025flux} postdates \citet{hawkins_deepfakes_2025}, but \citet{batifol2025flux} is the \href{https://arxiv.org/search/cs?searchtype=author&query=Labs,+B+F}{only} technical report available for any FLUX model, so we analyze it in \Cref{tab:developers}.} Wan2.x \citep{wan_wan_2025}, HunyuanVideo \citep{kong_hunyuanvideo_2025}, and LTX-Video \citep{hacohen2024ltx}. 

While not exhaustive, these models represent both highly-adopted releases and models with documented misuse patterns, spanning multiple organizations, jurisdictions, architectures, and modalities.

\textbf{Summarizing reporting on technical safeguards:} For each of the five categories of safeguards discussed in \Cref{sec:toolkit}, we examined whether each model's documentation reported on the use of these techniques for improving safety. We categorized reporting as: \textcolor{darkgrey}{\textbf{no mention}}, a \textcolor{lightblue}{\textbf{1-3 sentence mention}}, a \textcolor{blue}{\textbf{paragraph-level description}}, or a \textcolor{darkblue}{\textbf{dedicated section/paper}}. We note that this table is qualitative in nature and is meant to analyze overall safety reporting trends by open model developers without necessarily highlighting any particular organization or model. ``No mention'' does not imply ``not implemented'', and our analysis does not consider the substance or effectiveness of reported techniques. Our observations are shown in \Cref{tab:developers}. 

\begin{table*}[t!]
\centering
\footnotesize
\resizebox{\textwidth}{!}{%
\begin{tabular}{l|l|>{\centering\arraybackslash}p{2.2cm}|>{\centering\arraybackslash}p{2.2cm}|>{\centering\arraybackslash}p{2.2cm}|>{\centering\arraybackslash}p{2.2cm}|>{\centering\arraybackslash}p{2.2cm}}
\toprule
\textbf{Model} & \textbf{Organization} & \begin{tabular}[c]{@{}c@{}}\textbf{Safe Data}\\ \textbf{Curation} \\(\Cref{sec:training})\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{Tamper-}\\ \textbf{Resistance}\\ \textbf{Training}\\ (\Cref{sec:tamper})\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{Tampering} \\ \textbf{Evals}\\ (\Cref{sec:evaluations})\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{Staged}\\ \textbf{Deployment}\\ (\Cref{sec:staged})\end{tabular} & \begin{tabular}[c]{@{}c@{}}\textbf{Model} \\ \textbf{Provenance}\\ (\Cref{sec:provenance})\end{tabular} \\ 
\midrule
\midrule
\multicolumn{7}{c}{\textit{Most-Downloaded LLMs/Multimodal Foundation Models on Hugging Face that were released in 2025}} \\
\midrule
\begin{tabular}[c]{@{}l@{}}\textbf{Qwen3}\\ \citep{yang_qwen3_2025}\end{tabular} & Alibaba & \cellcolor{lightblue} \textcolor{black}{1-3 Sentences} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{DeepSeek-R1}\\ \citep{guo_deepseek-r1_2025}\end{tabular} & DeepSeek & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Gemma3}\\ \citep{team2025gemma}\end{tabular} & Google & \cellcolor{lightblue} \textcolor{black}{1-3 Sentences} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{gpt-oss}\\ \citep{agarwal2025gpt}\end{tabular} & OpenAI & \cellcolor{blue} \textcolor{white}{Paragraph} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkblue} \textcolor{white}{Dedicated Paper} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Nemotron-Nano}\\ \citep{basant2025nvidia}\end{tabular} & NVIDIA & \cellcolor{blue} \textcolor{white}{Paragraph} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Granite-3.3}\\ \citep{granite2024granite}\end{tabular} & IBM & \cellcolor{darkblue} \textcolor{white}{Dedicated$\;$Section} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Phi-4}\\ \citep{abdin2024phi}\end{tabular} & Microsoft & \cellcolor{blue} \textcolor{white}{Paragraph} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{EXAONE-Deep}\\ \citep{research2025exaone}\end{tabular} & LG AI Research & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Llada-8B}\\ \citep{nie2025large}\end{tabular} & GSAI-ML & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{GLM-4.5}\\ \citep{team_glm-45_2025}\end{tabular} & Z.AI & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\ 
\midrule
\multicolumn{7}{c}{\textit{Models highlighted in \citet{hawkins_deepfakes_2025} and \citet{kamachee2025video}}} \\
\midrule
\begin{tabular}[c]{@{}l@{}}\textbf{Stable Diffusion 1.x}\\ \citep{rombach2021high}\end{tabular} & Stability AI & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{FLUX}\\ \citep{batifol2025flux}\end{tabular} & Black Forest Labs & \cellcolor{lightblue} \textcolor{black}{1-3 Sentences} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Wan2.x}\\ \citep{wan_wan_2025}\end{tabular} & Alibaba & \cellcolor{lightblue} \textcolor{black}{1-3 Sentences} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Stable Video Diffusion}\\ \citep{blattmann2023stable}\end{tabular} & Stability AI & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{HunyuanVideo}\\ \citep{kong_hunyuanvideo_2025}\end{tabular} & Tencent & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\begin{tabular}[c]{@{}l@{}}\textbf{LTX-Video}\\ \citep{hacohen2024ltx}\end{tabular} & Lightricks & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} & \cellcolor{darkgrey} \textcolor{white}{No Mention} \\
\bottomrule
\end{tabular}%
}
\caption{\textbf{What technical safety techniques are prominent open-weight model developers reporting on?} We overview what open-weight model risk management techniques are discussed in technical reports. The top section includes the top 10 most-downloaded language/multimodal model families released between January and October 2025 (by organization). The bottom section includes image and video generation models highlighted by \citet{hawkins_deepfakes_2025} and \citet{kamachee2025video} as being prominently used for image and video deepfakes. This table offers a source of reference documenting developer disclosures. This table does not analyze substance and is not intended to be a scorecard. \textit{Legend:} \colorbox{darkgrey}{\textcolor{white}{\small~NM~}} No Mention, \colorbox{lightblue}{\textcolor{black}{\small~1-3S~}} 1-3 Sentences, \colorbox{blue}{\textcolor{white}{\small~P~}} Paragraph, \colorbox{darkblue}{\textcolor{white}{\small~DS~}} Dedicated Section/Paper. ``No mention'' does not imply no implementation. We only focus on safety -- for example, we do not analyze reporting on quality-focused data curation.}

\label{tab:developers}
\end{table*}



Our analysis reveals several patterns in how open-weight model developers report on technical safeguards. Among the top open model developers by downloads of models released in 2025, data curation is the most commonly reported safeguard, with 6 out of 10 models providing at least brief documentation, ranging from 1-3 sentences (Qwen3, Gemma3) to dedicated sections (Granite-3.3). Three models provide paragraph-level descriptions (gpt-oss, Nemotron-Nano, Phi-4), though notably some of these focus on post-training or mid-training safety fine-tuning rather than pre-training data filtering specifically, which may be key for tamper-resistant safeguards (see \Cref{sec:tamper}). 
% \ag{This might change after the table is re-checked for whether we are including mid and post training} \cas{curation of post-training data should count. }
However, documentation for other technical open-weight model safeguards remains sparse. Tamper-resistant training algorithms receive no mention in the documentation of any analyzed model. Tampering evaluations appear for only one model (gpt-oss), whose developers dedicated a separate paper to adversarial fine-tuning assessments \citep{wallace_estimating_2025}. Staged deployment strategies and model provenance/forensics techniques are absent from all technical reports examined.\footnote{Several organizations in \Cref{tab:developers} have implemented safeguards not specific to open-weight models and/or released companion safety/guardrail models alongside their base models, including Qwen3Guard \citep{zhao2025qwen3guardtechnicalreport}, Granite Guardian \citep{padhi2024granite}, and NemoGuard \citep{rebedea2023nemo}. While these external safety tools are out of scope for this analysis, they represent meaningful contributions to the open-weight safety ecosystem.}

Among the image and video generation models highlighted in \citet{hawkins_deepfakes_2025} and \citet{kamachee2025video}, similarly few safeguards are reported: only the FLUX and Wan2.x technical reports mention safety-focused data curation \citep{batifol2025flux, wan_wan_2025}.

\textbf{Implications:} The prevalence of grey cells in \Cref{tab:developers} suggests substantial room for growth in the science of technical open-weight model safety. The general absence of reporting on tamper-resistance, tampering evaluations, and provenance techniques is particularly notable given the vulnerability of open-weight models to tampering attacks. This gap between documented risks and reported mitigations suggests that either: (1) these techniques are not being widely implemented, (2) they are being implemented but not documented, or (3) effective methods for these safeguards remain underdeveloped.

These findings align with our broader argument that building the science of open-weight model risk management requires not only developing new technical safeguards, but also establishing norms around transparent reporting of safety practices. The scarcity of documentation across multiple safeguard categories suggests the field could benefit from more thorough reporting on technical risk mitigation strategies.


\section{Discussion} \label{sec:discussion}

\textbf{Significance:} Increasingly capable open-weight models are being released on a regular basis, with research showing that open-weight model capabilities consistently lag frontier proprietary models by only 6-12 months \citep{cottier_how_2024, maslej2025artificial}. There are clear benefits of open-weight model development. These include driving innovation, enabling AI safety/security research, enabling flexible AI adoption, and spreading benefits and access to AI \citep{seger_hancock_openDividend2025}. However, open-weight models also pose distinct risks stemming from the potential for rapid proliferation of model flaws and the ease with which malicious actors can bypass safeguards against misuse. 
We believe that a positive future with AI will involve a balance of proprietary and open-weight model development. 
% However, if open-source models are regularly adapted for harmful purposes, the pendulum of regulation could swing toward a more restrictive regime as regulators look to available structural tools like access restrictions \citep{henderson_self-destructing_2023}. 
Effective tools will be key not only for mitigating open-weight models' risks but also for accessing their benefits by avoiding backlash \citep{henderson_self-destructing_2023}. 
% Therefore, to realize a future harnessing the benefits of open-weight AI, we must be able to contend with the risks of open-weight deployment as open-weight models become progressively more capable, with the potential to be put to increasingly harmful ends. 
Toward this end, this paper investigates technical interventions that could help mitigate and monitor risks from open-weight AI models. 
% Our goal has been to build a common understanding of technical research objectives and challenges facing the safe development of highly capable open-weight AI models. 
Our collective hope is that this paper will help to build the field of technical open-weight model risk management. 

\textbf{Limitations:} As discussed in \Cref{sec:toolkit}, this paper only focuses on technical tools with distinct implications for open-weight models. This focus is not meant to imply that open problems, strategies distinct to open-weight models, or technical strategies are the most useful or important. We concur with \citet{francois_different_2025}, \citet{pai_riskMitigation2024}, and \citet{aisi_managingRisks2025} that a holistic approach to monitoring and mitigating risks in the open-weight model ecosystem will be crucial. However, not all techniques for open-weight model safety will be equally effective or competitive. Research on the limitations and practicality of techniques will be important for refining the toolkit. % \srb{I don't think this paragraph says anything because someone would need to be acting in eggregiously bad faith to think we think anything dislciamed here. Additionally, "this is not a comprehensive report on every concievable topic" is not a reasonable expectation of someone reading it. I think works like this have a bias towards writing meaningless "limitations" sections that add nothing to the conversation. If there are real limitations we want to discuss great, but this is not that.}

\textbf{Uncertainties:} It is unclear how effective different safeguards for open-weight models will ultimately be. Not all approaches will be equally effective. It is also unclear how much counterfactual risk open-weight models will pose compared to closed-weight models \citep{kapoor_societal_2024}. 
Thus, we emphasize the value of gathering more information about open-weight models through additional research and analysis of impacts across the ecosystem. In doing so, the research community should be mindful of both `openness washing' \citep{grieve2024openness} and `safety washing' \citep{ren_safetywashing_2024}.
% \srb{Possible rewrite: When deciding to release a model, developers should consider how their choice of release modality influences the risk for causing differential harm and which interventions would be most effective in their context. ("Be careful, think about how modality and risk interact")}
It is important for researchers and policymakers to be open to evidence both in favor of and against the possibility that some models may pose large risks if deployed in certain ways. 
Some models, even with safeguards, might enable acute misuse if deployed with open weights. 
Others might significantly hinder open science or concentrate large amounts of power if deployed with closed weights.
Others still might pose major risks regardless of deployment type.
% Equally, it is important to be open to evidence both in favor and against the possibility that some models, 
% This paper does not take a position in favor of closed versus open releases of any particular type of frontier AI model, principally due to uncertainty about the future. 

\textbf{Incentivizing future research:} While we are optimistic about the potential value of more research into technical mitigations against open-weight model risks, we recognize that incentives for private actors to research and develop robust safeguards for frontier open-weight models are currently limited. Furthermore, technical safeguards for open models will only be resistant to some degree of intervention. So from a researcher's standpoint, work on technical interventions may be high-risk (in terms of investment) and limited-reward. This does not mean, however, that this work is not worthwhile. Each of the strategies we discussed in \Cref{sec:open} is individually imperfect, but contributes meaningfully to reducing harm or increasing information. Used in concert, these methods can substantially improve risk management. 
There are also barriers to important safety and security research that remain in place. While many open developers provide models with the intent of enabling research and participate in open flaw bounties \citep{mcgregor2025err}, some major open-weight developers do not consistently offer legal `safe harbors' and even impose legal language or technical obstacles against good-faith safety evaluations of their systems' safeguards \citep{longpre2024safe}.


\textbf{The importance of openness (not just of model weights):} The status quo may currently incentivize little openness related to open-weight model risk management (\Cref{tab:developers}).
However, in building the science of open-weight model risk management, we emphasize the value of open scientific collaboration \citep{phang2022eleutherai, linaaker2025cartography, scotti2025structure}, open research \citep{biderman2023pythia, liu2023llm360, groeneveld2024olmo}, open evaluations \citep{gao2021framework, bommasani2023holistic, biderman2024lessons}, open reporting about risk-management methodology \citep{seger2024openhorizons}, and open standardized documentation. Just as building the science of open-weight model risk management will provide a collective good, it will also require collective effort.



\section*{Acknowledgments}

We are thankful to Anka Reuel, Isabella Duan, Jack Sanderson, Nicholas Carlini, and Stella Biderman for discussions on drafts of the paper.


\bibliography{bibliography}
\bibliographystyle{tmlr}

% \appendix
% \section{Appendix}
% You may include other additional sections here.

\end{document}
