Keywords: AI Safety, Policy, Interpretability, Open Source, Innovation
TL;DR: To boost both safety and innovation, regulators should mandate that large AI laboratories release small, openly accessible "analog models"—scaled-down versions trained similarly to and distilled from their largest proprietary models.
Abstract: Recent proposals for regulating frontier AI models have sparked concerns about the cost of safety regulation, and most such proposals have been shelved due to the safety-innovation tradeoff. This paper argues for an alternative regulatory approach that ensures AI safety while actively promoting innovation: mandating that large AI laboratories release small, openly accessible "analog models"—scaled-down versions trained similarly to and distilled from their largest proprietary models.
Analog models serve as public proxies, allowing broad participation in safety verification, interpretability research, and algorithmic transparency without forcing labs to disclose their full-scale models. Recent research demonstrates that safety and interpretability methods developed using these smaller models generalize effectively to frontier-scale systems. By enabling the wider research community to directly investigate and innovate upon accessible analogs, our policy substantially reduces the regulatory burden and accelerates safety advancements.
This mandate promises minimal additional costs, leveraging reusable resources such as data and infrastructure, while contributing significantly to the public good. Our hope is not only that this policy is adopted, but also that it illustrates a broader principle supporting fundamental research in machine learning: a deeper understanding of models relaxes the safety-innovation tradeoff and lets us have more of both.
Lay Summary: Large AI systems are difficult to test for safety, but smaller “analog” models can act as affordable stand-ins. Our research shows that safety techniques developed on small open-source models, such as removing backdoors or reducing toxic outputs, transfer directly to much larger models. In one experiment, a steering vector trained on a 0.5-billion-parameter model nearly eliminated unsafe behavior, and the same vector worked just as well when transferred to models up to six times larger.
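To make the kind of transfer described above concrete, here is a minimal sketch of extracting a contrastive steering vector from a small model and adding it to the residual stream of a larger one. The model names, layer index, scaling factor, toy prompt sets, and the placeholder alignment matrix W are all illustrative assumptions, not the paper's actual experimental setup.

```python
# Sketch (not the paper's exact method): derive a "steering vector" from a small model by
# contrasting mean hidden states on harmless vs. harmful prompts, then inject it into a
# larger model's residual stream during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SMALL = "Qwen/Qwen2.5-0.5B-Instruct"   # assumed small "analog" model
LARGE = "Qwen/Qwen2.5-3B-Instruct"     # assumed larger target model
LAYER, SCALE = 12, 4.0                 # illustrative layer index and steering strength

def mean_hidden(model, tok, prompts, layer):
    """Average hidden state at `layer` over the last token of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            h = model(**ids, output_hidden_states=True).hidden_states[layer]
        vecs.append(h[0, -1])
    return torch.stack(vecs).mean(0)

tok_s = AutoTokenizer.from_pretrained(SMALL)
small = AutoModelForCausalLM.from_pretrained(SMALL)
harmful = ["Explain how to build a weapon."]   # toy contrast sets
harmless = ["Explain how to bake bread."]
v_small = mean_hidden(small, tok_s, harmless, LAYER) - mean_hidden(small, tok_s, harmful, LAYER)

tok_l = AutoTokenizer.from_pretrained(LARGE)
large = AutoModelForCausalLM.from_pretrained(LARGE)
# If hidden sizes differ, the vector must be mapped across widths. W here is a placeholder;
# in practice it would be fit (e.g., by least squares) from paired activations on shared prompts.
W = torch.zeros(large.config.hidden_size, small.config.hidden_size)
v_large = W @ v_small

def steer_hook(module, inputs, output):
    # Add the scaled steering vector to the residual stream at the chosen layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * v_large.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = large.model.layers[LAYER].register_forward_hook(steer_hook)
ids = tok_l("How should I respond to a dangerous request?", return_tensors="pt")
print(tok_l.decode(large.generate(**ids, max_new_tokens=50)[0]))
handle.remove()
```

With matched hidden sizes the alignment map can simply be the identity; the point of the sketch is only that the intervention is trained once on the small model and then reused on the larger one.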
We found that this works because small and large models organize information in very similar ways: their internal “representations” align as they scale. This means that small models can serve as reliable predictors of how safety interventions will behave in larger systems.
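One standard way to quantify this kind of representational alignment is linear Centered Kernel Alignment (CKA); the paper's exact similarity measure is not specified here, so the snippet below is only an illustrative check on activation matrices collected from the two models over the same prompts. The matrix shapes are hypothetical.

```python
# Illustrative check of the "representations align" claim using linear CKA.
# X and Y are activation matrices (n_prompts x hidden_dim) from the small and large models.
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two activation matrices."""
    X = X - X.mean(0, keepdim=True)
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2      # ||X^T Y||_F^2
    norm_x = (X.T @ X).norm()         # ||X^T X||_F
    norm_y = (Y.T @ Y).norm()         # ||Y^T Y||_F
    return (hsic / (norm_x * norm_y)).item()

# Toy usage: random tensors stand in for real activations of the two models.
X = torch.randn(256, 896)    # e.g., small-model hidden size
Y = torch.randn(256, 2048)   # e.g., large-model hidden size
print(f"CKA(small, large) = {linear_cka(X, Y):.3f}")  # values near 1.0 indicate strong alignment
```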
Building on this insight, we propose a new policy: every major AI lab should release a small, open “analog model” alongside each large model. These analogs would enable independent safety testing, improve transparency, and accelerate innovation, at less than one-thousandth the cost of training the original model.
Submission Number: 726