Keywords: agentic behavior, language model safety, Llama 3.1, AI misuse, safety fine-tuning, large language models
Abstract: Recent advances have made it possible to bypass language model safety guardrails by modifying model weights, raising concerns about misuse. Concurrently, models like Llama 3.1 Instruct are becoming increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement simple agent scaffolding to create an unrestricted agent. We find that these refusal-vector-ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To explore this further, we introduce a small Safe Agent Benchmark designed to test both harmful and benign tasks in agentic scenarios. Our results suggest that safety fine-tuning in chat models does not generalize well to agentic behavior: we find that Llama 3.1 models are willing to perform most harmful tasks without any modification. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.
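The abstract does not spell out the ablation procedure, but a minimal sketch of the general refusal-direction ablation idea may help fix intuitions: compute a difference-of-means direction between residual-stream activations on harmful and harmless prompts, then orthogonalize weight matrices that write into the residual stream against that direction. The function names, tensor shapes, and the choice to ablate at the weight level are illustrative assumptions, not the authors' exact implementation.

```python
import torch


def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction from residual-stream activations.

    Both inputs are assumed to have shape [n_prompts, d_model], collected at
    some intermediate layer on harmful vs. harmless prompts.
    """
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()


@torch.no_grad()
def ablate_from_weight(weight: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix.

    `weight` is assumed to have shape [d_model, d_in] and to write into the
    residual stream; after ablation the layer can no longer write along r.
    """
    projector = torch.outer(r, r)          # rank-1 projector onto r
    return weight - projector @ weight     # (I - r r^T) W
```

In a full pipeline, this orthogonalization would typically be applied to every matrix that writes to the residual stream (embeddings, attention output projections, MLP down-projections), leaving all other weights untouched.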
Submission Number: 69