How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Published: 06 Mar 2025, Last Modified: 27 Mar 2025, ICLR-25 HAIC Workshop, CC BY 4.0
Track: tiny / short paper (up to 5 pages)
Keywords: Constitutional AI, Language Models, AI Safety, Model Alignment, Self-Critique, Small Language Models
TL;DR: We investigated Constitutional AI in small uncensored models, demonstrating that the reasoning capabilities of Llama 3.1 and DeepSeek-R1 enable more effective harm reduction than Gemma-2 and Qwen2.5.
Abstract: Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures struggled with harm detection post-abliteration. These findings suggest CAI's effectiveness may vary depending on model architecture and reasoning capabilities.
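The self-critique mechanism the abstract refers to can be sketched as a critique-then-revise loop. The prompt templates, the `self_critique` helper, and the stub model below are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch of a Constitutional AI self-critique loop.
# `model` is any text-in/text-out callable; the prompts here are
# hypothetical stand-ins for constitution-derived critique/revision prompts.

CRITIQUE_PROMPT = (
    "Identify ways the response below is harmful or unethical:\n{response}"
)
REVISION_PROMPT = (
    "Rewrite the response to remove the harms noted.\n"
    "Response: {response}\nCritique: {critique}"
)

def self_critique(model, response: str, rounds: int = 1) -> str:
    """Iteratively critique a response and revise it per the critique."""
    for _ in range(rounds):
        critique = model(CRITIQUE_PROMPT.format(response=response))
        response = model(
            REVISION_PROMPT.format(response=response, critique=critique)
        )
    return response

# Toy stand-in for an LLM, so the loop is runnable: it flags the word
# "dangerous" in the critique step and redacts it in the revision step.
def stub_model(prompt: str) -> str:
    if prompt.startswith("Identify"):
        return ("Mentions something dangerous."
                if "dangerous" in prompt else "No harms found.")
    body = prompt.split("Response: ", 1)[1].split("\nCritique:", 1)[0]
    return body.replace("dangerous", "[removed]")

print(self_critique(stub_model, "Here is a dangerous recipe."))
# → Here is a [removed] recipe.
```

In practice both the critique and the revision are produced by the model under study, so the loop's effectiveness depends on the model's ability to detect harm in its own output, which is the capability the paper compares across architectures.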
Submission Number: 17