Secret Alignment: Reframing Backdooring as Security Primitive in the Personal AI Era

18 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLMs, Secret Alignment, Backdooring, Personal AI
Abstract: The rise of open-weight LLMs, efficient training and inference pipelines, and easily accessible hardware/software have enabled individuals or small organizations to develop and deploy proprietary models, ushering in the Personal AI era. With this paradigm shift, LLMs become more privately owned digital assets rather than centralized public services, raising unprecedented security concerns, such as model theft, unauthorized access, and behavioral misuse. In this paper, we critically examine the potential of positive backdooring as a lightweight control mechanism for securing LLMs in Personal AI settings. We uncover a unifying mechanism behind these seemingly disparate methods—namely, Secret Alignment: a covert trigger-behavior association that enables legitimate security functionalities such as access gating, ownership attribution, and safety enforcement. Specifically, we assess three representative use cases across diverse scenarios on six core properties and reveal significant brittleness of them—particularly in the stability, durability, and verifiability of trigger-behavior mappings. To capture the rationale behind this, we identify the behavioral foundations of Secret Alignment based on behavior density and decision complexity, which allow us to forecast real-world performance before deployment. Our exploration exhibits both the potential and the limitations of using Secret Alignment as a security primitive in the emerging Personal AI era, aiming to provide more principled and candid assessments to stakeholders.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14033
Loading