Keywords: Text-to-image diffusion, LoRA adapters, Concept discovery, Model forensics, Prompt optimization, Visual-semantic analysis, AI accountability, AI security, Evolutionary search, Gradient-based refinement
TL;DR: This work introduces a forensic method to uncover hidden “activation keys” in LoRA adapters by combining evolutionary search and gradient refinement, enabling reliable detection of LoRA-specific concepts for auditing and accountability.
Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique for customizing large diffusion models, enabling users to inject new styles, characters, or identities into text-to-image generation with minimal computational cost. While this flexibility fuels creative expression, it also opens the door to injecting sensitive or potentially harmful content, such as political figures’ faces, copyrighted characters, or explicit imagery, into generative models. These LoRA adapters are often distributed without documentation, making it difficult to identify the concepts they encode or understand how they are triggered. This lack of transparency poses serious challenges for moderation, accountability, and large-scale content auditing in open-source model ecosystems. To address this risk, we adopt the role of a model investigator and introduce the LoRA “activation key” discovery problem: given a suspect LoRA and its base model, identify a text embedding that reliably activates behaviors unique to the LoRA. This activation key serves as a forensic probe to reveal hidden concepts introduced during fine-tuning. To this end, we propose a two-stage optimization framework: an evolutionary search in the token space to identify promising candidate prompts, followed by gradient-based refinement in the embedding space. Our objective encourages the LoRA model to generate concentrated outputs while maximizing divergence from the base model, resulting in an embedding that reveals distinct LoRA-specific behaviors. Experiments on six public LoRA adapters show that our method recovers ground-truth concepts in both white-box and black-box settings. Our work demonstrates the feasibility of LoRA forensics and highlights the need for auditing tools in open-source model ecosystems.
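The abstract's two-stage framework can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: `encode_tokens`, `lora_features`, and `base_features` are hypothetical callables standing in for the tokenizer/text encoder, the LoRA-patched pipeline, and the base pipeline respectively, and the refinement stage additionally assumes the feature extraction is differentiable with respect to the text embedding. The objective mirrors the one described above: maximize divergence from the base model while keeping the LoRA model's outputs concentrated across sampling seeds.

```python
import torch

def discover_activation_key(
    encode_tokens,    # hypothetical: token-id tensor (L,) -> text embedding (L, D)
    lora_features,    # hypothetical: embedding (L, D) -> image features (N, F), LoRA model
    base_features,    # hypothetical: embedding (L, D) -> image features (N, F), base model
    vocab_size,
    prompt_len=8,
    pop_size=32,
    generations=20,
    refine_steps=100,
    lr=1e-2,
):
    """Two-stage search for a LoRA 'activation key': evolutionary search over
    discrete tokens, then gradient refinement in the continuous embedding space."""

    def objective(emb):
        f_lora = lora_features(emb)  # (N, F) features over N sampling seeds
        f_base = base_features(emb)
        divergence = (f_lora.mean(0) - f_base.mean(0)).norm()  # diverge from base model
        dispersion = f_lora.var(0).mean()                      # keep LoRA outputs concentrated
        return divergence - dispersion

    # Stage 1: evolutionary search in token space.
    pop = torch.randint(vocab_size, (pop_size, prompt_len))
    for _ in range(generations):
        with torch.no_grad():
            fitness = torch.stack([objective(encode_tokens(t)) for t in pop])
        elite = pop[fitness.topk(pop_size // 4).indices]  # keep the fittest quarter
        children = elite[torch.randint(len(elite), (pop_size - len(elite),))].clone()
        # Mutate one random position per child with a random replacement token.
        pos = torch.randint(prompt_len, (len(children),))
        children[torch.arange(len(children)), pos] = torch.randint(
            vocab_size, (len(children),))
        pop = torch.cat([elite, children])

    # Stage 2: gradient-based refinement of the best candidate's embedding.
    with torch.no_grad():
        best = pop[torch.stack([objective(encode_tokens(t)) for t in pop]).argmax()]
    emb = encode_tokens(best).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(refine_steps):
        opt.zero_grad()
        (-objective(emb)).backward()  # maximize the objective
        opt.step()
    return emb.detach()
```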
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9527