When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
Keywords: Large Language Model, Personalized Agent, LLM Safety
Abstract: Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions.
However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications.
In this paper, we reveal *intent legitimation*, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries.
To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions.
Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by **15.8%–243.7%** relative to stateless baselines.
We further provide mechanistic evidence for intent legitimation from the model's internal representation space, and propose a lightweight detection–reflection method that effectively mitigates the safety degradation.
Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context.
**WARNING:** This paper may contain harmful content.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment for agents, safety and alignment
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 10504