Can a Large Language Model Keep My Secrets? A Study on LLM-Controlled Agents

Published: 22 Jun 2025, Last Modified: 22 Jun 2025, ACL-SRW 2025 Poster, CC BY 4.0
Keywords: LLMs for Security, Access Control, Datasets
TL;DR: We explore the ability of LLMs to perform natural-language access control tasks by providing a novel synthetic dataset, testing two LLMs on it, and establishing a human baseline.
Abstract: Agents controlled by large language models (LLMs) have the potential to assist humans with natural language tasks across various domains and applications, provided they are given access to their principal's confidential data. When such digital assistants interact with a potentially adversarial environment, the confidentiality of that data is at stake. Given a natural language request, we investigate whether an LLM-controlled agent can control access to internal data by considering confidentiality in its response, in a manner similar to humans. For evaluation, we created a synthetic dataset of confidentiality-aware planning and deduction tasks in an organizational access control setting. The dataset was developed from human input, LLM-generated content, and existing datasets, and covers various everyday scenarios in which access to confidential or private information is requested. We use the dataset to evaluate whether models can infer confidentiality-aware behavior in such scenarios by differentiating between legitimate and illegitimate access requests. We compare a prompting-based and a fine-tuning-based approach, evaluating the performance of Llama 3 and GPT-4o-mini in this domain. In addition, we conducted a user study to establish a human baseline for these tasks; participants reached an accuracy of up to 79%. Prompting techniques such as chain-of-thought and few-shot prompting yield promising results, but they still fall short of real-world applicability and do not surpass the human baseline. Fine-tuning, however, significantly improves the agent's ability to make access decisions, reaching an accuracy of up to 98%, which makes it a promising approach for future confidentiality-aware applications when training data is available.
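The prompting-based approach described in the abstract can be sketched as a simple prompt-construction step. This is an illustrative sketch only: the instruction wording, the scenario/request format, and the LEGITIMATE/ILLEGITIMATE label strings are assumptions for demonstration, not the paper's actual prompts.

```python
def build_access_prompt(scenario, request, examples=()):
    """Assemble a few-shot, chain-of-thought style prompt that asks an LLM
    to classify a natural-language access request as legitimate or not.

    `scenario` and `request` are free-form strings; `examples` is an
    iterable of (scenario, request, label) tuples used as few-shot shots.
    All field names and labels here are hypothetical.
    """
    parts = [
        "You are an assistant that guards confidential organizational data.",
        "Decide whether each request is a LEGITIMATE or ILLEGITIMATE access request.",
        "Think step by step, then give a final one-word answer.",
    ]
    # Few-shot examples precede the query in the same format.
    for ex_scenario, ex_request, ex_label in examples:
        parts.append(f"Scenario: {ex_scenario}\nRequest: {ex_request}\nAnswer: {ex_label}")
    # The query itself is left open for the model to complete.
    parts.append(f"Scenario: {scenario}\nRequest: {request}\nAnswer:")
    return "\n\n".join(parts)
```

The resulting string would then be sent to the model under evaluation (e.g. Llama 3 or GPT-4o-mini) and the completion parsed for the final label; that calling and parsing logic is omitted here.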
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 169