Keywords: Large Language Models, Prompt Injection Attacks, Benchmarking, Evauations
Abstract: Large Language Models (LLMs) have brought with them an unprecedented interest in AI in society. This has enabled their use in several day to day applications such as virtual assistants or smart home agents. This integration with external tools also brings several risk areas where malicious actors may try to inject harmful instructions in the user query (direct prompt injection) or in the retrieved information payload of RAG systems (indirect prompt injection). Among these, indirect prompt injection attacks carry serious risks given the end users may not be aware of new attacks when they happen. However, detailed benchmarking of LLMs towards this risk is still limited. In this work, we develop a new framework called LLM-PIRATE to measure any LLM candidate towards their risk for indirect prompt injection attacks. We leverage our framework to create a new test set and evaluate several state of the art LLMs using this test set, and observe strong attack success rates in most of them. We will release our generated test set, along with the full framework to encourage wider assessment of this risk in current LLMs.
Submission Number: 35
Loading