Reducing Annotation Burden in Prompt Injection Detection: An Empirical Study of Uncertainty Sampling in Active Learning

Reducing Annotation Burden in Prompt Injection Detection: An Empirical Study of Uncertainty Sampling in Active Learning

ACL ARR 2025 May Submission1324 Authors

17 May 2025 (modified: 03 Jul 2025)ACL ARR 2025 May SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Prompt injection attacks pose significant risks to the safe deployment of large language models (LLMs), yet detecting adversarial prompts typically requires substantial human annotations. This work explores uncertainty-based active learning as a strategy to reduce annotation effort in prompt injection classification. Using sentence embeddings and a lightweight XGBoost classifier, we simulate a human-in-the-loop labeling process on a benchmark dataset. Our results demonstrate that entropy-based sampling consistently outperforms random selection, achieving higher accuracy and inter-annotator agreement with fewer labeled examples. Our approach avoids dependence on large LLMs during annotation, mitigating risks associated with prompt injection vulnerabilities. These findings suggest that uncertainty-driven active learning combined with classical classifiers provides an effective and practical solution for adversarial prompt detection under limited annotation budgets, with implications for safer and more scalable deployment.

Paper Type: Short

Research Area: Efficient/Low-Resource Methods for NLP

Research Area Keywords: data-efficient training, NLP in resource-constrained settings, human-in-the-loop / active learning

Contribution Types: NLP engineering experiment, Approaches to low-resource settings

Languages Studied: english

Submission Number: 1324

Loading