Reducing Annotation Burden in Prompt Injection Detection: An Empirical Study of Uncertainty Sampling in Active Learning
Abstract: Prompt injection attacks pose significant risks to the safe deployment of large language models (LLMs), yet detecting adversarial prompts typically requires substantial human annotations. This work explores uncertainty-based active learning as a strategy to reduce annotation effort in prompt injection classification. Using sentence embeddings and a lightweight XGBoost classifier, we simulate a human-in-the-loop labeling process on a benchmark dataset. Our results demonstrate that entropy-based sampling consistently outperforms random selection, achieving higher accuracy and inter-annotator agreement with fewer labeled examples. Our approach avoids dependence on large LLMs during annotation, mitigating risks associated with prompt injection vulnerabilities. These findings suggest that uncertainty-driven active learning combined with classical classifiers provides an effective and practical solution for adversarial prompt detection under limited annotation budgets, with implications for safer and more scalable deployment.
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data-efficient training, NLP in resource-constrained settings, human-in-the-loop / active learning
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: english
Submission Number: 1324
Loading