Blocking or Reminding? Investigating Guard Models as Input Safeguards for LLM Agents

ACL ARR 2025 February Submission 7648 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: The advancement of large language models (LLMs) has enabled LLM agents to perform autonomous tasks, raising community concerns about agent safety. Recent work has shown that LLM agents often fail to refuse harmful requests, leading to safety issues. Among various potential threats, harmful user requests represent a fundamental input-side vulnerability for LLM agents, highlighting the need for effective input safeguards. To address these concerns, guard models have been developed to moderate both the inputs and outputs of LLMs. However, whether they are effective at judging harmful and benign agentic requests, and how they should be utilized for LLM agents, remain open questions. In this paper, we examine the effectiveness of employing guard models as input safeguards for LLM agents. Concretely, we investigate guard models in two paradigms: the conventional approach of directly blocking requests judged as harmful, and a newly proposed approach of reminding LLM agents of the guard's judgment on each user request. Through comprehensive experiments, we conclude that blocking is not an ideal solution for LLM agents, because guard models over-refuse benign user requests. In contrast, the reminding paradigm increases agents' refusal of harmful requests while causing only a slight reduction in performance on benign requests. Further, we conduct ablation and case studies to investigate the over-refusal issue and the reminding mechanism, providing valuable insights for future improvements in input moderation techniques.
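For concreteness, the contrast between the two paradigms described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the names `guard_judge`, `run_agent`, and the reminder template are hypothetical placeholders.

```python
# Sketch of the two input-safeguard paradigms (blocking vs. reminding).
# guard_judge(request) -> "harmful" or "benign"; run_agent(request) -> agent output.
# All names and the template text are illustrative assumptions.

REFUSAL_MESSAGE = "Request blocked: the guard model judged it as harmful."

REMINDER_TEMPLATE = (
    "[Safety reminder] An input guard model judged this request as {verdict}. "
    "Take this judgment into account before acting.\n\n{request}"
)


def blocking_paradigm(request, guard_judge, run_agent):
    """Blocking: refuse outright whenever the guard flags the request as harmful."""
    if guard_judge(request) == "harmful":
        return REFUSAL_MESSAGE
    return run_agent(request)


def reminding_paradigm(request, guard_judge, run_agent):
    """Reminding: always forward the request, but prepend the guard's verdict
    so the agent itself can decide whether to refuse."""
    verdict = guard_judge(request)
    reminded_request = REMINDER_TEMPLATE.format(verdict=verdict, request=request)
    return run_agent(reminded_request)
```

Under blocking, a guard false positive on a benign request means the task is never attempted (over-refusal); under reminding, the same false positive only adds a cautionary note, leaving the final decision to the agent.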
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM agent, guard model, agent safety
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7648