Defending Language Models Against Image-Based Prompt Attacks via User-Provided Specifications

Reshabh K. Sharma; Vinayak Gupta; Dan Grossman

Defending Language Models Against Image-Based Prompt Attacks via User-Provided Specifications

Reshabh K. Sharma, Vinayak Gupta, Dan Grossman

Published: 01 Jan 2024, Last Modified: 05 Jun 2025SP (Workshops) 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: In recent years, there has been an exponential growth of real-world applications where humans interact with chatbots through a combination of images and text. Specifically, these multi-modal large language models (MLLMs) can digest various forms of data and understand users’ intents and needs encoded in both images and text. However, defending these models against prompt-based injection attacks is a less explored area. This problem is further exacerbated by the limitations of current defense mechanisms for language models, which are restricted to handling only text data. This paper introduces a novel defense mechanism for MLLM-based chatbots, addressing image-based injection attacks through a two-stage approach: input validation to identify unsafe inputs before reaching the chatbot, and prompt injection detection to safeguard the MLLM backbone from malicious image attacks. The framework utilizes a domain-specific programming language tailored for secure chatbot definitions and user-specified specifications for chatbots and image inputs. Through experiments on models like GPT-4VISION and LLAVA, we demonstrate the limitations of relying on model robustness and showcase our approach’s effectiveness in improving malicious attack detection for MLLM-based chatbots.

Loading