Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: model steering, vision language models, adversarial machine learning
TL;DR: The paper proposes a novel way of using a single image to mimic steering vectors for vision language models
Abstract: Vision Language Models (VLMs) are increasingly used in a broad range of applications. Existing approaches based on activation steering vectors require invasive runtime access to model internals, which is incompatible with API-based services and closed-source deployments. We introduce VISOR (Visual Input based Steering for Output Redirection), a novel method that achieves sophisticated behavioral control through optimized visual inputs alone. VISOR enables practical deployment across all VLM serving paradigms while remaining imperceptible compared to explicit textual instructions. A single steering image matches, and in some cases outperforms, steering vectors. We demonstrate the effectiveness of VISOR on three behavioral steering tasks and on two VLMs with different architectures. Compared to system prompting, VISOR provides more robust bidirectional control while maintaining equivalent performance on 14,000 unrelated MMLU tasks, with a maximum performance drop of 0.1% across models and datasets. Beyond reducing overhead and runtime model-access requirements, VISOR exposes a critical security vulnerability: adversaries can achieve sophisticated behavioral manipulation through visual channels alone, bypassing text-based defenses.
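To make the core idea concrete, below is a minimal sketch (not the authors' code) of how one might optimize a single steering image so that the activation shift it induces in a VLM approximates a precomputed steering vector. The model interface `vlm.hidden_at_layer`, the choice of layer, and the optimization hyperparameters are all assumptions for illustration only.

```python
# Minimal sketch, assuming a hypothetical VLM wrapper `vlm` that exposes hidden
# activations at a given layer for a (prompt, image) pair. The target steering
# vector `v` is assumed to be computed beforehand (e.g., from contrastive prompts).
import torch
import torch.nn.functional as F


def optimize_steering_image(vlm, prompts, v, layer, steps=500, lr=1e-2):
    # Start from a random image; pixel values are kept in [0, 1] by clamping.
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for p in prompts:
            # Hidden activations at `layer` with and without the candidate image.
            h_img = vlm.hidden_at_layer(p, image=img.clamp(0, 1), layer=layer)
            h_base = vlm.hidden_at_layer(p, image=None, layer=layer)
            # Push the image-induced activation shift toward the steering vector.
            loss = loss + F.mse_loss(h_img - h_base, v)
        loss.backward()
        opt.step()
    return img.detach().clamp(0, 1)
```

In this sketch, behavioral control comes entirely from the optimized image supplied at inference time, so no runtime access to model internals is needed once the image has been produced.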
Submission Track: Workshop Paper Track
Submission Number: 13