Track: Demo papers (2-4 pages)
Keywords: Agentic VLMs, AI Safety, Model Context Protocol
TL;DR: SafeAgent-MCP enforces proactive safety for GUI automation agents by combining semantic PII blocking and structured action constraints before execution.
Abstract: Mobile and desktop task automation using agentic vision-language models (VLMs) faces critical safety challenges: processing sensitive UI screenshots containing personally identifiable information (PII), generating potentially harmful GUI actions, and operating autonomously without human oversight. Current approaches such as VisionTasker rely on post-execution validation or lack generation-time safety guarantees entirely. We present SafeAgent-MCP, a framework that combines semantic-constrained decoding with context-aware entity detection for generation-time safety and the Model Context Protocol (MCP) for dynamic policy management. Our system enforces three constraint types: (1) entity-level blocking using context-aware semantic entity recognition to prevent PII leakage from screenshots, (2) action-space constraints via NVIDIA NIM structured generation restricting dangerous GUI operations, and (3) policy-aware refusal leveraging OpenAI's gpt-oss-safeguard for reasoning-based safety validation. SafeAgent-MCP provides the first systematic safety framework for production GUI automation agents that combines modular semantic entity recognition with industry-standard constrained-generation infrastructure.
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 20