Learning to Align and Act: Cross-Modal Gating and Multimodal Reward Shaping for Web Agents

Diyang Guan

17 Sept 2025 (modified: 19 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Reinforcement Learning, Web Agents, Multimodal Learning, Reward Shaping

TL;DR: Cross-modal gating combined with multimodal reward shaping yields 6-8% higher task success and a 44% reduction in sample complexity for web agents on MiniWoB++, WebShop, and Mind2Web.

Abstract: Web-based reinforcement learning agents face two fundamental challenges that limit their effectiveness in real-world applications: cross-modal misalignment between visual screenshots and HTML DOM representations, and severe reward sparsity in multi-step interaction tasks. Existing approaches typically rely on static fusion strategies that fail to adapt to the dynamic importance of different modalities across task phases, while sparse binary rewards provide insufficient guidance for efficient learning in long-horizon scenarios. To address these limitations, we propose a novel framework that integrates cross-modal attention gating with multimodal feedback-driven reward shaping. Our gating mechanism dynamically regulates the contribution of visual and textual modalities based on task context and trajectory history, enabling adaptive coordination throughout the decision-making process. Simultaneously, our reward shaping approach decomposes sparse terminal rewards into dense step-level signals derived from both visual UI state changes and textual content validation, providing informative feedback at each interaction step. Extensive experiments on MiniWoB++, WebShop, and Mind2Web demonstrate that our method achieves significant improvements over strong baselines, with 6-8% gains in task success rate and 44% reduction in sample complexity. Ablation studies reveal that the combination of gating and shaping yields synergistic benefits, with our gating controller learning to make confident, context-aware modality selections (achieving 25% lower attention entropy) while the multimodal reward shaping increases feedback density from 3-6% to 23-30% of interaction steps. These results establish a new paradigm for multimodal reinforcement learning in web environments, demonstrating that adaptive modality coordination and granular feedback alignment are essential for robust and efficient web agent training.
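The abstract does not give the formulas behind its two components, so the following is only a minimal illustrative sketch of the general ideas, not the authors' implementation. It assumes a two-way softmax gate over visual and textual features, a shaped reward that adds small step-level bonuses to the sparse terminal reward, and feedback density measured as the fraction of steps with a nonzero reward. All function names, the gate logits (standing in for the output of a controller conditioned on task context and trajectory history), and the bonus magnitudes are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_entropy(weights):
    """Shannon entropy (in nats) of the gate weights; lower means a more confident gate."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def gated_fusion(visual, text, gate_logits):
    """Fuse two equal-length feature vectors with softmax gate weights.

    gate_logits is a hypothetical pair [visual_logit, text_logit]; in a real
    agent it would come from a learned controller, which is not modeled here.
    """
    w_vis, w_txt = softmax(gate_logits)
    return [w_vis * v + w_txt * t for v, t in zip(visual, text)]


def shaped_reward(terminal_reward, ui_changed, text_validated,
                  bonus_ui=0.1, bonus_text=0.1):
    """Add assumed step-level bonuses for a visual UI change and a textual check."""
    return (terminal_reward
            + bonus_ui * float(ui_changed)
            + bonus_text * float(text_validated))


def feedback_density(rewards):
    """Fraction of steps in a trajectory that carry a nonzero reward."""
    return sum(1 for r in rewards if r != 0) / len(rewards)


if __name__ == "__main__":
    fused = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_logits=[3.0, 0.0])
    print("fused features:", fused)
    print("gate entropy:", gate_entropy(softmax([3.0, 0.0])))
    # A four-step episode: (terminal reward, UI changed, text validated) per step.
    steps = [(0.0, True, False), (0.0, False, False),
             (0.0, False, True), (1.0, True, True)]
    rewards = [shaped_reward(*s) for s in steps]
    print("shaped rewards:", rewards)
    print("feedback density:", feedback_density(rewards))
```

In this toy setup, a gate with a larger logit gap has lower entropy than a uniform gate, and adding step-level bonuses raises the share of steps that carry feedback. These are the two quantities the abstract reports as attention entropy and feedback density.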

Primary Area: reinforcement learning

Submission Number: 9512
