FUSE: Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation

Dewang Sultania, Tushar Vatsa, Vibha Belavadi, Suhas Suresha

Published: 15 Nov 2025, Last Modified: 30 Jan 2026 | ICDM MMSR 2025: Workshop on Multimodal Search and Recommendations | Everyone | CC BY 4.0

Abstract: Multimodal creative assistants decompose user goals and route tasks to subagents for layout, styling, retrieval, and generation. Retrieval quality is pivotal, yet failures can arise at several stages: understanding user intent, choosing content types, finding candidates (recall), or ranking results. Meanwhile, sending and processing images is costly, making naive multimodal approaches impractical. We present FUSE: Failure-aware Usage of Subagent Evidence for multimodal search and recommendation. FUSE replaces most raw-image prompting with a compact Grounded Design Representation (GDR): a selection-aware JSON of canvas elements (image, text, shape, icon, video, logo), structure, styles, salient colors, and user selection provided by the Planner team. FUSE implements seven context budgeting strategies: comprehensive baseline prompting, context compression, chain-of-thought reasoning, mini-shot optimization, retrieval-augmented context, two-stage processing, and zero-shot minimalism. Finally, a pipeline attribution layer monitors system performance by converting subagent signals into simple checks: intent alignment, content-type/routing sanity, recall health (e.g., zero-hit and top-match strength), and ranking displacement analysis. We evaluate the seven context budgeting variants on 788 evaluation queries from diverse users and design templates (see Figure 3). Our systematic evaluation reveals that Context Compression achieves the best performance across all pipeline stages, with 93.3% intent accuracy, 86.8% routing success (with fallbacks), 99.4% recall, and 88.5% NDCG@5. This result indicates that strategic context summarization outperforms both comprehensive and minimal contextualization strategies. The narrow range of intent accuracy (89.5-93.3%) across variants indicates that all context budgeting approaches provide substantial benefits over zero-shot baselines.
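To make the recall-health and ranking metrics above concrete, here is a minimal sketch of NDCG@5 and of a zero-hit / top-match-strength check. It is illustrative only: the function names, the linear-gain form of DCG, and the `min_top_score` threshold are assumptions, not details from the paper.

```python
import math
from typing import Dict, Sequence, Union


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_health(
    scores: Sequence[float], min_top_score: float = 0.5
) -> Dict[str, Union[bool, float]]:
    """Two simple recall checks on the retrieved candidates' match scores.

    min_top_score is a hypothetical threshold chosen for illustration.
    """
    zero_hit = len(scores) == 0
    top_score = max(scores) if scores else 0.0
    return {
        "zero_hit": zero_hit,
        "top_score": top_score,
        "weak_top_match": (not zero_hit) and top_score < min_top_score,
    }
```

For instance, `ndcg_at_k([3, 2, 1])` is 1.0 because the results are already in ideal order, while `recall_health([])` flags a zero-hit query.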
We also measure p50/p95/p99 latency and normalized token cost: Context Compression delivers the best latency–cost trade-off (45% lower p95 vs. Baseline; 8 times fewer input tokens vs. Chain-of-Thought) while Chain-of-Thought provides maximal reasoning at the highest cost.
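As a rough illustration of how such latency figures can be derived, the sketch below computes p50/p95/p99 with the nearest-rank method and a relative reduction between two variants. The nearest-rank definition and all function names are assumptions; the paper does not state which percentile estimator it uses.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 of a list of per-query latencies in milliseconds."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


def relative_reduction(baseline: float, variant: float) -> float:
    """Fractional reduction of `variant` relative to `baseline`."""
    return (baseline - variant) / baseline
```

With synthetic latencies of 1 to 100 ms, `latency_summary` returns 50, 95, and 99 for the three percentiles. A "45% lower p95" corresponds to `relative_reduction(baseline_p95, variant_p95)` evaluating to 0.45.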