Abstract: Multimodal creative assistants decompose user goals
and route tasks to subagents for layout, styling, retrieval, and
generation. Retrieval quality is pivotal, yet failures can arise
at several stages: understanding user intent, choosing content
types, finding candidates (recall), or ranking results. Meanwhile, sending and processing images is costly, making naive
multimodal approaches impractical. We present FUSE: Failure-aware Usage of Subagent Evidence for multimodal search and
recommendation. FUSE replaces most raw-image prompting with
a compact Grounded Design Representation (GDR): a selection-aware JSON of canvas elements (image, text, shape, icon,
video, logo), structure, styles, salient colors, and user selection
provided by the Planner team. FUSE implements seven context
budgeting strategies: comprehensive baseline prompting, context
compression, chain-of-thought reasoning, mini-shot optimization,
retrieval-augmented context, two-stage processing, and zero-shot
minimalism. Finally, a pipeline attribution layer monitors system
performance by converting subagent signals into simple checks:
intent alignment, content-type/routing sanity, recall health (e.g.,
zero-hit and top-match strength), and ranking displacement
analysis. We evaluate the seven context budgeting variants across
788 evaluation queries from diverse users and design templates
(see Figure 3). Our systematic evaluation shows that Context
Compression achieves the best performance across all pipeline
stages, with 93.3% intent accuracy, 86.8% routing success (with
fallbacks), 99.4% recall, and 88.5% NDCG@5. This result
demonstrates that strategic context summarization outperforms
both comprehensive and minimal contextualization strategies.
The narrow intent performance variance (89.5–93.3%) across
variants validates that all context budgeting approaches provide
substantial benefits over zero-shot baselines. We also measure
p50/p95/p99 latency and normalized token cost: Context Compression delivers the best latency–cost trade-off (45% lower
p95 vs. Baseline; 8 times fewer input tokens vs. Chain-of-Thought) while Chain-of-Thought provides maximal reasoning
at the highest cost.
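
To make the attribution checks and the ranking metric concrete, the following is a minimal Python sketch, not the FUSE implementation. The `RetrievalSignals` fields, the `min_top_score` threshold, and the mean-position-shift definition of ranking displacement are all assumptions chosen for illustration; the abstract does not specify them. NDCG@5 is computed with linear gain and a log2 position discount, which is one common formulation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RetrievalSignals:
    """Hypothetical per-query signals emitted by the subagents (names assumed)."""
    scores: List[float]       # similarity scores of recalled candidates
    retrieved_ids: List[str]  # candidate order produced by the recall stage
    ranked_ids: List[str]     # candidate order after the ranking stage


def recall_health(sig: RetrievalSignals, min_top_score: float = 0.5) -> Dict[str, bool]:
    """Flag zero-hit queries and queries whose best match is weak.

    min_top_score is an illustrative threshold, not a value from the paper.
    """
    zero_hit = len(sig.scores) == 0
    top = max(sig.scores) if sig.scores else 0.0
    return {"zero_hit": zero_hit, "weak_top_match": (not zero_hit) and top < min_top_score}


def ranking_displacement(sig: RetrievalSignals) -> float:
    """Mean absolute position shift between recall order and final ranked order."""
    recall_pos = {doc: i for i, doc in enumerate(sig.retrieved_ids)}
    shifts = [abs(i - recall_pos[doc])
              for i, doc in enumerate(sig.ranked_ids) if doc in recall_pos]
    return sum(shifts) / len(shifts) if shifts else 0.0


def ndcg_at_k(relevances: List[float], k: int = 5) -> float:
    """NDCG@k with linear gain; relevances are listed in ranked order."""
    def dcg(rels: List[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

In this sketch, a query with no recalled candidates is reported as a zero-hit, a large displacement value suggests that the ranker is substantially reordering what recall returned, and `ndcg_at_k` returns 1.0 only when the ranked list is already in ideal relevance order.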