EXTENDED ABSTRACT: Scaling Test-Time Compute via Semantic Critique and Spectral Alignment for Visual Media Generation

Published: 12 May 2026, Last Modified: 12 May 2026, Venue: 2nd ViSCALE @ CVPR 2026 (Poster), Readers: Everyone, License: CC BY 4.0
Keywords: test-time scaling, inference optimization, diffusion models, vision-language models, agentic systems, training-free alignment, spectral fusion
TL;DR: We propose CritiFusion, a training-free inference-time scaling framework that uses multimodal agentic critique and spectral alignment to achieve dynamic preference optimization for visual generation.
Abstract: Recent generative models have achieved remarkable visual fidelity, yet faithfully aligning generated content with complex human semantics remains challenging. While the gains in structural alignment from scaling training compute have plateaued, scaling test-time compute is emerging as a strong alternative. We introduce CritiFusion, an efficient inference-time scaling framework that integrates multimodal foundation models as agentic critics. The CritiCore module scales test-time reasoning to produce high-level semantic feedback that dynamically guides the denoising process. To prevent test-time gradient updates from corrupting global geometry, we propose SpecFusion, which merges generation states in the spectral domain, preserving low-frequency layout constraints while injecting high-frequency semantic corrections. Preliminary results indicate that scaling our multi-LLM agent ensemble monotonically improves human-aligned metrics, offering a robust paradigm for training-free test-time preference optimization.
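To make the spectral merge described in the abstract concrete, here is a minimal NumPy sketch of one way such a fusion could work. It is not the authors' SpecFusion implementation: the function name `spectral_fuse`, the hard radial low-pass mask, and the `cutoff` parameter are all assumptions made for illustration, since the abstract does not specify how the frequency bands are split or blended.

```python
import numpy as np

def spectral_fuse(base, corrected, cutoff=0.25):
    """Hypothetical sketch: keep low frequencies of `base`, high frequencies of `corrected`.

    base, corrected: real arrays of identical shape (..., H, W), e.g. two latent states.
    cutoff: assumed low-pass radius as a fraction of the Nyquist frequency (0..1).
    """
    if base.shape != corrected.shape:
        raise ValueError("base and corrected must have the same shape")

    h, w = base.shape[-2:]
    # Frequency coordinates in cycles per sample, in the range [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    # Hard radial mask (an assumption): 1 inside the low-frequency disc, 0 outside.
    low_mask = (radius <= cutoff * 0.5).astype(float)

    base_f = np.fft.fft2(base, axes=(-2, -1))
    corr_f = np.fft.fft2(corrected, axes=(-2, -1))

    # Low band (global layout) from `base`, high band (fine corrections) from `corrected`.
    fused_f = low_mask * base_f + (1.0 - low_mask) * corr_f

    # The mask is symmetric in frequency, so real inputs give a real result
    # up to floating-point error; drop the residual imaginary part.
    return np.fft.ifft2(fused_f, axes=(-2, -1)).real


if __name__ == "__main__":
    layout = np.full((8, 8), 3.0)                    # pure low-frequency (DC) content
    detail = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0  # checkerboard, pure high frequency
    fused = spectral_fuse(layout, detail)
    print(np.allclose(fused, layout + detail))
```

In this toy usage the fused state keeps the constant layout from `layout` and the checkerboard detail from `detail`. A smooth mask (for example a Gaussian roll-off) would be a natural alternative to the hard cutoff assumed here.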
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5