Training-Free Modality-Agnostic Concept Sliders: Fine-Grained Control via Diffusion Models of Images, Audio, and Video
Keywords: Diffusion models, Training-free methods, Modality-agnostic, Semantic control in generative models
TL;DR: We propose a training-free, architecture/modality-agnostic method that performs inference-time estimation to enable fine-grained concept control across images, video, and audio.
Abstract: Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling *fine-grained controllable generation*, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work, we introduce a simple yet effective approach that is fully *training-free* and *modality-agnostic*, achieved by partially estimating the CS formula during inference. To support multimodal evaluation, we extend the CS benchmark to include both video and audio, establishing the first multimodal suite for fine-grained concept generation control. We further propose three modality-agnostic evaluation properties along with new metrics that more faithfully and broadly measure the desired properties. Finally, we identify the open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes the traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation.
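The abstract does not spell out how the CS formula is estimated at inference time; the snippet below is only a minimal illustrative sketch of the general idea, assuming a diffusers-style conditional UNet and hypothetical text embeddings (`base_emb`, `pos_emb`, `neg_emb`) for the neutral, concept-enhancing, and concept-suppressing prompts. It realizes a training-free "slider" as an inference-time contrastive guidance term, not the authors' exact formulation.

```python
# Sketch: training-free concept slider via inference-time contrastive guidance.
# Assumptions: a diffusers-style UNet2DConditionModel; embeddings are precomputed
# text-encoder outputs. `scale` plays the role of the slider value (may be negative).
import torch

@torch.no_grad()
def slider_noise_pred(unet, latents, t, base_emb, pos_emb, neg_emb, scale):
    """Estimate a concept direction at inference and apply it with a user-chosen scale."""
    # One batched UNet call covering the neutral, positive, and negative prompts.
    embs = torch.cat([base_emb, pos_emb, neg_emb], dim=0)
    lat3 = latents.repeat(3, 1, 1, 1)
    eps_base, eps_pos, eps_neg = unet(
        lat3, t, encoder_hidden_states=embs
    ).sample.chunk(3)
    # The contrastive difference serves as an estimated semantic direction,
    # added to the base prediction analogously to classifier-free guidance.
    return eps_base + scale * (eps_pos - eps_neg)
```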
Supplementary Material: zip
Primary Area: generative models
Submission Number: 9291