Training-Free Modality-Agnostic Concept Sliders: Fine-Grained Control via Diffusion Models of Images, Audio, and Video
Keywords: Diffusion models, Training-free methods, Modality-agnostic, Semantic control in generative models
TL;DR: We propose a training-free, architecture/modality-agnostic method that performs inference-time estimation to enable fine-grained concept control across images, video, and audio.
Abstract: Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling *fine-grained controllable generation*, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work, we introduce a simple yet effective approach that is fully *training-free* and *modality-agnostic*, achieved by partially estimating the CS formula during inference. To support multimodal evaluation, we extend the CS benchmark to include both video and audio, establishing the first multimodal suite for fine-grained concept generation control. We further propose three modality-agnostic evaluation properties along with new metrics that more faithfully and broadly measure the desired properties. Finally, we identify the open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes the traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation.
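The abstract does not spell out how the CS formula is estimated at inference time; the snippet below is only a minimal illustrative sketch of the general idea, assuming a diffusers-style conditional UNet and hypothetical text embeddings (`base_emb`, `pos_emb`, `neg_emb`) for the neutral, concept-enhancing, and concept-suppressing prompts. It realizes a training-free "slider" as an inference-time contrastive guidance term, not the authors' exact formulation.

```python
# Sketch: training-free concept slider via inference-time contrastive guidance.
# Assumptions: a diffusers-style UNet2DConditionModel; embeddings are precomputed
# text-encoder outputs. `scale` plays the role of the slider value (may be negative).
import torch

@torch.no_grad()
def slider_noise_pred(unet, latents, t, base_emb, pos_emb, neg_emb, scale):
    """Estimate a concept direction at inference and apply it with a user-chosen scale."""
    # One batched UNet call covering the neutral, positive, and negative prompts.
    embs = torch.cat([base_emb, pos_emb, neg_emb], dim=0)
    lat3 = latents.repeat(3, 1, 1, 1)
    eps_base, eps_pos, eps_neg = unet(
        lat3, t, encoder_hidden_states=embs
    ).sample.chunk(3)
    # The contrastive difference serves as an estimated semantic direction,
    # added to the base prediction analogously to classifier-free guidance.
    return eps_base + scale * (eps_pos - eps_neg)
```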
Supplementary Material: zip
Primary Area: generative models
Submission Number: 9291