Text Slider: Efficient and Precise Concept Control for Video Generation and Editing via LoRA Adapters
Keywords: Diffusion Model, Controllable Video Generation, Video Editing, Low-Rank Adaptation
TL;DR: We train sliders on the text encoder, allowing them to generalize across SD-1.5, SD-XL, and video generation without retraining, reducing parameters by up to 85% and training time by over 90%.
Abstract: Video generation and editing using diffusion models have made significant progress in recent years. While free-form text prompts provide flexible control over generation and attribute manipulation, existing methods still struggle to achieve fine-grained control over specific attributes. Moreover, expressing varying degrees of attribute intensity through text alone is often challenging. For example, describing subtle variations in a person's smile can be ambiguous and imprecise. Furthermore, existing methods suffer from limited adaptability and inefficient training. To address these limitations, we introduce Text Slider, a lightweight, efficient, and highly adaptable framework that identifies low-rank directions within a pre-trained text encoder, enabling precise control of visual concepts while significantly reducing training time and the number of parameters. Text Slider is plug-and-play, easily composable, and continuously modulated, providing enhanced controllability and fine-grained manipulation for video generation and editing. We demonstrate that Text Slider effectively attenuates or strengthens specific attributes while preserving the original input layout and structure, surpassing current state-of-the-art methods in controllable video synthesis.
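To make the idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of the general mechanism the abstract describes: a low-rank (LoRA) update attached to a frozen text-encoder linear layer, with a scalar strength that can be modulated continuously at inference time. The class and variable names (`LoRALinear`, `scale`, the 768-dim toy layer) are illustrative assumptions.

```python
# Hypothetical sketch of a "text slider": a LoRA adapter on a frozen text-encoder
# projection, scaled continuously at inference. Illustrative only, not the paper's code.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B(A x)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A: project to low rank
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B: project back up
        nn.init.zeros_(self.up.weight)                               # start as a no-op adapter
        self.scale = scale                                           # slider strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


# Toy stand-in for one projection inside a text encoder (e.g. a CLIP-style MLP layer).
hidden = nn.Linear(768, 768)
slider = LoRALinear(hidden, rank=4)

tokens = torch.randn(1, 77, 768)   # fake token embeddings
slider.scale = 0.0                 # slider off: original text features
base_out = slider(tokens)
slider.scale = 1.5                 # strengthen the learned concept direction
edited_out = slider(tokens)
# Difference is zero here because `up` is zero-initialized; after training the
# adapter, varying `scale` smoothly attenuates or strengthens the concept.
print((edited_out - base_out).abs().max())
```

Because only the small `down`/`up` matrices are trained while the text encoder stays frozen, such an adapter can in principle be reused with any generator that consumes the same text features, which is the intuition behind the cross-model generalization claimed in the TL;DR.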
Submission Number: 2