Text Slider: Efficient and Precise Concept Control for Video Generation and Editing via LoRA Adapters
Keywords: Diffusion Model, Controllable Video Generation, Video Editing, Low-Rank Adaptation
TL;DR: We train sliders on the text encoder, allowing them to generalize across SD-1.5, SD-XL, and video generation without retraining, reducing parameters by up to 85% and training time by over 90%.
Abstract: Video generation and editing using diffusion models have made significant progress in recent years. While free-form text prompts provide flexible control over generation and attribute manipulation, existing methods still struggle to achieve fine-grained control over specific attributes. Moreover, expressing varying degrees of attribute intensity through text alone is often challenging. For example, describing subtle variations in a person's smile can be ambiguous and imprecise. Furthermore, existing methods suffer from limited adaptability and inefficient training. To address these limitations, we introduce Text Slider, a lightweight, efficient, and highly adaptable framework that identifies low-rank directions within a pre-trained text encoder, enabling precise control of visual concepts while significantly reducing training time and the number of parameters. Text Slider is plug-and-play, easily composable, and continuously modulated, providing enhanced controllability and fine-grained manipulation for video generation and editing. We demonstrate that Text Slider effectively attenuates or strengthens specific attributes while preserving the original input layout and structure, surpassing current state-of-the-art methods in controllable video synthesis.
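To make the idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of the general mechanism the abstract describes: a low-rank (LoRA) update attached to a frozen text-encoder linear layer, with a scalar strength that can be modulated continuously at inference time. The class and variable names (`LoRALinear`, `scale`, the 768-dim toy layer) are illustrative assumptions.

```python
# Hypothetical sketch of a "text slider": a LoRA adapter on a frozen text-encoder
# projection, scaled continuously at inference. Illustrative only, not the paper's code.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B(A x)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A: project to low rank
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B: project back up
        nn.init.zeros_(self.up.weight)                               # start as a no-op adapter
        self.scale = scale                                           # slider strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


# Toy stand-in for one projection inside a text encoder (e.g. a CLIP-style MLP layer).
hidden = nn.Linear(768, 768)
slider = LoRALinear(hidden, rank=4)

tokens = torch.randn(1, 77, 768)   # fake token embeddings
slider.scale = 0.0                 # slider off: original text features
base_out = slider(tokens)
slider.scale = 1.5                 # strengthen the learned concept direction
edited_out = slider(tokens)
# Difference is zero here because `up` is zero-initialized; after training the
# adapter, varying `scale` smoothly attenuates or strengthens the concept.
print((edited_out - base_out).abs().max())
```

Because only the small `down`/`up` matrices are trained while the text encoder stays frozen, such an adapter can in principle be reused with any generator that consumes the same text features, which is the intuition behind the cross-model generalization claimed in the TL;DR.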
Submission Number: 2