Controllable Human Video Generation From Sparse Sketches

Published: 2025, Last Modified: 06 Nov 2025. IEEE Trans. Vis. Comput. Graph. 2025. License: CC BY-SA 4.0
Abstract: Recent advancements in human fashion video generation have transformed the field, producing various promising effects. Existing methods mainly focus on pose control but lack the ability to achieve sketch-based control, largely due to the absence of appearance-consistent and shape-varying knowledge in existing datasets. Moreover, the necessity of sequential structure inputs to control video generation hinders real-world applications. To address these limitations, we introduce Sketch2HumanVideo, an approach that, for the first time, achieves sketch-controllable human video generation with three conditions: temporally sparse sketches, a spatially sparse pose sequence, and a reference appearance image. Our key contribution is a sparse sketch encoder, which takes the first two conditions as input, enabling precise and multi-view control of shape motion. To supply the appearance-consistent and shape-varying knowledge missing from existing datasets, we leverage the expertise of two pretrained models to synthesize a dataset comprising shape-varying yet appearance-consistent examples for model training. Furthermore, we introduce an enlarging-and-resampling scheme to enhance high-frequency details of local regions in resource-constrained scenarios, thereby promoting the generation of realistic videos. Qualitative and quantitative experiments show that our method outperforms state-of-the-art approaches while offering flexible control.