Abstract: Human image animation from a single reference image faces a critical challenge: maintaining texture consistency when synthesizing novel viewpoints that deviate significantly from the reference view. Current methods often produce flickering and artifacts because they implicitly hallucinate occluded regions without cross-view guidance. To address this, we introduce Consistent Human Animation with Pseudo Multi-View Anchoring and Cross-Granularity Integration, which supplies pseudo multi-view references to provide more comprehensive visual cues and thereby enhance the coherence and fidelity of the generated animations. We first integrate the reference pose guider and rendered images of the Skinned Multi-Person Linear (SMPL) model as driving signals into the diffusion framework, synthesizing multi-view cues from a single image. Our cross-granularity guidance architecture then combines (1) an anchor frame selector that dynamically retrieves pose-aligned pseudo-views for view-adaptive local texture details, and (2) a hierarchical compression and fusion module that distills view-agnostic global features to ensure consistency and mitigate the ambiguity caused by anchor frame switching. Finally, a reference cross-attention layer injects these cross-granularity features into the animation generation process, enabling viewpoint-stable generation. Extensive experiments on the UBC Fashion Video dataset demonstrate significant reductions in flickering and artifacts under extreme viewpoint changes, validating our method's superiority in synthesizing consistent animations. These results highlight the effectiveness of the designed components in achieving consistent, high-quality human image animation with improved viewpoint consistency, advancing synthetic media generation.
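The abstract does not fix an implementation, but the two-stream cross-granularity guidance can be pictured with a small sketch. The PyTorch snippet below is a minimal illustration, assuming pseudo-view poses are compared with an L2 metric and global features are fused by mean pooling; the function names, tensor shapes, and similarity metric are hypothetical stand-ins, not the paper's actual design.

```python
# Hypothetical sketch of cross-granularity guidance: anchor selection for
# local detail plus pooled fusion for a view-agnostic global descriptor.
import torch

def select_anchor(target_pose, view_poses, view_feats):
    """Anchor frame selector (assumed L2 metric): return features of the
    pseudo-view whose SMPL pose parameters are closest to the driving pose.

    target_pose: (P,)          flattened pose of the current driving frame
    view_poses:  (V, P)        poses of the V rendered pseudo-views
    view_feats:  (V, C, H, W)  reference features per pseudo-view
    """
    dists = torch.linalg.vector_norm(view_poses - target_pose, dim=-1)  # (V,)
    return view_feats[torch.argmin(dists)]                              # (C, H, W)

def fuse_global(view_feats):
    """Stand-in for hierarchical compression/fusion: spatially pool each
    view, then average across views into one view-agnostic descriptor."""
    pooled = view_feats.mean(dim=(-2, -1))   # (V, C)
    return pooled.mean(dim=0)                # (C,)

# Toy usage: 4 pseudo-views, 72-dim SMPL pose, 64-channel 16x16 feature maps.
views = torch.randn(4, 64, 16, 16)
poses = torch.randn(4, 72)
local_feat = select_anchor(torch.randn(72), poses, views)  # view-adaptive detail
global_feat = fuse_global(views)                           # view-agnostic context
print(local_feat.shape, global_feat.shape)
```

In the full model, both streams would be injected into the denoising backbone through the reference cross-attention layer; the sketch only shows how the local (anchor-selected) and global (fused) cues could be produced.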
DOI: 10.1145/3731715.3733297