Keywords: human image generation, controllable diffusion model
TL;DR: An approach for pose-guided image generation that supports interaction-aware multi-person synthesis via soft attention modulation.
Abstract: Pose-guided human image generation aims to synthesize images of individuals performing specific actions based on pose conditions and textual descriptions. While current methods achieve promising results in single-person scenarios, they often struggle to generalize to multi-person settings, particularly under complex spatial interactions. Existing methods typically apply pose guidance uniformly across the image, leading to structural ambiguity and frequent limb entanglement in interaction zones. To tackle this challenge, we propose SoftPose, a novel approach that learns interaction-aware soft attention to adaptively modulate attention flow across pose regions, enabling finer-grained focus on both self-occlusion and cross-person occlusion areas. By modeling long-range, global spatial dependencies within and across pose regions, SoftPose effectively resolves ambiguities in interactive scenarios while preserving precise single-person pose fidelity. Additionally, we introduce a progressive feature injection strategy that balances global spatial coherence and local pose details across multiple scales. Extensive experiments demonstrate the superiority of SoftPose over current methods in generating high-quality multi-person images with complex interactions across varying scenes.
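The abstract gives no implementation details, but the core idea of soft attention modulation can be illustrated with a minimal sketch: scaled dot-product attention whose logits are biased by a learned per-position soft weight, so keys in interaction or occlusion regions are emphasized or suppressed continuously rather than hard-masked. The `region_weights` input and its role as an interaction-region score are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_modulated_attention(q, k, v, region_weights):
    """Scaled dot-product attention with soft per-key modulation.

    q: (n_query, d), k/v: (n_key, d).
    region_weights: (n_key,) soft weights in (0, 1], imagined here as
    coming from a hypothetical interaction-region predictor.
    A weight of 1 leaves a key untouched; smaller weights softly
    suppress it in log space (never a hard binary mask).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # (n_query, n_key)
    logits = logits + np.log(region_weights + 1e-8)[None, :]
    attn = softmax(logits, axis=-1)                    # rows sum to 1
    return attn @ v, attn
```

Because the modulation is additive in log space, attention rows remain valid distributions, and lowering a key's weight smoothly shifts probability mass to the remaining keys instead of zeroing it out.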
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10143