When Sketches Diverge, Language Converges: A Universal Feature Anchor for Domain-Agnostic Human Reconstruction
Keywords: Language and Vision, Multimodal Learning
TL;DR: Our framework leverages text descriptions to guide feature learning, creating domain-agnostic representations that transcend the synthetic-freehand divide.
Abstract: When humans sketch the same pose, no two drawings are alike. Synthetic sketches exhibit algorithmic precision with clean edges and consistent strokes, while freehand sketches diverge wildly—each bearing the unique abstraction, style, and imperfections of its creator. This fundamental divergence has long challenged 3D human reconstruction systems, which struggle to bridge the chasm between these disparate visual domains. We present a paradigm shift: while sketches diverge, language converges. A pose described as "arms raised overhead" carries the same semantic meaning whether drawn by algorithm or artist. Building on this insight, we introduce a universal feature anchor—natural language—that remains constant across visual variations. Our framework leverages text descriptions to guide feature learning, creating domain-agnostic representations that transcend the synthetic-freehand divide. At the technical core lies our Text-based Body Pose Head (TBPH), featuring a novel gating mechanism where language-derived features dynamically reweight spatial regions of sketch features. This text-guided attention enables the model to focus on semantically meaningful pose indicators while suppressing domain-specific noise and stylistic artifacts. By augmenting 26,000 sketch-pose pairs with rich textual descriptions, we enable cross-modal supervision that teaches our model to see past surface differences to underlying pose semantics. Extensive experiments demonstrate our method's superiority: we achieve 139.86mm MPJPE on freehand sketches, a 4.5% improvement over the state-of-the-art TokenHMR, and further outperform it by 11.08% in zero-shot generalization on a newly collected dataset. More importantly, we show true domain-agnostic performance—our model trained on both domains exhibits minimal degradation when tested on highly abstract amateur sketches. This work establishes language as a powerful intermediary for visual domain adaptation, opening new avenues for robust cross-domain understanding in computer vision.
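
The abstract describes the Text-based Body Pose Head (TBPH) as a gating mechanism in which language-derived features dynamically reweight spatial regions of sketch features. The following is a minimal PyTorch sketch of one way such a text-conditioned spatial gate could look; the module name TextGatedPoseHead, the feature dimensions, the sigmoid gate, and the SMPL-style pose output size are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code) of a text-conditioned spatial gating
# head, assuming a sketch encoder that outputs a feature map and a text
# encoder that outputs one sentence embedding per pose description.
import torch
import torch.nn as nn

class TextGatedPoseHead(nn.Module):
    """Hypothetical TBPH-style head: text features reweight spatial
    regions of the sketch feature map before pose regression."""

    def __init__(self, sketch_dim=512, text_dim=768, num_pose_params=72):
        super().__init__()
        # Project the text embedding into the sketch feature space.
        self.text_proj = nn.Linear(text_dim, sketch_dim)
        # Per-location gate computed from sketch features conditioned on text.
        self.gate = nn.Sequential(
            nn.Conv2d(sketch_dim * 2, sketch_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(sketch_dim, 1, kernel_size=1),
            nn.Sigmoid(),  # values in (0, 1): keep vs. suppress a region
        )
        # Regress pose parameters (assumed SMPL-style) from pooled gated features.
        self.pose_head = nn.Linear(sketch_dim, num_pose_params)

    def forward(self, sketch_feat, text_emb):
        # sketch_feat: (B, C, H, W) feature map from the sketch encoder
        # text_emb:    (B, D) sentence embedding of the pose description
        B, C, H, W = sketch_feat.shape
        t = self.text_proj(text_emb).view(B, C, 1, 1).expand(-1, -1, H, W)
        # Concatenate text and sketch features, predict a spatial gate map.
        gate_map = self.gate(torch.cat([sketch_feat, t], dim=1))  # (B, 1, H, W)
        gated = sketch_feat * gate_map          # reweight spatial regions
        pooled = gated.mean(dim=(2, 3))         # global average pooling
        return self.pose_head(pooled), gate_map

In this reading, spatial regions whose sketch features agree with the language description receive gate values near 1, while strokes carrying only domain-specific style are pushed toward 0, which matches the abstract's claim of suppressing stylistic artifacts.
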
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8945