HybridSketchNet: Sketch-based 3D Human Mesh Reconstruction via Hybrid Point-Image Networks

Fei Wang; Jiaxin Zhang; Zibo Liu; Hao Cai; Songhua Xu; Xiaonan Luo

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: sketch, 3D human, mesh reconstruction, parametric model

Abstract: Sketches are an efficient and effective tool for generating 3D human meshes with arbitrary body shapes and poses. However, current mesh reconstruction methods are mainly designed for natural images and are hard to apply to sketches, which are abstract and sparse. Moreover, there is no dataset with sufficient sketch-mesh pairs for developing and evaluating relevant methods. To tackle these issues, we introduce a hybrid framework that fits parametric human models (e.g., the skinned multi-person linear model, SMPL) to sketches in a coarse-to-fine manner. Specifically, the proposed framework consists of three core components: (i) Given a sketch image as input, a vision transformer-based Local Image Encoder (LIE) models the local structures of the sketch and yields a coarse mesh estimate. (ii) A Global Point Encoder (GPE), which takes the 2D coordinates of sketch contours as input, obtains a global representation of the sketch. (iii) As the local representation depicts human poses more precisely while the global representation is better suited to body shapes, we propose a graph-based refiner (GRefiner) that leverages the advantages of both representations to generate the final well-fitted mesh. Furthermore, we collect a large-scale dataset, dubbed Sketch3DS, containing approximately 10,000 paired sketches and human meshes with diverse poses and shapes. Extensive experiments on Sketch3DS demonstrate that the proposed approach outperforms existing methods, achieving accurate alignment between input sketches and reconstructed human meshes.
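
The abstract states that the GPE takes the 2D coordinates of sketch contours as input but does not say how those coordinates are obtained. As a rough illustration only, the sketch below shows one way a binary sketch image might be turned into a fixed-size set of normalized 2D points. The function name `sketch_to_points`, the treatment of every stroke pixel as a contour point, the random sampling, and the [-1, 1] normalization are all assumptions made for this example, not details taken from the paper.

```python
import random


def sketch_to_points(image, num_points=4, seed=0):
    """Sample stroke pixels of a binary sketch as normalized (x, y) points.

    Hypothetical helper: `image` is a list of rows, nonzero entries are strokes.
    """
    height, width = len(image), len(image[0])
    # Collect the (x, y) pixel index of every stroke pixel.
    stroke = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value]
    if not stroke:
        raise ValueError("sketch contains no stroke pixels")

    rng = random.Random(seed)
    if len(stroke) >= num_points:
        # Enough pixels: sample without replacement.
        chosen = rng.sample(stroke, num_points)
    else:
        # Too few pixels: sample with replacement to reach a fixed size.
        chosen = [rng.choice(stroke) for _ in range(num_points)]

    def normalize(index, size):
        # Map a pixel index in [0, size - 1] to [-1, 1].
        return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0

    return [(normalize(x, width), normalize(y, height)) for x, y in chosen]


if __name__ == "__main__":
    # A 3x3 toy "sketch" with a single diagonal stroke.
    toy = [[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]]
    print(sketch_to_points(toy, num_points=5))
```

Normalizing the coordinates makes the point set independent of image resolution, which is a common choice for point-based encoders; whether the authors do this is not stated.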

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 9189
