Abstract: Text-guided diffusion models have revolutionized static 3D generation, significantly accelerating progress in 4D content creation. However, applying diffusion models to 4D content creation poses significant challenges due to the complexity and diversity of motion. Text-to-4D customized generation requires large amounts of guidance data, and integrating diverse knowledge from multiple diffusion models remains difficult. To handle these challenges, we present Motion4D,
a novel framework for motion customization in 4D creation tasks that adopts a spatial-temporal slicing strategy in the generation process. First, the initialized 4D Gaussian field (XYZ-T) is temporally sliced along the time axis into 3D scenes corresponding to discrete time points. Second, along the spatial dimensions, the 3D objects are further decomposed into orthogonal multi-view images that capture geometric and appearance features from multiple perspectives. This spatial-temporal slicing yields a comprehensive representation of object motion and variation across both temporal and spatial dimensions, facilitating customized 4D modeling. Extensive experiments demonstrate that our method surpasses prior state-of-the-art methods in generation efficiency and motion consistency across a variety of prompts.
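To make the slicing strategy concrete, the minimal sketch below illustrates the two-stage decomposition the abstract describes: a 4D (XYZ-T) field is sliced into 3D snapshots at discrete time points, and each snapshot is decomposed into orthogonal views. All names (GaussianField4D, GaussianSnapshot, VIEW_AXES, spatial_temporal_slices) are illustrative assumptions, not the paper's API; the toy field uses linear per-Gaussian motion and an orthographic point projection where the actual pipeline would use a learned deformation and a differentiable Gaussian rasterizer with diffusion guidance.

```python
# A minimal, assumption-laden sketch of spatial-temporal slicing.
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np

# Assumed orthogonal view set: each view drops one axis (orthographic stand-in).
VIEW_AXES: Dict[str, Tuple[int, int]] = {"front": (0, 1), "side": (2, 1), "top": (0, 2)}

@dataclass
class GaussianSnapshot:
    """A 3D Gaussian scene obtained by fixing the time coordinate (temporal slice)."""
    t: float
    means: np.ndarray  # (N, 3) Gaussian centers at time t

class GaussianField4D:
    """Toy stand-in for a 4D (XYZ-T) Gaussian field: canonical 3D Gaussians
    plus per-Gaussian velocities, so a time slice is a linear deformation."""
    def __init__(self, means: np.ndarray, velocities: np.ndarray):
        self.means = means            # (N, 3) canonical positions
        self.velocities = velocities  # (N, 3) per-Gaussian motion

    def slice_at(self, t: float) -> GaussianSnapshot:
        # Temporal slicing: evaluate the field at a discrete time point.
        return GaussianSnapshot(t=t, means=self.means + t * self.velocities)

def render(snapshot: GaussianSnapshot, view: str) -> np.ndarray:
    # Spatial slicing: project the 3D snapshot to one orthogonal view.
    # A real system would rasterize Gaussians; this keeps two coordinates.
    i, j = VIEW_AXES[view]
    return snapshot.means[:, [i, j]]

def spatial_temporal_slices(field: GaussianField4D,
                            num_timesteps: int) -> List[np.ndarray]:
    """Temporal slicing first, then per-slice multi-view decomposition."""
    images = []
    for t in np.linspace(0.0, 1.0, num_timesteps):
        snapshot = field.slice_at(t)       # 3D scene at time t
        for view in VIEW_AXES:             # orthogonal multi-view images
            images.append(render(snapshot, view))
    return images

# Usage: 8 time slices x 3 views = 24 projections of a 100-Gaussian toy field.
field = GaussianField4D(np.random.rand(100, 3), 0.1 * np.random.rand(100, 3))
images = spatial_temporal_slices(field, num_timesteps=8)
```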