4DEditPro: Progressively Editing 4D Scenes from Monocular Videos with Text Prompts

Submitted to ICLR 2025

Abstract

Editing 4D scenes using text prompts is a novel task made possible by advances in text-to-image diffusion models and differentiable scene representations. However, conventional approaches typically take multi-view images or videos with camera poses as input. When applied to monocular videos, they produce inconsistent results because they rely on iterative per-image editing and lack multi-view supervision. Furthermore, these techniques usually require external Structure-from-Motion (SfM) libraries for camera pose estimation, which can be impractical for casual monocular videos. To tackle these hurdles, we present 4DEditPro, a novel framework that enables consistent 4D scene editing on casual monocular videos with text prompts. In 4DEditPro, the Temporally Propagated Editing (TPE) module guides the diffusion model to ensure temporal coherence across all input frames during scene editing. In addition, the Spatially Propagated Editing (SPE) module introduces auxiliary novel views near the camera trajectory to enhance the spatial consistency of edited scenes. 4DEditPro employs a pose-free 4D Gaussian Splatting (4DGS) approach for reconstructing dynamic scenes from monocular videos, which progressively recovers relative camera poses, reconstructs the scene, and facilitates scene editing. Extensive experiments, including quantitative evaluations and user studies, demonstrate the effectiveness of our approach.


The demo video of 4DEditPro.

Method



Overview of the proposed 4DEditPro pipeline. The TPE module generates a temporally consistent edited video sequence, the SPE module interpolates and refines novel views near the camera trajectory of the original monocular video, and a progressive 4D Gaussian representation estimates camera poses and reconstructs the 4D scene.
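To make the idea of auxiliary novel views near the camera trajectory more concrete, the sketch below places extra camera poses between consecutive estimated poses. It is an illustration under stated assumptions, not the SPE module itself: the pose format (a unit quaternion plus a translation vector), the choice of spherical linear interpolation, and the names `slerp`, `interpolate_pose`, and `auxiliary_views` are all hypothetical and do not come from the paper.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one to follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: linear interpolation is stable here.
        q = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def interpolate_pose(pose_a, pose_b, t):
    """Blend two poses, each given as (quaternion, translation); assumed format."""
    (qa, ta), (qb, tb) = pose_a, pose_b
    rotation = slerp(qa, qb, t)
    translation = tuple(a + t * (b - a) for a, b in zip(ta, tb))
    return rotation, translation


def auxiliary_views(poses, per_gap=1):
    """Return `per_gap` evenly spaced poses between each consecutive pose pair."""
    views = []
    for pose_a, pose_b in zip(poses, poses[1:]):
        for k in range(1, per_gap + 1):
            views.append(interpolate_pose(pose_a, pose_b, k / (per_gap + 1)))
    return views
```

In this sketch, each interpolated pose could be rendered from the reconstructed scene to obtain a view that the editing stage then refines. How 4DEditPro actually samples and refines these views is not specified on this page.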


Some Results

Prompt: "An origami black swan is swimming over the river." (videos: Original, GSEditor-4D, Ours)
Prompt: "A rhino is walking at night." (videos: Original, GSEditor-4D, Ours)
Prompt: "A Steampunk boat is sailing." (videos: Original, GSEditor-4D, Ours)