Abstract: In this paper, CP3, a 3D pop-out video generation framework based on multi-modal models, is proposed to address the limitations of existing video generation technology in precisely controlling 3D pop-out effects. 3D pop-out effects create an immersive visual experience by altering the disparity of a particular object so that it appears to extend beyond the screen. Although software tools have made some progress in this area, there is currently no effective way to precisely control 3D pop-out effects while generating high-quality video, and the lack of high-quality 3D pop-out datasets remains a further bottleneck in the field. The proposed CP3 framework therefore leverages multi-modal models to help 3D video creators produce pop-out effects, enhancing the audience's sense of immersion and visual comfort and thereby advancing 3D effect generation technology. To support the training and evaluation of the framework, a new dataset of 37,000 frames with pop-out effects is constructed, annotated with text guidance, segmentation results, depth maps, optical flow, and the trajectory of the pop-out target. Using a 3D UNet built on a latent denoising diffusion mechanism, combined with the 3D-try module and Mask Encoder of the CP3 framework, remarkable results are achieved in generating 3D pop-out effect videos. Experimental results show that CP3 outperforms existing techniques in generating immersive 3D pop-out effects.
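The abstract describes a latent denoising diffusion model with a 3D UNet backbone, conditioned on encoded masks of the pop-out target. The PyTorch sketch below is a minimal illustration of that conditioning path under stated assumptions: the module names (MaskEncoder, TinyUNet3D), tensor shapes, and noise schedule are all hypothetical stand-ins, not the paper's actual CP3 implementation or its 3D-try module.

```python
# Hypothetical sketch: mask-conditioned latent diffusion for video.
# Per-frame pop-out masks are encoded to the latent grid, concatenated
# with noisy video latents, and a small 3D UNet predicts the noise.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskEncoder(nn.Module):
    """Encodes binary pop-out masks (B, 1, T, H, W) down to latent resolution."""
    def __init__(self, out_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(16, out_channels, kernel_size=3, stride=(1, 4, 4), padding=1),
        )

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        return self.net(mask)  # (B, out_channels, T, H/8, W/8)


class TinyUNet3D(nn.Module):
    """Stand-in for a 3D UNet: predicts diffusion noise from the
    concatenated latents + mask features, conditioned on the timestep."""
    def __init__(self, in_channels: int = 8, latent_channels: int = 4):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, 32), nn.SiLU(), nn.Linear(32, 32))
        self.down = nn.Conv3d(in_channels, 32, kernel_size=3, padding=1)
        self.up = nn.Conv3d(32, latent_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        temb = self.time_mlp(t.float().view(-1, 1))        # (B, 32)
        h = self.down(x) + temb.view(-1, 32, 1, 1, 1)      # broadcast over T, H, W
        return self.up(F.silu(h))                          # predicted noise


# One epsilon-prediction training step in latent space.
B, C, T, H, W = 2, 4, 8, 64, 64
z0 = torch.randn(B, C, T, H // 8, W // 8)         # clean video latents
mask = torch.rand(B, 1, T, H, W).round()          # pop-out target masks
t = torch.randint(0, 1000, (B,))                  # diffusion timesteps
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).view(-1, 1, 1, 1, 1) ** 2

noise = torch.randn_like(z0)
z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise  # forward diffusion

mask_feat = MaskEncoder()(mask)                   # align masks with latent grid
pred = TinyUNet3D()(torch.cat([z_t, mask_feat], dim=1), t)
loss = F.mse_loss(pred, noise)                    # standard denoising objective
```

Channel-wise concatenation of mask features is one common way to inject spatial conditioning into a diffusion UNet; the paper's text guidance would typically enter separately, e.g. via cross-attention, which this sketch omits for brevity.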
DOI: 10.1145/3746027.3755082