We provide videos generated by SyntheOcc and MagicDrive.

Because our work primarily focuses on precise 3D controllability for image generation, it is not specifically tailored to video generation.

As a preliminary attempt, we implement a plug-and-play module that uses cross-view and cross-frame attention to learn view-consistent or frame-consistent generation.
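To make the idea of cross-view and cross-frame attention concrete, here is a minimal NumPy sketch. It is not the SyntheOcc implementation. The tensor layout `(frames, views, tokens, channels)`, the single-head attention, the zero-initialized output projection, and all function names (`attention`, `cross_view_attention`, `cross_frame_attention`, `init_params`) are assumptions made for illustration. Cross-view attention lets tokens from all camera views of one frame attend to each other; cross-frame attention does the same across the frames of one view.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(tokens, w_q, w_k, w_v, w_out):
    """Single-head self-attention over the token axis, plus a residual connection.

    tokens: (batch, num_tokens, C). Each batch entry is attended independently.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return tokens + softmax(scores) @ v @ w_out


def cross_view_attention(x, params):
    # x: (T, V, N, C). Within each frame, tokens of all V views attend to each other.
    T, V, N, C = x.shape
    tokens = x.reshape(T, V * N, C)
    return attention(tokens, *params).reshape(T, V, N, C)


def cross_frame_attention(x, params):
    # x: (T, V, N, C). Within each view, tokens of all T frames attend to each other.
    T, V, N, C = x.shape
    tokens = x.transpose(1, 0, 2, 3).reshape(V, T * N, C)
    out = attention(tokens, *params)
    return out.reshape(V, T, N, C).transpose(1, 0, 2, 3)


def init_params(C, rng, zero_out=True):
    # Assumption: a zero-initialized output projection makes the module an
    # identity mapping at initialization, so it can be plugged into an
    # existing network without changing its outputs before training.
    w_q, w_k, w_v = (rng.normal(0, C ** -0.5, (C, C)) for _ in range(3))
    w_out = np.zeros((C, C)) if zero_out else rng.normal(0, C ** -0.5, (C, C))
    return w_q, w_k, w_v, w_out


# Hypothetical sizes: 2 frames, 6 camera views, 4 tokens per image, 8 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 8))
y = cross_view_attention(x, init_params(8, rng))
y = cross_frame_attention(y, init_params(8, rng))
```

The two functions differ only in which axis is folded into the token dimension before attention, which is why such a module can be attached to an image model without changing its other layers. With the zero-initialized output projection assumed here, the output equals the input until the projection is trained.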

Given that our core contribution does not lie in video generation, this experiment serves as a proof
of concept, demonstrating the potential adaptability of our framework. Future research may extend
our methodology to facilitate the generation of longer video sequences, thereby expanding the scope
and applicability of our framework.