Goku: Generative Flow Kit for Unified Image-Video Creation

Published: 27 Feb 2025, Last Modified: 16 May 2025, OpenReview Archive Direct Upload, Everyone, CC BY 4.0
Abstract: This paper presents Goku, a new family of joint image-and-video generation models based on rectified flow Transformers, designed to achieve industry-grade performance. We present the foundational elements required for high-quality visual generation, including data curation, model design, and flow formulation. Key contributions include a meticulous data filtering pipeline that ensures high-quality, fine-grained image and video data curation, and the pioneering use of rectified flow for enhanced interaction among video and image tokens. Goku models achieve superior performance in both qualitative and quantitative assessments. Notably, Goku achieves top scores on major benchmarks: 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 82.7 on VBench for text-to-video generation. We hope this report offers valuable insights into joint image-and-video generation models for the research community.