Animating images with interactive motion control has gained increasing attention in image-to-video (I2V) generation. Modern approaches typically take a Gaussian-filtered point-wise trajectory as the sole motion control signal. However, approximating trajectory flow with a Gaussian kernel severely limits the controllability of fine-grained movement and commonly fails to disentangle object motion from camera motion. To alleviate these issues, we present ReMoCo, a new region-wise motion controller that leverages a precise region-wise trajectory and a motion mask to regulate fine-grained motion synthesis and to identify the target motion category (i.e., object or camera motion), respectively. Technically, ReMoCo first estimates the flow maps of each training video via a tracking model, and then samples region-wise trajectories from multiple local regions to simulate the inference scenario. Instead of approximating the flow distribution via Gaussian filtering, our region-wise trajectory preserves the original flow information in each local area and thus characterizes fine-grained movement. A motion mask is simultaneously derived from the predicted flow maps to represent the holistic motion dynamics. To pursue natural and controllable motion generation, ReMoCo further strengthens video denoising with the additional conditions of region-wise trajectory and motion mask in a feature modulation manner. Moreover, we meticulously construct a benchmark, ReMoCo-Bench, consisting of 1.1K real-world, user-annotated image-trajectory pairs, for evaluating both fine-grained and object-level motion synthesis in I2V generation. Extensive experiments conducted on WebVid-10M and ReMoCo-Bench demonstrate the effectiveness of ReMoCo for precise motion control.
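For concreteness, the following minimal sketch shows one way the two control signals could be derived from dense flow maps: the region-wise trajectory keeps the raw flow inside each sampled region (no Gaussian smoothing), while the motion mask thresholds the flow magnitude to capture holistic motion. The tensor shapes, box-based region sampling, and the function name `region_trajectories_and_mask` are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def region_trajectories_and_mask(flow_maps, regions, motion_thresh=1.0):
    """Derive ReMoCo-style control signals from per-frame flow maps.

    flow_maps: (T, 2, H, W) dense flow from a tracking model (assumed given).
    regions:   list of (y0, y1, x0, x1) boxes simulating user-brushed areas.
    Returns a sparse region-wise trajectory map and a binary motion mask.
    """
    # Region-wise trajectory: preserve the original flow inside each sampled
    # region (no Gaussian approximation), zero elsewhere.
    traj = torch.zeros_like(flow_maps)
    for (y0, y1, x0, x1) in regions:
        traj[:, :, y0:y1, x0:x1] = flow_maps[:, :, y0:y1, x0:x1]

    # Motion mask: pixels whose flow magnitude exceeds a threshold in any
    # frame, giving a holistic picture of where motion occurs.
    magnitude = flow_maps.norm(dim=1)                      # (T, H, W)
    mask = (magnitude > motion_thresh).any(dim=0).float()  # (H, W)
    return traj, mask
```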
An overview of our Region-wise Motion Controller (ReMoCo) for controllable image-to-video generation. During training, ReMoCo first extracts the proposed region-wise trajectory and motion mask from the input video as control signals. A motion encoder then learns multi-scale features from these signals, which are injected into the 3D-UNet of SVD in a feature modulation manner. Meanwhile, LoRA layers are integrated into all attention modules of the transformer blocks to improve the optimization of motion-trajectory alignment. At inference, the region-wise trajectory and motion mask are derived from the user-provided trajectory and brushed region, and serve as guidance to calibrate video generation.
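As a rough illustration of the two conditioning mechanisms named above, the sketch below pairs a FiLM-style scale-and-shift modulation (one common form of feature modulation; the paper's exact formulation may differ) with a standard LoRA adapter for a frozen attention projection. Module names, shapes, and hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MotionModulation(nn.Module):
    """FiLM-style feature modulation (assumed form): motion-encoder features
    predict a per-channel scale and shift applied to a UNet block's
    activations at the matching resolution."""

    def __init__(self, motion_dim, unet_dim):
        super().__init__()
        self.to_scale_shift = nn.Conv2d(motion_dim, 2 * unet_dim, kernel_size=1)

    def forward(self, unet_feat, motion_feat):
        # unet_feat:   (B, C, H, W) activations inside the denoising UNet
        # motion_feat: (B, M, H, W) multi-scale motion features
        scale, shift = self.to_scale_shift(motion_feat).chunk(2, dim=1)
        return unet_feat * (1 + scale) + shift

class LoRALinear(nn.Module):
    """Standard low-rank adapter added in parallel to a frozen attention
    projection, as LoRA is conventionally applied."""

    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```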