Recent progress in diffusion-based video editing has shown remarkable potential, and these techniques are increasingly used in practical applications. However, they remain prohibitively expensive and particularly challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that make mobile video editing feasible. Building upon an existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. We then propose a new classifier-free guidance distillation scheme that handles multiple modalities, yielding a 3× on-device speed-up. Finally, we reduce the number of sampling steps to one (a 10× speed-up) by introducing a novel adversarial distillation scheme that, in contrast to prior art, preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices while maintaining high editing quality.
[Figure: example video editing results. Each input video is shown alongside edited frames for prompts such as "In Chinese ink style", "In caricature style", "In pop art style", "Turn him into Silver Surfer", "Add wrinkles", "Add sunglasses", "In Pixar 3D style", "Turn him into vampire", "In pencil drawing style", "Make him bronze", "Turn him into Hulk", "In Minecraft style", "In Monet style", "Make him wooden", "Make it desert", "Turn the swan into flamingo", "Add grass", "Add snow", "Make him zombie", "Make him yeti", "Make her hair blonde", and "Add fire".]