Five-Mode Tucker-LoRA for Video Diffusion on Conv3D Backbones

ICLR 2026 Conference Submission 14446 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Video Diffusion, Parameter-Efficient Finetuning, LoRA, Tensor Decomposition, Tucker, 3D Convolution
Abstract: Parameter-efficient fine-tuning for text-to-video diffusion remains challenging. Most LoRA-style adapters either flatten 3D kernels into 2D matrices or add temporal-only modules, which breaks the native structure of Conv3D backbones. We present a five-mode Tucker-LoRA that learns a Tucker residual directly on the 5-D convolutional weight update across output/input channels, time, height, and width. This preserves spatio-temporal geometry and enables mode-wise rank budgets; setting some ranks to one (or the temporal rank to zero) recovers common 2D or temporal-only adapters. We instantiate the adapter in VideoCrafter (Conv3D U-Net) and AnimateDiff (2D+motion) under a unified 16×224 evaluation protocol on MSR-VTT. The method achieves a favorable memory–quality trade-off compared with strong 2D/pseudo-3D baselines and reaches target FVD earlier in time-to-target analysis. Results and ablations suggest that respecting the full dimensionality of video kernels is key for budgeted, tensorized adaptation.
Supplementary Material: zip
Primary Area: optimization
Submission Number: 14446