Abstract: The recent success of text-to-image diffusion models, e.g.,
Stable Diffusion, has stimulated research on adapting them to
360-degree panorama generation. Prior work has demonstrated the feasibility of using conventional low-rank adaptation techniques on pre-trained diffusion models to generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises
questions about the underlying mechanisms enabling this
empirical success. We hypothesize, and examine empirically, that the
trainable components exhibit distinct behaviors when fine-tuned on panoramic data, and that such adaptation conceals
an intrinsic mechanism for leveraging the prior knowledge
within the pre-trained diffusion models. Our analysis reveals the following: 1) the query and key matrices in the
attention modules capture common information
shared between the panoramic and perspective domains, and are therefore less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting the pre-trained knowledge to the panoramic domain, and thus play a more critical role during fine-tuning for
panorama generation. We empirically verify these insights
by introducing a simple framework called UniPano, with
the objective of establishing an elegant baseline for future
research. UniPano not only outperforms existing methods
but also significantly reduces memory usage and training
time compared to prior dual-branch approaches, making it
scalable for end-to-end panorama generation at higher
resolutions. The code is available.
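
To make the second finding concrete, the following is a minimal PyTorch sketch of the kind of selective fine-tuning the analysis suggests: freezing the query/key projections of every attention layer while leaving the value/output projections trainable. The parameter names (to_q, to_k, to_v, to_out) follow the common diffusers UNet naming convention and are an assumption for illustration, not UniPano's actual code.

import torch.nn as nn

def freeze_qk_train_vo(unet: nn.Module) -> None:
    """Freeze Q/K attention projections; fine-tune only V/O projections."""
    for name, param in unet.named_parameters():
        if "to_q" in name or "to_k" in name:
            param.requires_grad = False   # Q/K: shared across domains
        elif "to_v" in name or "to_out" in name:
            param.requires_grad = True    # V/O: adapt to the panoramic domain
        else:
            param.requires_grad = False   # keep all non-attention weights frozen

# Example usage with a Stable Diffusion UNet from the diffusers library:
# from diffusers import UNet2DConditionModel
# unet = UNet2DConditionModel.from_pretrained(
#     "runwayml/stable-diffusion-v1-5", subfolder="unet")
# freeze_qk_train_vo(unet)
# trainable = [p for p in unet.parameters() if p.requires_grad]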