Abstract: Diffusion models have significantly advanced image generative tasks, including image generation, editing, and stylization. While text prompts are the most common form of guidance in generative models, audio is a valuable alternative: it naturally accompanies the corresponding scenes and carries abundant information for guiding image generative tasks. In this paper, we propose a novel and unified framework named Align, Adapt, and Inject (AAI) to explore the role of audio as a guidance cue, realizing audio-guided image generation, editing, and stylization within a single framework. Specifically, AAI first aligns the audio embedding with visual features, then adapts the aligned embedding into an AudioCue enriched with visual semantics, and finally injects the AudioCue into an existing Text-to-Image diffusion model in a plug-and-play manner. Experimental results demonstrate that AAI extracts rich information from audio and outperforms previous work on multiple image generative tasks.
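To make the three stages concrete, the following is a minimal, hypothetical sketch of the Align-Adapt-Inject flow described in the abstract. The module names, dimensions, and the choice of simple linear projections are illustrative assumptions, not the authors' implementation; it only shows how an aligned audio embedding could be expanded into AudioCue tokens and concatenated with text-encoder tokens so a frozen Text-to-Image diffusion model can attend to both through its existing cross-attention.

```python
# Illustrative sketch only: all names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn

class AudioAligner(nn.Module):
    """Align: project a raw audio embedding into the visual (e.g. CLIP image) space."""
    def __init__(self, audio_dim=512, visual_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, visual_dim), nn.GELU(),
                                  nn.Linear(visual_dim, visual_dim))

    def forward(self, audio_emb):                      # (B, audio_dim)
        return self.proj(audio_emb)                    # (B, visual_dim)

class AudioCueAdapter(nn.Module):
    """Adapt: expand the aligned embedding into a short sequence of AudioCue tokens."""
    def __init__(self, visual_dim=768, text_dim=768, n_tokens=8):
        super().__init__()
        self.n_tokens, self.text_dim = n_tokens, text_dim
        self.to_tokens = nn.Linear(visual_dim, n_tokens * text_dim)

    def forward(self, aligned_emb):                    # (B, visual_dim)
        tokens = self.to_tokens(aligned_emb)
        return tokens.view(-1, self.n_tokens, self.text_dim)  # (B, n_tokens, text_dim)

def inject_audio_cue(text_tokens, audio_cue):
    """Inject: concatenate AudioCue tokens with text-encoder tokens so a frozen
    Text-to-Image diffusion model can condition on both (plug-and-play)."""
    return torch.cat([text_tokens, audio_cue], dim=1)  # (B, T + n_tokens, text_dim)

if __name__ == "__main__":
    audio_emb = torch.randn(2, 512)                    # e.g. from a pretrained audio encoder
    text_tokens = torch.randn(2, 77, 768)              # e.g. CLIP text-encoder output
    aligned = AudioAligner()(audio_emb)
    cue = AudioCueAdapter()(aligned)
    cond = inject_audio_cue(text_tokens, cue)
    print(cond.shape)                                  # torch.Size([2, 85, 768])
```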