Abstract: Diffusion models have revolutionized image generation by leveraging natural language to guide the creation of multimedia content. Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions (HOIs), especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. Our model couples Latent Diffusion Models (LDMs) with Visual Language Models (VLMs) to refine the generation process, ensuring precise depictions of HOIs. We propose an interaction-aware reasoning module to improve the interpretation of the interaction, along with an interaction correcting module that delicately refines the output image for more precise HOI generation. Through a meticulous process of pose selection and object positioning, ReCorD achieves superior fidelity in generated images while efficiently reducing computational requirements. We conduct comprehensive experiments on three benchmarks to demonstrate significant progress on the text-to-image generation task, showcasing ReCorD's ability to render complex interactions accurately, outperforming existing methods in HOI classification score as well as FID and Verb CLIP-Score.
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Generative Multimedia, [Generation] Multimedia Foundation Models
Relevance To Conference: This work introduces a pioneering approach to multimedia and multimodal processing by integrating Latent Diffusion Models (LDMs) with Visual Language Models (VLMs) to address the nuanced challenge of human-object interaction (HOI) in text-to-image (T2I) generation. By using LDMs to generate diverse pose options and VLMs to select the most suitable poses and object placements based on visual cues and textual descriptions, the method offers a principled mechanism for ensuring the fidelity and contextual coherence of generated images. Attention modulation techniques then refine the final image output, enhancing the quality and accuracy of the visual content produced from textual prompts. This integration marks a significant advancement in multimedia processing, facilitating the creation of more nuanced, contextually accurate visual content that aligns closely with textual descriptions. It addresses the critical challenge of spatial accuracy and the depiction of complex interactions in T2I synthesis, pushing the boundaries of what is achievable in multimodal processing and generation.
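To make the generate-reason-correct pipeline concrete, the sketch below illustrates the three stages described above: candidate generation with an LDM, VLM-based selection of pose and object placement, and attention-modulated correction. This is a minimal conceptual sketch, not the authors' implementation; all names (ldm, vlm, sample, score_interaction, resample_with_attention_guidance) are hypothetical placeholders.

```python
# Conceptual sketch of the generate-reason-correct pipeline described above.
# All names (ldm, vlm, sample, score_interaction,
# resample_with_attention_guidance) are hypothetical placeholders,
# not the paper's actual API.

def record_pipeline(prompt: str, ldm, vlm, num_candidates: int = 4):
    """Training-free HOI generation: sample candidates from an LDM,
    let a VLM pick the best pose/object placement, then correct it."""
    # 1. Candidate generation: sample diverse pose/layout proposals.
    candidates = [ldm.sample(prompt) for _ in range(num_candidates)]

    # 2. Interaction-aware reasoning: the VLM scores each candidate
    #    against the interaction described in the prompt and selects
    #    the most suitable pose and object placement.
    best = max(candidates, key=lambda c: vlm.score_interaction(prompt, c))

    # 3. Interaction correction: re-denoise with cross-attention
    #    modulation so the object lands in the selected region,
    #    refining the image without any additional training.
    return ldm.resample_with_attention_guidance(prompt, layout=best)
```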
Supplementary Material: zip
Submission Number: 963