Keywords: Layout Generation, Zero-shot, GPT4, Multi-modal models, Object detection
TL;DR: We perform zero-shot graphic layout generation with GPT-4V and explore various visual prompting methods to improve its visual grounding ability.
Abstract: Graphic layout design generation is a challenging problem in computer vision. The key challenge is placing textual elements coherently on the background image so that the result is aesthetically appealing and does not occlude key visual elements. Although prior methods have attempted to solve this multi-modal problem, they have not fully solved it. Because the task requires understanding the complex relationship between visual and text elements, we investigate GPT-4-Vision (GPT-4V), a large multimodal model (LMM), for zero-shot graphic layout design generation in a versatile manner. Our approach explores various off-the-shelf segmentation/superpixel methods to identify and mark key regions, visually augmenting the image to enhance GPT-4V's spatial reasoning capability. The results of our comprehensive experiments on a self-curated dataset demonstrate the efficacy of our proposed visual prompting methods, which improve over the standard GPT-4V prompting method and perform on par with, and for some techniques better than, a state-of-the-art specialist model. The code and data are available at https://anonymous.4open.science/r/VISUAL-PROMPTING-TECHNIQUES-FOR-GPT-4V-BASED-ZERO-SHOT-GRAPHIC-LAYOUT-DESIGN-GENERATION-5A6E
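Illustrative sketch (not the authors' implementation): the idea of marking regions on the image and asking the model to place text relative to those marks can be outlined in a few lines. The sketch below swaps the segmentation/superpixel step for a plain regular grid, and the names `RegionMark`, `grid_region_marks`, and `build_prompt`, as well as the prompt wording, are hypothetical. It only computes where numbered marks would be drawn and assembles a text prompt; drawing the marks and calling GPT-4V are left out.

```python
from dataclasses import dataclass


@dataclass
class RegionMark:
    label: int     # numeric mark that would be drawn on the image
    box: tuple     # (left, top, right, bottom) in pixels
    center: tuple  # (x, y) position where the mark would be drawn


def grid_region_marks(width, height, rows, cols):
    """Split the image into a rows x cols grid.

    Assumption: a regular grid stands in for a real segmentation or
    superpixel method, which would produce irregular regions instead.
    """
    marks = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            center = ((left + right) // 2, (top + bottom) // 2)
            label = r * cols + c + 1  # labels run 1..rows*cols, row by row
            marks.append(RegionMark(label, (left, top, right, bottom), center))
    return marks


def build_prompt(marks, text_elements):
    """Assemble a hypothetical text prompt that refers to the marked regions."""
    lines = ["The image is annotated with numbered regions:"]
    for m in marks:
        lines.append(f"- region {m.label}: box={m.box}")
    lines.append("Place each text element in a region that avoids key objects:")
    for t in text_elements:
        lines.append(f"- {t}")
    return "\n".join(lines)


marks = grid_region_marks(600, 400, rows=2, cols=3)
prompt = build_prompt(marks, ["Summer Sale", "Up to 50% off"])
```

With a real superpixel method, only `grid_region_marks` would change; the prompt would still refer to regions by their numeric marks.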
Submission Number: 153