Abstract: Highlights
• A novel VIE method, called Generative Compositor, leverages layout and prompt priors.
• Three pre-training tasks to improve the model’s spatial contextual capabilities.
• A prompt-aware resampler for distilling and merging the multi-modal embeddings.
• Significant improvements in few-shot settings.