Exploring Spatial Understanding Capability in Large Language Models: Proficiency in Layout Generation sans Visual Perception

ACL ARR 2024 June Submission 937 Authors

13 Jun 2024 (modified: 03 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Large language models (LLMs) consistently demonstrate strong performance across a wide range of natural language processing (NLP) tasks. However, research on their ability to process visual and spatial information, which is essential for understanding visually rich documents (VRDs), remains limited. This paper presents a pioneering study and benchmark specifically designed to evaluate the spatial competencies of LLMs in the context of VRDs. Our assessment covers a comprehensive range of dimensions, including spatial perception, positional prediction, information extraction, and layout generation. The results show that, despite lacking inherent visual perception mechanisms, LLMs can effectively infer spatial relationships within VRDs. In addition, we propose a layout-aware learning strategy for off-the-shelf LLMs that significantly improves their performance. Our findings make a meaningful contribution to the field of document intelligence, confirming the effectiveness of our methodology and pointing the way for future research in document analysis.
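To make the layout-aware setting concrete, below is a minimal sketch of how document layout could be exposed to a text-only LLM by serializing OCR words and their bounding boxes into a prompt. The serialization format, function names, and prompt wording are illustrative assumptions for this note, not the protocol used in the paper.

    # Illustrative sketch (assumed format, not the paper's method): serialize OCR
    # words and bounding boxes as plain text so a text-only LLM can reason about layout.
    from typing import List, Tuple

    Token = Tuple[str, Tuple[int, int, int, int]]  # (word, (x0, y0, x1, y1))

    def serialize_layout(tokens: List[Token]) -> str:
        """Render each word with its box, in top-to-bottom, left-to-right reading order."""
        ordered = sorted(tokens, key=lambda t: (t[1][1], t[1][0]))  # sort by y0, then x0
        return "\n".join(f"{word} [{x0},{y0},{x1},{y1}]" for word, (x0, y0, x1, y1) in ordered)

    def build_prompt(tokens: List[Token], question: str) -> str:
        """Combine the serialized layout with a spatial question for the LLM."""
        return (
            "The following words come from a scanned document. Each word is followed "
            "by its bounding box as [x0,y0,x1,y1] in pixel coordinates.\n\n"
            f"{serialize_layout(tokens)}\n\n"
            f"Question: {question}\nAnswer:"
        )

    if __name__ == "__main__":
        tokens = [("Invoice", (40, 20, 160, 50)),
                  ("Total:", (40, 300, 110, 330)),
                  ("$128.00", (120, 300, 220, 330))]
        print(build_prompt(tokens, "Which value appears to the right of 'Total:'?"))

The resulting prompt can be sent to any off-the-shelf LLM; the point of the sketch is only that spatial relations become inferable from coordinates expressed as text, without any visual input.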
Paper Type: Short
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: human evaluation, automatic evaluation, few-shot generation, interactive and collaborative generation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 937