Improving Language Understanding from Screenshots

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: An emerging family of language models (LMs), capable of processing both text and images within a single visual view, holds the promise of unlocking complex tasks such as chart understanding and UI navigation. We name them screenshot language models. Despite their appeal, existing screenshot LMs substantially lag behind text-only models on language understanding. To close this gap, we focus on a simplified setting where screenshots are rendered from plain text. We propose a novel Patch-and-Text Prediction (PTP) objective, in which we mask and recover both image patches of screenshots and text within screenshots. We also conduct careful ablation studies on masking rates, patch sizes, and designs for training stability. Our pre-trained model, while taking only visual inputs, achieves performance comparable to BERT (within 2%) on 6 out of 8 GLUE tasks and improves over prior work by up to 8% on specific datasets. Additionally, we extend PTP to train autoregressive screenshot LMs and demonstrate its effectiveness: our models significantly reduce perplexity by utilizing the screenshot context. Together, we hope our findings can inspire future research on developing powerful screenshot LMs and extending their reach to broader applications.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
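
The abstract describes the PTP objective only at a high level. Below is a minimal, illustrative sketch (in PyTorch) of what a combined masked-patch and masked-text objective could look like; all module names, shapes, masking rates, and the one-token-per-patch alignment are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a PTP-style objective: recover masked screenshot
# patches (pixel regression) and masked text (token prediction) jointly.
# Hypothetical names and shapes throughout; not the paper's code.
import torch
import torch.nn as nn

class PTPSketch(nn.Module):
    def __init__(self, patch_dim=768, vocab_size=30522, hidden=768):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, hidden)   # embed flattened screenshot patches
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2)
        self.patch_head = nn.Linear(hidden, patch_dim)   # reconstruct masked patch pixels
        self.text_head = nn.Linear(hidden, vocab_size)   # predict masked text tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden))

    def forward(self, patches, patch_mask, text_targets, text_mask):
        # patches: (B, N, patch_dim) pixel patches rendered from plain text
        # patch_mask / text_mask: (B, N) booleans marking masked positions
        # text_targets: (B, N) token ids (assumes one token target per patch slot)
        x = self.patch_proj(patches)
        x = torch.where(patch_mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.encoder(x)
        patch_loss = nn.functional.mse_loss(
            self.patch_head(h)[patch_mask], patches[patch_mask])
        text_loss = nn.functional.cross_entropy(
            self.text_head(h)[text_mask], text_targets[text_mask])
        return patch_loss + text_loss

if __name__ == "__main__":
    B, N, D = 2, 16, 768
    model = PTPSketch(patch_dim=D)
    patches = torch.randn(B, N, D)
    patch_mask = torch.rand(B, N) < 0.25    # patch masking rate is an assumption
    text_mask = torch.rand(B, N) < 0.15     # text masking rate is an assumption
    text_targets = torch.randint(0, 30522, (B, N))
    print(model(patches, patch_mask, text_targets, text_mask).item())
```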