VUT: Versatile UI Transformer for Multimodal Multi-Task User Interface Modeling

Published: 28 Jan 2022 | Last Modified: 13 Feb 2023 | ICLR 2022 Submitted | Readers: Everyone
Keywords: User Interface Modeling, Multimodal input, Multi-task learning, Transformer, Layout Detection, Language Grounding, Image Captioning, Screen Summarization, Tappability Prediction.
Abstract: User interface modeling is inherently multimodal, involving several distinct types of data: images, structures, and language. The tasks are also diverse, including object detection, language generation, and grounding. In this paper, we present VUT, a Versatile UI Transformer that takes multimodal input and accomplishes 5 distinct tasks simultaneously with the same model. VUT consists of a multimodal Transformer encoder that jointly encodes UI images and structures, and performs UI object detection when UI structures are absent from the input, together with an auto-regressive Transformer that encodes the language input and decodes output for both question answering and command grounding with respect to the UI. Our experiments show that for most tasks, when trained jointly on multiple tasks, VUT achieves accuracy on par with or exceeding that of models trained separately on individual tasks.
One-sentence Summary: The work addresses the unique challenges of multimodal multi-task learning across distinct tasks for user interface modeling.
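The abstract describes a two-part architecture: a multimodal encoder over UI images and structures that also performs object detection, plus an auto-regressive decoder for the language tasks. Below is a minimal, hypothetical PyTorch sketch of that layout; the module names, dimensions, and the query-based detection head are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VUTSketch(nn.Module):
    """Hypothetical sketch: multimodal encoder (image patches + UI structure
    tokens + detection queries) and an auto-regressive text decoder."""

    def __init__(self, d_model=256, vocab_size=10000, num_ui_types=32,
                 patch_dim=3 * 16 * 16, num_queries=100):
        super().__init__()
        # Project image patches and embed UI element types into a shared space.
        self.patch_proj = nn.Linear(patch_dim, d_model)
        self.ui_type_embed = nn.Embedding(num_ui_types, d_model)
        # Learned queries stand in for UI objects when structures are absent.
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        # Auto-regressive decoder shared by the language tasks
        # (captioning, summarization, QA, command grounding).
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.box_head = nn.Linear(d_model, 4)  # bounding boxes from queries

    def forward(self, patches, ui_types, text_tokens):
        # patches: (B, P, patch_dim); ui_types: (B, S); text_tokens: (B, T)
        img = self.patch_proj(patches)
        struct = self.ui_type_embed(ui_types)
        queries = self.object_queries.unsqueeze(0).expand(img.size(0), -1, -1)
        memory = self.encoder(torch.cat([img, struct, queries], dim=1))
        boxes = self.box_head(memory[:, -queries.size(1):])  # detection output
        tgt = self.token_embed(text_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(dec), boxes
```

In this sketch, the same encoder output feeds both the detection head and the text decoder, which is one plausible way to share a single model across the 5 tasks mentioned in the abstract.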