Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs
Abstract: Unifying models by reducing task-specific structures has been studied to facilitate the transfer of learned knowledge. A text-to-text framework has pushed this unification further. However, the framework remains limited because it does not accept input content that has a layout, and it rests on the basic assumption that a task can be solved in a single step. To address these limitations, in this paper we explore a new framework in which a model performs a task by manipulating displayed web pages over multiple steps. We develop two types of task web pages with different levels of difficulty and propose a BERT extension for the framework. We trained the BERT extension jointly on those task pages and made the following observations. (1) In five out of six tasks, the model maintains more than 80% of the performance of the original BERT fine-tuned separately in a single-step framework. (2) The model learned to solve tasks of both difficulty levels. (3) The model did not generalize effectively to unseen tasks. These results suggest that, although room for improvement exists, BERTs can be transferred to multi-step tasks, such as using graphical user interfaces.
Paper Type: long