Keywords: multimodal, document, command generation, document editing
Abstract: Language-guided document editing is a novel task in which a machine-parsable command and a bounding box are generated from an open-vocabulary user request. This paper introduces Doc2Command, a multi-task, multimodal model that unifies the document and the user request into a single visual modality and uses a transformer-based image encoder-text decoder to generate the command text. It also recasts bounding box detection as a segmentation task, handled by a mask transformer that operates on the image encoder's output. Doc2Command surpasses baseline models in command text generation, with improvements of 2-33\% on exact-match commands. It also outperforms existing baselines on bounding box detection by a margin of 12.19-31.65\%.
Submission Number: 95