Doc2Command: Furthering Language Guided Document Editing

Published: 19 Mar 2024, Last Modified: 04 Apr 2024
Venue: Tiny Papers @ ICLR 2024 Present
License: CC BY 4.0
Keywords: multimodal, document, command generation, document editing
Abstract: Language-guided document editing is a novel task that involves generating a machine-parsable command and a bounding box from an open-vocabulary user request. This paper introduces Doc2Command, a multi-task, multimodal model that unifies the document and the user request into a single visual modality and uses a transformer-based image encoder and text decoder to generate the command text. Additionally, it reframes bounding box detection as a segmentation task and employs a mask transformer operating on the image encoder. Doc2Command surpasses baseline models in command text generation, improving exact-match accuracy for commands by 2-33%. It also improves on existing baselines for bounding box detection by a margin of 12.19-31.65%.
Submission Number: 95
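
The abstract describes treating bounding box detection as a segmentation task. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a box can be rasterised into a binary mask to serve as a segmentation target, and that a box can be recovered from a predicted mask by thresholding and taking the tightest enclosing rectangle. The function names `box_to_mask` and `mask_to_box`, the 0.5 threshold, and the inclusive pixel convention are all illustrative assumptions; the abstract does not state how Doc2Command performs either step.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) with inclusive pixel indices.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Rasterise a box into a binary mask (a hypothetical segmentation target)."""
    x_min, y_min, x_max, y_max = box
    return [
        [1 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0
         for x in range(width)]
        for y in range(height)
    ]


def mask_to_box(mask: Sequence[Sequence[float]],
                threshold: float = 0.5) -> Optional[Box]:
    """Return the tightest box around pixels scoring above `threshold`.

    Returns None when no pixel exceeds the threshold.
    """
    # Rows and columns that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(v > threshold for v in row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0]))
            if any(row[x] > threshold for row in mask)]
    return cols[0], rows[0], cols[-1], rows[-1]


if __name__ == "__main__":
    target = box_to_mask((1, 2, 3, 4), height=6, width=8)
    print(mask_to_box(target))
```

A single enclosing rectangle is the simplest possible post-processing step. A predicted mask with stray foreground pixels would inflate the box, so a real pipeline would likely need some filtering before this step.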