Research Area: LMs with tools and code
Keywords: document editing, multimodal, visual programming
TL;DR: We propose an LLM agent framework that automates editing requests for content-rich documents via visual programming, grounding, and feedback mechanisms.
Abstract: Editing content-rich and multimodal documents, such as posters, flyers, and slides, can be tedious if the edits are complex, repetitive, or require subtle skills and deep knowledge of the editing software.
Motivated by recent advances in both Large Language Model (LLM) agents and multimodal modeling, we propose a framework that automates document editing: it takes as input a linguistic edit request from the user and performs sequential editing actions on the document to satisfy the request.
Our proposed method, Agent-DocEdit, first grounds the edit request directly in the underlying document structure to identify the elements that need to be manipulated. Then, we rely on the agent capabilities of LLMs to generate an edit program which calls a set of pre-defined APIs to modify the underlying structure of the document.
To improve the generated edit program, we leverage a feedback mechanism incorporating a deterministic code executor and a multimodal LLM.
We demonstrate the effectiveness of our modularized LLM editing agent on the DocEdit dataset, where Agent-DocEdit outperforms existing state-of-the-art methods by over 70% on document element grounding and over 16% on final rendition generation.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 265