Abstract: Completing paperwork is a challenging and time-consuming task.
Form filling is especially challenging in the pure-image domain without access to OCR, typeset PDF text, or a DOM.
For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool use.
We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user.
We find that baseline vision-language-action (VLA) models achieve less than 1\% accuracy in most cases, primarily due to poor localization ability.
GUI agents also struggle, scoring between 10.6\% and 68.0\% despite high cost and latency.
Therefore, we also contribute FieldFinder, a tool to assist LLMs in identifying where to place text on a form.
With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2\% to 56\%.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: image text matching, cross-modal application, cross-modal information extraction, multimodal applications, financial/business NLP, multihop QA, benchmarking
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6080