Keywords: VLM; Robustness
Abstract: Vision-language models (VLMs) excel at document summarization, yet their robustness to visual formatting variations in document images is not well understood. We present VITA (Visual In-image Text Analysis), a systematic framework that measures how realistic visual changes---text emphasis and structural formatting---affect summarization quality. We evaluate six VLMs spanning early, middle, and late fusion architectures at two model scales. Across lexical, semantic, and information preservation metrics, we find architecture- and scale-dependent vulnerabilities: early fusion loses more information despite higher lexical stability, whereas late fusion preserves information but exhibits larger lexical variation. Structural formatting induces larger degradations than text emphasis. Scaling mitigates emphasis sensitivity but can exacerbate structural-format vulnerabilities in late fusion, indicating that robust document understanding may require architectural innovations beyond scaling.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 9770