Visually-augmented pretrained language models for NLP Tasks without Images

Hangyu Guo; Kun Zhou; Xin Zhao; Qinyu Zhang; Ji-Rong Wen

Visually-augmented pretrained language models for NLP Tasks without Images

Hangyu Guo, Kun Zhou, Xin Zhao, Qinyu Zhang, Ji-Rong Wen

Published: 01 Feb 2023, Last Modified: 22 Jun 2025Submitted to ICLR 2023Readers: Everyone

Keywords: Visually-augmented pretrained language models for NLP tasks without images

Abstract: Although pre-trained language models (PLMs) have shown impressive perfor- mance by text-only self-supervised training, they are found lack of visual se- mantics or commonsense, e.g., sizes, shapes and colors of commonplace objects. Existing solutions often rely on explicit images for visual knowledge augmenta- tion (requiring time-consuming retrieval or generation), and they also conduct the augmentation for the whole input text, without considering whether it is actually needed in specific inputs or tasks. To address these issues, we propose a novel visually-augmented fine-tuning approach that can be generally applied to various PLMs or NLP tasks, without using any retrieved or generated images, namely VAWI. Specifically, we first identify the visually-hungry words (VH-words) from input text via a token selector, where three different methods have been proposed, including syntax-, attention- and learning-based strategies. Then, we adopts a fixed CLIP text encoder to generate the visually-augmented representations of these VH-words. As it has been pre-trained by visual-language alignment task on large-scale corpus, it is capable of injecting visual semantics into the aligned text representations. Finally, the visually-augmented features will be fused and trans- formed into several pre-designed visual prompts based on VH-words, which can be inserted into PLMs to enrich the visual semantics in word repersentations. We conduct extensive experiments on ten NLP tasks, i.e., GLUE benchmark, Com- monsenseQA, CommonGen and SNLI-VE. Experimental results show that our approach can consistently improve the performance of BERT, RoBERTa, BART and T5 at different scales, and outperform several competitive baselines signifi- cantly. Besides, the generated visual prompts of our framework can also be used for parameter-efficient tuning, which boosts the performance of T5-3B. We will make our code, data, and models publicly available.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/visually-augmented-pretrained-language-models/code)

28 Replies

Loading