MAP: Parameter-Efficient Tuning for Referring Expression Comprehension via Multi-Modal Adaptive Positional Encoding

Ruilin Yao, Yi Rong, Tianyu Zou, Bo Zhang, Jian Li, Shengwu Xiong, Shili Xiong

Published: 27 Oct 2025 · Last Modified: 04 Nov 2025 · License: CC BY-SA 4.0
Abstract: This paper studies the challenging task of Referring Expression Comprehension (REC), which aims to detect the text-referred target object in an input image. To achieve this, most recent works adapt powerful pretrained models by integrating additional structures (e.g., low-rank adaptation (LoRA) or adapter modules) to enable efficient parameter tuning. However, all of these methods process pretrained features in a position-agnostic manner. This limits their effectiveness in REC tasks, where positional information is essential to correctly localize the target object. To address this problem, we propose a novel parameter-efficient tuning approach, named Multi-Modal Adaptive Positional Encoding (MAP), which tackles it from a new perspective: positional encoding. More specifically, MAP first generates initial positional embeddings for different visual encoder layers from a set of learnable vectors, and then adjusts them adaptively based on the spatial-wise visual-linguistic correlations of the input data. In this way, the positional information of different image tokens can be appropriately modeled and utilized by MAP, making it better suited to REC tasks. Extensive experiments on five widely used datasets demonstrate that MAP achieves results comparable to full fine-tuning with far fewer extra parameters, and outperforms other parameter-efficient tuning approaches. Our source code is available at: https://github.com/Mr-Bigworth/MAP.
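The abstract's core idea (learnable per-layer positional vectors, modulated by the correlation between visual tokens and the text query) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the tensor sizes, the dot-product correlation score, and the multiplicative adjustment rule are all assumptions; the actual MAP architecture is in the paper and repository.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 16, 32  # hypothetical: 16 image tokens, 32-dim features

# A set of learnable vectors serving as the initial positional
# embeddings for one visual encoder layer (here just random values).
base_pos = rng.normal(size=(num_tokens, dim))

def adapt_positional_encoding(base_pos, visual_tokens, text_embed):
    """Adjust positional embeddings by spatial visual-linguistic correlation.

    Each image token's correlation with the text embedding is turned into
    a softmax weight over spatial positions; the base positional embedding
    is then scaled token-wise by that weight (one simple choice of
    "adaptive adjustment" -- an assumption, not the paper's exact rule).
    """
    scores = visual_tokens @ text_embed              # (num_tokens,) correlation
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # spatial attention weights
    return base_pos * (1.0 + weights[:, None])       # modulated positions

visual_tokens = rng.normal(size=(num_tokens, dim))   # pretrained visual features
text_embed = rng.normal(size=(dim,))                 # pooled expression embedding
adapted = adapt_positional_encoding(base_pos, visual_tokens, text_embed)
```

Only the small `base_pos` set (and whatever produces the adjustment) would be trained, which is what makes such a scheme parameter-efficient relative to full fine-tuning.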