Keywords: Image Captioning, Vision-Language Models (VLM), Logit Steering, Model Arithmetic.
TL;DR: We introduce Adaptive Weighted Proxy Tuning (AWPT), a cost-efficient gray-box framework that dynamically steers Large Vision-Language Models to match fully fine-tuned performance on specialized domains without requiring expensive parameter updates.
Abstract: Adapting Large Vision-Language Models (LVLMs) to specialized domains typically demands resource-intensive fine-tuning or access to proprietary parameters (``white-box'' access). While decoding-time strategies like Proxy Tuning offer a parameter-efficient alternative, they rely on rigid, static logit arithmetic that fails to account for instance-specific variations in model certainty and domain shift. In this work, we introduce Adaptive Weighted Proxy Tuning (AWPT), a gray-box steering framework that dynamically modulates the logit contributions of a large base model, a fine-tuned expert, and an untuned anti-expert. Unlike static approaches, AWPT introduces two instance-aware mechanisms: (1) a lightweight ViT-based Weight Predictor that performs amortized inference to estimate optimal mixing coefficients in real time with negligible added latency ($\sim$0.03s overhead), and (2) a Per-Sample Optimization objective that establishes theoretical performance bounds via gradient-based logit steering. Extensive evaluation across medical (ROCOv2, IU-Xray) and general domains (Flickr30k, MS COCO, TextCaps) demonstrates that AWPT achieves performance parity with fully fine-tuned models while requiring no parameter updates to the generator. Crucially, our dynamic weighting acts as an effective regularizer, significantly reducing object hallucinations and establishing AWPT as a robust solution for deploying general-purpose LVLMs in safety-critical contexts.
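The proxy-tuning arithmetic the abstract builds on can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the additive form, the split into separate expert/anti-expert weights `w_e` and `w_a`, and all function names are assumptions for illustration; AWPT's Weight Predictor would supply the weights per instance rather than the constants used here.

```python
import numpy as np

def awpt_logits(base, expert, anti, w_e, w_a):
    """Steer base-model logits with a weighted expert/anti-expert offset.

    Static proxy tuning corresponds to w_e = w_a = 1; an adaptive scheme
    like AWPT would predict w_e, w_a per input instead of fixing them.
    """
    return base + w_e * expert - w_a * anti

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary of 3 tokens.
base = np.array([1.0, 2.0, 0.5])
expert = np.array([0.5, 3.0, 0.1])
anti = np.array([0.5, 1.0, 0.1])

steered = awpt_logits(base, expert, anti, w_e=1.0, w_a=1.0)
probs = softmax(steered)
```

With both weights at zero the output reduces to the base model's logits, which is why the weights act as an interpolation knob between the untouched base model and a fully proxy-tuned ensemble.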
Submission Type: Emerging
Copyright Form: pdf
Submission Number: 292