ERFC: Energy-Aware Reinforcement Feedback Calibration for Zero-Shot Captioning

Qianyue Bao, Fang Liu, Licheng Jiao, Yang Liu, Shuo Li, Lingling Li, Xu Liu, Puhua Chen, Wenping Ma

Published: 2026, Last Modified: 25 Mar 2026. IEEE Trans. Circuits Syst. Video Technol. 2026. License: CC BY-SA 4.0
Abstract: Zero-shot captioning aims to generate descriptive captions for unseen image and video data by leveraging the potential of visual language models (VLMs) and language models (LMs) without requiring task-specific training. It has emerged as a critical task, but its performance is often hindered by the inherent gap between the training distribution and unseen test data. The fundamental challenge lies in the model’s strong dependence on the marginal distribution of the training data, which leads to biased predictions when handling test samples. To address this issue, we propose an Energy-aware Reinforcement Feedback Calibration (ERFC) framework to calibrate the distribution and predictions of caption models from a novel energy perspective. The calibration process of ERFC is divided into two key components: 1) We first construct an Energy Stabilizer (ES) based on the caption model, where energy is considered a measure of the affinity between the input sample and the model’s learned distribution. ES iteratively adjusts the embedding features of the input sample using Langevin Dynamics, reducing its energy to implicitly align the model’s distribution with the unseen target domain. 2) We deploy a Reinforcement Calibrator (RC) to refine and calibrate the generated captions through a reward-feedback mechanism. RC leverages the expert CLIP model as a reward signal to assess the quality of the generated captions and employs the policy gradient algorithm to reward or penalize the model, thereby improving its performance. By iteratively combining energy-based optimization and reward-driven calibration, ERFC achieves superior zero-shot generalization capabilities, as demonstrated on image benchmarks such as MSCOCO, Flickr30K, and NoCaps, as well as video benchmarks such as MSR-VTT and MSVD.
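The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the quadratic `energy`, the reference point `mu`, the fixed `rewards` list (standing in for CLIP scores), and every function name and hyperparameter below are assumptions chosen only to make the ideas concrete. The first part applies a Langevin-style update to an embedding to lower its energy. The second part applies an exact policy-gradient step to a softmax policy over a small set of candidate captions; a practical system would more likely use a sampled estimate.

```python
import math
import random


def energy(x, mu):
    """Toy quadratic energy: low when embedding x is close to mu (assumed stand-in)."""
    return 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def energy_grad(x, mu):
    """Gradient of the toy energy with respect to x."""
    return [xi - mi for xi, mi in zip(x, mu)]


def langevin_refine(x0, mu, steps=50, step_size=0.1, noise_scale=0.01, seed=0):
    """Langevin-style update: half-step down the energy gradient plus scaled Gaussian noise."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = energy_grad(x, mu)
        x = [
            xi - 0.5 * step_size * gi
            + noise_scale * math.sqrt(step_size) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, grad)
        ]
    return x


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=1.0):
    """One exact policy-gradient ascent step on expected reward J = sum_k p_k * r_k.

    For a softmax policy, dJ/dlogit_k = p_k * (r_k - J), so candidates scoring
    above the current expected reward are reinforced and the rest are penalized.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)]


if __name__ == "__main__":
    # Part 1: energy decreases as the embedding is refined.
    mu = [0.0] * 8
    x0 = [3.0] * 8
    x_refined = langevin_refine(x0, mu)
    print("energy before:", energy(x0, mu), "after:", energy(x_refined, mu))

    # Part 2: probability mass shifts toward the highest-reward candidate.
    rewards = [0.2, 0.9, 0.4]  # hypothetical reward scores for three candidate captions
    logits = [0.0, 0.0, 0.0]
    for _ in range(100):
        logits = policy_gradient_step(logits, rewards)
    print("caption probabilities:", softmax(logits))
```

In this sketch the two stages run independently. The abstract describes ERFC as combining them iteratively, and the sketch does not attempt to reproduce that coupling.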