Abstract: Controllable captioning has received much attention in recent years. Although substantial progress has been made, existing methods still face challenges such as high training costs, intricate control signals, and limited control capabilities. To address these issues, we propose a straightforward and unified framework called ControlCap. It uses a no-fuss lexicon as the control signal and controls the style and content of visual descriptions through Soft Guidance (a global guide to the caption distribution) and Hard Force (which integrates control signals without additional training). Extensive quantitative and qualitative experiments have been conducted on three benchmark captioning tasks. The results demonstrate the control ability of ControlCap: it produces controlled captions that are coherent and diverse while keeping the core content intact.