Abstract: Highlights
• Core Problem Identification: We identify the fundamental alignment challenge between images and text in open-vocabulary object detection.
• Bottleneck Adapter: We propose a lightweight plug-and-play module that distills key features and enhances cross-modal alignment.
• Transferable Prompt: We develop a novel training paradigm that learns generalizable prompt representations without architectural modifications.
• Performance Achievement: Our approach seamlessly improves open-vocabulary detectors’ generalization ability while maintaining efficient inference.
External IDs: dblp:journals/ijon/JiangQLZL26