Training Spatial-Frequency Visual Prompts and Probabilistic Clusters for Accurate Black-Box Transfer Learning
Abstract: Despite the growing prevalence of black-box pre-trained models (PTMs), such as prediction API services and proprietary software, directly applying general-purpose models to real-world scenarios remains challenging due to the gap between their training and target data distributions. Targeting scenarios with scarce data and constrained computational resources, this paper proposes a novel parameter-efficient transfer learning framework for vision recognition models in the black-box setting. Our framework incorporates two novel training techniques. First, we align the input space (i.e., images) of PTMs to the target data distribution by generating visual prompts in both the spatial and frequency domains. Alongside this spatial-frequency hybrid visual prompter, we design a training technique based on probabilistic clusters that enhances class separation in the output space (i.e., prediction probabilities). In experiments, our model demonstrates superior performance in few-shot transfer learning across a wide range of visual recognition datasets, surpassing state-of-the-art baselines. The proposed method also reduces computational costs in both the training and inference phases.
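As an illustration only, the following minimal sketch shows how a spatial-frequency hybrid visual prompt could wrap the input of a frozen model. It assumes a PyTorch setup; the module name `SpatialFrequencyPrompter` and all parameter shapes are hypothetical stand-ins, not the paper's actual design.

```python
import torch
import torch.nn as nn

class SpatialFrequencyPrompter(nn.Module):
    """Hypothetical sketch: learnable additive prompts in both the pixel
    (spatial) domain and the Fourier (frequency) domain, applied to each
    image before it is sent to a frozen black-box model."""

    def __init__(self, image_size: int = 224, channels: int = 3):
        super().__init__()
        # Additive prompt in the spatial domain (pixel space).
        self.spatial_prompt = nn.Parameter(
            torch.zeros(channels, image_size, image_size))
        # Frequency-domain prompt, stored as real/imaginary parts
        # (last dim of 2) so it can be viewed as a complex tensor.
        self.freq_prompt = nn.Parameter(
            torch.zeros(channels, image_size, image_size // 2 + 1, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image batch in [0, 1].
        spec = torch.fft.rfft2(x)                       # complex spectrum
        spec = spec + torch.view_as_complex(self.freq_prompt)
        x = torch.fft.irfft2(spec, s=x.shape[-2:])      # back to pixels
        # Add the spatial prompt and keep values in the valid image range.
        return (x + self.spatial_prompt).clamp(0.0, 1.0)
```

Note that in the black-box setting the PTM exposes no gradients, so prompts like these are typically optimized with zeroth-order (gradient-free) estimators rather than backpropagation through the model.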
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia and multimodal processing by addressing key challenges in applying black-box pre-trained models to real-world scenarios, particularly under data scarcity and limited computational resources. The proposed framework bridges the data distribution gap commonly encountered with general-purpose models, which is increasingly relevant as AI models shift toward black-box API formats.
Our approach not only adapts vision recognition models to target data distributions through spatial-frequency hybrid visual prompting but also introduces a training technique that enhances class separation via probabilistic clusters. This is crucial for improving the interpretability and efficacy of multimodal systems, where accurate, domain-specific model tuning is typically challenging.
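For concreteness, the sketch below shows one plausible way a probabilistic-cluster objective could sharpen class separation in the output space. It is an illustrative stand-in under assumed semantics (per-class cluster centers on the probability simplex, a pull/push margin loss), not the paper's actual formulation; the function name and `margin` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F

def probabilistic_cluster_loss(probs: torch.Tensor,
                               labels: torch.Tensor,
                               centers: torch.Tensor,
                               margin: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: treat each class as a cluster of prediction
    probability vectors and encourage class separation in output space.

    probs:   (B, K) black-box output probabilities for a batch
    labels:  (B,)   ground-truth class indices
    centers: (C, K) learnable per-class cluster-center logits
    """
    centers = F.softmax(centers, dim=-1)        # keep centers on the simplex
    # Squared distance of every sample to every cluster center: (B, C).
    dists = torch.cdist(probs, centers) ** 2
    # Pull each prediction toward its own class cluster.
    pull = dists.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Push it at least `margin` away from the nearest wrong cluster.
    dists_wrong = dists.scatter(1, labels.unsqueeze(1), float('inf'))
    push = F.relu(margin - dists_wrong.min(dim=1).values)
    return (pull + push).mean()
```

Operating purely on prediction probabilities keeps such an objective compatible with the black-box constraint, since it needs no access to the model's internal features.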
Furthermore, the potential to extend this methodology to vision-language multimodal models is particularly promising. In today's landscape, where resource-constrained users rely on API-driven solutions, our method offers a viable option for deploying AI with significantly reduced computational demands. By enhancing the interface between vision and language processing, our approach paves the way for broader applications in multimedia environments and pushes the boundaries of what multimodal processing can achieve.
Supplementary Material: zip
Submission Number: 1959