Keywords: Gray-box fine-tuning, Foundation models, Vision–language retrieval
TL;DR: We introduce Gray-box fine-tuning, which adapts foundation models with lightweight input/output adapters that use only gradient access, never touching backbone weights or layers.
Abstract: Modern foundation models achieve state-of-the-art performance across diverse modalities, yet fine-tuning them commonly requires modifying internal weights or inserting new layers. Such modifications increase deployment complexity, hinder optimization for edge devices, and risk exposing proprietary model parameters. In this paper we present the first analysis of existing fine-tuning paradigms along these three axes. Within this context we introduce "Gray-box" fine-tuning: a lightweight and deployment-friendly framework that adapts frozen backbones without altering their architecture or internal parameters. Gray-box fine-tuning enables adaptation solely via compact, external input/output adapters trained with controlled gradient signals at predefined model entry points, leaving all internal components unchanged. We introduce two variants: DarkGray-Box Adaptation (DGA), which restricts modifications strictly to the input and output interfaces, and LightGray-Box Adaptation (LGA), which additionally allows limited injection of learnable tokens at intermediate layers for enhanced adaptability. Extensive evaluations across tasks including text-to-image retrieval, video retrieval, image classification, sketch retrieval, and diffusion-based generation demonstrate that Gray-box methods achieve competitive performance relative to standard fine-tuning, despite significantly stricter constraints. By decoupling task-specific adaptation from internal model modifications, Gray-box fine-tuning provides an efficient, scalable, and secure alternative to conventional fine-tuning methods.
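The abstract's core mechanism (a frozen backbone adapted only through external input/output adapters, with gradients flowing through but never updating the backbone) can be illustrated with a minimal sketch. This is an illustrative reading of the DGA variant, not the authors' implementation; the class and adapter names here are hypothetical.

```python
import torch
import torch.nn as nn

class GrayBoxModel(nn.Module):
    """Hypothetical sketch of DarkGray-Box Adaptation (DGA):
    a frozen backbone wrapped by compact trainable input/output adapters."""

    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # internal weights are never modified
        # Compact external adapters at the model's entry and exit points.
        self.input_adapter = nn.Linear(dim, dim)
        self.output_adapter = nn.Linear(dim, dim)

    def forward(self, x):
        # Gradients flow *through* the frozen backbone to reach the
        # input adapter, but no backbone parameter accumulates a gradient.
        return self.output_adapter(self.backbone(self.input_adapter(x)))

# Stand-in backbone for illustration only.
backbone = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16))
model = GrayBoxModel(backbone, dim=16)
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
```

After `loss.backward()`, only the two adapters hold gradients; the backbone's parameters remain untouched, which is the deployment-friendly property the paper emphasizes.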
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 8681