Abstract: Low-shot image classification is a fundamental task in
computer vision, and the emergence of large-scale vision-language models such as CLIP has greatly advanced the
forefront of research in this field. However, most existing
CLIP-based methods lack the flexibility to effectively incorporate other pre-trained models that encompass knowledge
distinct from CLIP. To bridge the gap, this work proposes
a simple and effective probabilistic model ensemble framework based on Gaussian processes, which have previously
demonstrated remarkable efficacy in small-data regimes.
We achieve the integration of prior knowledge by specifying the mean function with CLIP and the kernel function
with an ensemble of deep kernels built upon various pre-trained models. By regressing the classification label directly, our framework enables analytical inference, straightforward uncertainty quantification, and principled hyperparameter tuning. Through extensive experiments on standard benchmarks, we demonstrate that our method consistently outperforms competitive ensemble baselines in predictive performance. Additionally, we assess the robustness of our method and the quality of its uncertainty estimates on out-of-distribution datasets. We also
illustrate that our method, despite relying on label regression, still enjoys superior model calibration compared to
most deterministic baselines.
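
To make the label-regression idea concrete, the following is a minimal sketch of Gaussian process regression on one-hot labels with a non-zero prior mean and a weighted sum of kernels. It is an illustration under stated assumptions, not the authors' implementation: the feature arrays stand in for embeddings from CLIP and other pre-trained models, `mean_train` and `mean_test` stand in for CLIP zero-shot class scores, and the RBF kernel, the fixed `weights`, and the shared `noise` level are hypothetical choices made only for this example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared Euclidean distances between rows of a and rows of b.
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def ensemble_kernel(feats_a, feats_b, weights):
    # Weighted sum of one kernel per feature extractor (assumed form).
    return sum(w * rbf_kernel(fa, fb)
               for w, fa, fb in zip(weights, feats_a, feats_b))

def gp_predict(train_feats, test_feats, y_onehot,
               mean_train, mean_test, weights, noise=0.1):
    # train_feats / test_feats: lists of (n, d_i) / (m, d_i) arrays,
    # one entry per pre-trained model (hypothetical inputs).
    n = y_onehot.shape[0]
    K = ensemble_kernel(train_feats, train_feats, weights) + noise * np.eye(n)
    K_s = ensemble_kernel(train_feats, test_feats, weights)  # shape (n, m)

    # Closed-form posterior via a Cholesky factorization of K.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_onehot - mean_train))
    post_mean = mean_test + K_s.T @ alpha  # (m, num_classes)

    # Predictive variance of the latent function, shared across classes.
    v = np.linalg.solve(L, K_s)
    prior_var = sum(weights)  # each RBF kernel has k(x, x) = 1
    post_var = prior_var - (v**2).sum(0)  # (m,)
    return post_mean, post_var
```

The predicted class for a test point is the argmax over the columns of `post_mean`, and `post_var` gives a per-point uncertainty without any sampling, which is what makes inference analytical in this setting.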