Flow-Based Alignment of Uni-Modal Vision and Text Encoders for Few-Shot Image Classification

ICLR 2026 Conference Submission 25102 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: few-shot classification, vision-language models, CLIP adaptation, alignment of uni-modal encoders, flow matching
TL;DR: A few-shot classification framework that aligns uni-modal image and text encoders via an orthogonal Procrustes map and flow matching, leveraging large-scale or domain-specialized models for adaptation.
Abstract: Few-shot classification with vision–language models remains challenging, particularly when relying on multi-modal encoders such as CLIP that are restricted to paired image–text data. We introduce FSF, a framework that leverages arbitrary uni-modal encoders, including vision or text models pretrained on broad or domain-specific corpora, and aligns them for cross-modal classification. FSF first applies a closed-form orthogonal Procrustes map to align image and text embeddings while preserving their geometry, and then trains a lightweight flow-matching prior that regularizes adaptation in the few-shot regime. At inference, images are classified by the cosine similarity, in the aligned feature space, between query embeddings and mapped class prototypes. Experiments on standard benchmarks, ImageNet variants, and VinDr-CXR, a large-scale chest X-ray benchmark, show that FSF can leverage stronger or specialized encoders, achieving accuracy competitive with or superior to that of recent adaptation methods.
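To make the two stages described in the abstract concrete, below is a minimal PyTorch sketch of (i) the closed-form orthogonal Procrustes alignment together with the cosine-similarity classification rule, and (ii) a generic linear-interpolant flow-matching loss. All function names, tensor shapes, and the mapping direction (text embeddings rotated into the image space) are illustrative assumptions, not the submission's actual code.

```python
import torch

def procrustes_align(X, Y):
    """Closed-form orthogonal Procrustes map: the orthogonal W that
    minimizes ||X @ W - Y||_F is W = U @ Vh, where U, S, Vh = SVD(X^T Y)."""
    U, _, Vh = torch.linalg.svd(X.T @ Y)
    return U @ Vh

def classify(query, prototypes, W):
    """Cosine similarity between a query image embedding and class
    prototypes mapped into the aligned space; returns the best class index."""
    mapped = torch.nn.functional.normalize(prototypes @ W, dim=-1)
    q = torch.nn.functional.normalize(query, dim=-1)
    return torch.argmax(mapped @ q).item()

def flow_matching_loss(v_theta, x0, x1):
    """Generic linear-interpolant flow-matching loss (one common variant,
    not necessarily the submission's): x_t = (1 - t) x0 + t x1, and the
    network regresses the constant target velocity x1 - x0."""
    t = torch.rand(x0.size(0), 1)
    x_t = (1 - t) * x0 + t * x1
    return ((v_theta(x_t, t) - (x1 - x0)) ** 2).mean()

# Toy usage with random embeddings; shapes are illustrative only.
d = 64
text = torch.randn(16, d)         # few-shot text/class embeddings (assumed)
image = torch.randn(16, d)        # paired image embeddings (assumed)
W = procrustes_align(text, image)
pred = classify(torch.randn(d), torch.randn(5, d), W)

net = torch.nn.Linear(d + 1, d)   # toy velocity network
v_theta = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = flow_matching_loss(v_theta, torch.randn(16, d), torch.randn(16, d))
```

Note that the Procrustes solution is orthogonal by construction, which is what preserves the geometry of the embedding spaces, as the abstract emphasizes.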
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 25102