HyperCLIP: Prompt-Conditioned Image Encoders for Contrastive Vision-Language Pre-training

TMLR Paper9001 Authors

17 May 2026 (modified: 29 May 2026); Under review for TMLR; CC BY 4.0
Abstract: CLIP-style image encoders are trained to be discriminative for every category set a user might supply, since the category set is unknown at training time. This makes the encoder's job harder than the job any single deployment actually requires, and is part of why small image encoders underperform large ones on zero-shot classification. In CLIP, the class prompts available at inference are used only to define the classifier head; we argue they carry more task structure than this role exposes, enough to also modulate the image encoder's feature extraction through a small channel (BatchNorm scale and bias). We provide evidence for this view by introducing HyperCLIP, a contrastive pre-training architecture in which a hypernetwork generates the BatchNorm scale and bias of a small image encoder directly from the class-prompt embeddings produced by the text encoder, with all three components trained jointly under the SigLIP loss. Across eight small vision backbones, HyperCLIP improves zero-shot accuracy over a matched SigLIP baseline by up to 3.3% on ImageNet-1K and 5.6% on CIFAR-100; the gains concentrate in BatchNorm-rich backbones, are equivalent to one step up the EfficientNet scaling ladder, and recover roughly half of what supervised BatchNorm fine-tuning can achieve, without any task labels and with no added inference-time cost.
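The conditioning mechanism described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the mean pooling of prompt embeddings, the single linear layer standing in for the hypernetwork, the "1 + offset" scale parameterisation, and all names (`mean_pool`, `hyper_bn_params`, `batch_norm`) and sizes are hypothetical. The sketch covers only how class-prompt embeddings could be mapped to a BatchNorm scale and bias and then applied to image features; it omits the encoders, the SigLIP loss, and training.

```python
import math
import random


def mean_pool(prompt_embeddings):
    """Average K class-prompt embeddings (each of length D) into one task vector.
    Mean pooling is an assumption of this sketch, not the paper's stated choice."""
    k = len(prompt_embeddings)
    d = len(prompt_embeddings[0])
    return [sum(e[j] for e in prompt_embeddings) / k for j in range(d)]


def linear(weights, bias, x):
    """Dense layer: `weights` is a list of output rows, each of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def hyper_bn_params(prompt_embeddings, w, b, num_channels):
    """Hypothetical hypernetwork: task vector -> (gamma, beta) for one BN layer."""
    out = linear(w, b, mean_pool(prompt_embeddings))
    # Assumed parameterisation: predict an offset from the identity scale.
    gamma = [1.0 + o for o in out[:num_channels]]
    beta = out[num_channels:]
    return gamma, beta


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalise each channel over the batch, then apply the generated scale and bias."""
    n = len(batch)
    c = len(batch[0])
    out = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        col = [row[ch] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        for i in range(n):
            out[i][ch] = gamma[ch] * (col[i] - mu) / math.sqrt(var + eps) + beta[ch]
    return out


random.seed(0)
D, C, K, N = 4, 3, 5, 8  # embedding dim, channels, class prompts, batch size (toy values)

# Stand-ins for text-encoder outputs and hypernetwork weights.
prompts = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
w = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(2 * C)]
b = [0.0] * (2 * C)

# Generated once per category set.
gamma, beta = hyper_bn_params(prompts, w, b, C)

# Stand-in for intermediate image features with shape (batch, channels).
features = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
modulated = batch_norm(features, gamma, beta)
```

In this sketch the generated `gamma` and `beta` depend only on the class prompts, not on the image, so they can be computed once per category set and reused for every image. This is consistent with the abstract's claim of no added inference-time cost.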
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Massimiliano_Mancini1
Submission Number: 9001