From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for VLM Weak Supervision Across Three Medical-Imaging Benchmarks

Published: 25 May 2026, Last Modified: 25 May 2026 | CTB@ICML 2026 | CC BY 4.0
Keywords: noisy-label theory, weak supervision, vision-language models, medical imaging benchmarks, predictive evaluation, capability quantification, foundation model labelers, BiomedCLIP, label noise crossover
TL;DR: Classical noisy-label theory predicts a crossover in VLM weak supervision; we calibrate it on three medical-imaging benchmarks across an 11x architecture sweep and re-emit the bound as a decision rule operable from 10-20 gold labels.
Abstract: Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an $11\times$ parameter range. The crossover predicted by theory appears at $n_g \approx 100$ gold labels on PCAM, $20$--$50$ on ISIC, and $250$--$500$ on NIH-CXR; above the crossover, weak labels reduce AUC by up to $0.10$. The crossover location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep ($2.5\times$ parameters, identical pretraining) confirms that the labeler, not the student, is the binding constraint. The calibration in turn yields a decision rule operable from $10$--$20$ gold labels: compare gold-only AUC to VLM accuracy on the user's gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
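The decision rule stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`accuracy`, `auc`, `should_use_weak_labels`), the binary-label setting, and the assumption that the gold-only scores are held-out (for example, cross-validated) predictions on the same small gold set are all hypothetical choices made for illustration. Following the abstract, weak labels are recommended only while the gold-only AUC is still below the VLM's accuracy on the gold set.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of gold labels that the VLM labeler reproduces."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC as the probability that a positive outranks a negative.

    Ties count as half a win (Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_use_weak_labels(
    gold_y: Sequence[int],
    vlm_pred: Sequence[int],
    gold_only_scores: Sequence[float],
) -> Tuple[bool, float, float]:
    """Hypothetical helper: compare gold-only AUC to VLM accuracy.

    gold_only_scores are assumed to be held-out scores from a classifier
    trained on gold labels only (an assumption of this sketch).
    Returns (use_weak, vlm_accuracy, gold_only_auc).
    """
    vlm_acc = accuracy(gold_y, vlm_pred)
    gold_auc = auc(gold_y, gold_only_scores)
    # Below the crossover the labeler is still the stronger signal;
    # once the gold-trained classifier matches it, weak labels stop helping.
    return gold_auc < vlm_acc, vlm_acc, gold_auc
```

With only 10 to 20 gold labels both estimates are noisy, so in practice one would likely want a margin or a confidence interval around the comparison rather than a bare inequality; the sketch omits this for brevity.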
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 123