Abstract: Few-shot learning deals with problems such as image classification using very few training
examples. Recent vision foundation models show excellent few-shot transfer abilities, but are
large and slow at inference. Knowledge distillation can transfer the capabilities of such high-performing but slow models to tiny, efficient models. However, common distillation methods require a large set of unlabeled data, which is not available in the few-shot setting. To overcome this lack of data, recent work has turned to synthetic data. We expand on this line of research by presenting TINT, a novel diffusion model inversion technique that combines the diversity of textual inversion with the specificity of null-text inversion.
Using this method in a few-shot distillation pipeline leads to state-of-the-art accuracy among
small student models on popular benchmarks, while being significantly faster than prior
work. Popular few-shot benchmarks involve evaluation over a large number of episodes, which is computationally cumbersome for methods that generate synthetic data. We also present a theoretical analysis of how the variance of the accuracy estimator depends on the
number of episodes and query examples, and use these results to lower the computational
effort required for method evaluation. Finally, to further motivate the use of generative models in few-shot distillation, we demonstrate that our method outperforms training on real data mined from the dataset used to train the diffusion model itself. Source code
is available at TBD [Released with the camera-ready version].
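
To make the variance claim concrete, the following is a minimal sketch of the standard decomposition such an analysis can build on; the notation ($N$ episodes, $Q$ query examples per episode, $\mu$, $\sigma^2_{\mathrm{ep}}$) is ours, and the paper's exact derivation may differ. Suppose episode $i$ has true accuracy $p_i$, drawn i.i.d. with mean $\mu$ and between-episode variance $\sigma^2_{\mathrm{ep}}$, and the measured episode accuracy $\hat{p}_i$ averages $Q$ conditionally independent Bernoulli($p_i$) query outcomes. For the benchmark estimator $\hat{A} = \frac{1}{N}\sum_{i=1}^{N} \hat{p}_i$, the law of total variance gives

\[
\operatorname{Var}(\hat{A}) = \frac{1}{N}\left( \sigma^2_{\mathrm{ep}} + \frac{\mathbb{E}\!\left[ p_i (1 - p_i) \right]}{Q} \right).
\]

A decomposition of this form makes explicit how estimator variance trades off between $N$ and $Q$: the query term decays as $1/(NQ)$ while the between-episode term decays only as $1/N$, so one can pick the cheapest $(N, Q)$ budget that achieves a target precision.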
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu-Xiong_Wang1
Submission Number: 3593