Abstract: Accurately detecting pain in infants remains a complex challenge. Conventional deep neural networks used to analyze infant cry sounds typically demand large labeled datasets and substantial computational power, and they often lack interpretability. In this work, we introduce a novel approach that prompts OpenAI's vision-language model GPT-4o with mel spectrogram representations of infant cries. This few-shot prompting strategy significantly reduces the dependence on large training datasets while enhancing transparency and interpretability. On the USF-MNPAD-II dataset, our method achieves an accuracy of 83.33% with only 16 training samples, in contrast to the 4,914 samples required by the baseline model. To our knowledge, this is the first application of few-shot prompting with vision-language models such as GPT-4o for infant pain classification.
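For concreteness, the sketch below shows one way such a pipeline could be assembled: cry recordings are rendered as mel spectrogram images and passed to GPT-4o together with a handful of labeled examples. The abstract does not specify the implementation, so the use of librosa and matplotlib for spectrograms, the OpenAI Python SDK, the prompt wording, and all file paths and labels here are assumptions for illustration only.

```python
# Hypothetical sketch of the prompting pipeline summarized in the abstract.
# Assumed (not stated in the source): librosa for mel spectrograms, matplotlib
# for rendering, and the OpenAI chat.completions API with base64 image inputs.
import base64
import io

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI


def cry_to_mel_image(wav_path: str) -> str:
    """Render a cry recording as a mel spectrogram PNG and return it base64-encoded."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    fig, ax = plt.subplots(figsize=(4, 3))
    librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode()


def classify_cry(query_wav: str, few_shot: list[tuple[str, str]]) -> str:
    """Few-shot prompt GPT-4o with labeled spectrograms, then a query spectrogram."""
    client = OpenAI()
    content = [{
        "type": "text",
        "text": ("Each image is a mel spectrogram of an infant cry. "
                 "Classify the final spectrogram as 'pain' or 'no pain'."),
    }]
    # Interleave the labeled few-shot examples before the unlabeled query.
    for wav_path, label in few_shot:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{cry_to_mel_image(wav_path)}"}})
        content.append({"type": "text", "text": f"Label: {label}"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{cry_to_mel_image(query_wav)}"}})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Illustrative file names and labels; the actual few-shot set in the paper
    # is drawn from the USF-MNPAD-II dataset.
    examples = [("pain_cry_01.wav", "pain"), ("no_pain_cry_01.wav", "no pain")]
    print(classify_cry("query_cry.wav", examples))
```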