Improving Food Recognition with Retrieval-Augmented and Domain-Adaptive LVLMs

Published: 01 Jan 2025 · Last Modified: 20 Jul 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Food recognition is pivotal for intelligent food recommendation systems and nutritional management, contributing to balanced diets and overall health. Although Large Vision-Language Models (LVLMs) have demonstrated impressive performance across various domains, their performance on food recognition still lags behind that of traditional vision models. To bridge this gap, this paper proposes two methods to improve the food recognition capabilities of LVLMs: Retrieval-Augmented Recognition (RAR) and Domain-Adaptive Recognition (DAR). On the one hand, the training-free RAR uses a vision model to retrieve relevant image-category pairs from an image-category memory pre-built from the training set, incorporating the retrieved categorical information into the LVLM's input to enhance recognition. On the other hand, DAR employs a two-stage training process: LVLMs are first pre-trained on diverse food analysis tasks and then fine-tuned on food recognition data. Extensive evaluations on two large-scale food recognition datasets demonstrate that both RAR and DAR improve the food recognition performance of LVLMs and that, compared to RAR, DAR achieves higher precision, outperforming traditional vision models.
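To make the training-free RAR pipeline concrete, the following is a minimal Python sketch of the retrieval-and-prompting step, assuming a CLIP-style image encoder and cosine-similarity search over a pre-built memory of training-set embeddings. The names used here (`encode_image`, `ImageCategoryMemory`, `build_rar_prompt`) and the prompt wording are illustrative assumptions; the abstract does not specify the exact vision model, memory structure, or prompt format.

```python
import numpy as np

def encode_image(image) -> np.ndarray:
    """Placeholder for the vision model's image encoder (e.g., a CLIP-style
    model). Assumed to return an L2-normalized feature vector; the actual
    encoder used in the paper is not specified in the abstract."""
    raise NotImplementedError

class ImageCategoryMemory:
    """Memory of (image embedding, category label) pairs pre-built from the
    training set, as described for RAR."""

    def __init__(self, embeddings: np.ndarray, categories: list[str]):
        # embeddings: (N, D) matrix of L2-normalized training-image features,
        # row-aligned with the category labels.
        self.embeddings = embeddings
        self.categories = categories

    def retrieve(self, query: np.ndarray, k: int = 5) -> list[str]:
        # On normalized vectors, cosine similarity is a dot product.
        sims = self.embeddings @ query
        top_k = np.argsort(-sims)[:k]
        return [self.categories[i] for i in top_k]

def build_rar_prompt(query_image, memory: ImageCategoryMemory, k: int = 5) -> str:
    """Training-free step: inject retrieved category candidates into the
    LVLM's input alongside the query image."""
    candidates = memory.retrieve(encode_image(query_image), k)
    return (
        f"Candidate food categories retrieved for this image: {', '.join(candidates)}.\n"
        "Based on the image and these candidates, identify the food category."
    )
```

Since the memory embeddings are normalized, the lookup reduces to a single matrix-vector multiply; at larger scale, an approximate-nearest-neighbor index (e.g., FAISS) would be the usual drop-in substitute for the brute-force search shown here.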