Evaluating Large Multimodal Models for Nutrition Analysis: A New Benchmark Enriched with Contextual Metadata

Published: 19 Aug 2025, Last Modified: 12 Oct 2025 · BHI 2025 · CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Large Multimodal Model, Nutrition Analysis, Portion Estimation, Prompt Engineering
Abstract: Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models such as GPT-4, leaving the broader range of LMMs underexplored. Additionally, the influence of integrating contextual metadata, and its interaction with various reasoning modifiers, remains largely uncharted. To this end, we introduce a meal image dataset slated for public release, in which each image is paired with dietitian-verified energy, macronutrient, and portion annotations, along with three metadata facets: GPS coordinates (location), timestamp (meal time), and a food-item list. We further benchmark eight state-of-the-art LMMs (four open-weight, four closed-weight), showing that interpreting the metadata and adding it to the prompt consistently reduces mean absolute error and mean absolute percentage error relative to baseline prompting. Incorporating interpreted metadata also amplifies the gains of reasoning modifiers such as Chain-of-Thought (CoT), Multimodal CoT, Scale-Hint, Few-Shot, and Expert-Persona. These consistent and measurable improvements highlight the potential of context-aware LMMs for nutrition analysis and underscore the role of contextual metadata in improving nutrition estimates.
Track: 5. Public Health Informatics
Registration Id: 83N63RRN38B
Submission Number: 251
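The abstract does not reproduce the paper's actual prompt templates. As a rough illustration only, the sketch below shows one way the three metadata facets (GPS coordinates, timestamp, food-item list) and a reasoning modifier might be folded into a baseline nutrition-analysis prompt. The function name `build_prompt`, the wording of each contextual hint, and the meal-time heuristic are assumptions for illustration, not the authors' method.

```python
from datetime import datetime


def build_prompt(
    gps: tuple[float, float] | None = None,
    timestamp: datetime | None = None,
    food_items: list[str] | None = None,
    modifier: str | None = None,
) -> str:
    """Compose a nutrition-analysis prompt, optionally enriched with
    interpreted contextual metadata and a reasoning modifier."""
    # Baseline instruction sent alongside the meal image.
    parts = [
        "Estimate the total energy (kcal) and macronutrients "
        "(protein, carbohydrate, fat in grams) for the meal in this image."
    ]

    # Interpreted metadata: each facet is rendered as a natural-language hint
    # rather than as raw values.
    if gps is not None:
        parts.append(
            f"Context: the photo was taken near latitude {gps[0]:.4f}, "
            f"longitude {gps[1]:.4f}; consider the regional cuisine."
        )
    if timestamp is not None:
        meal = (
            "breakfast" if timestamp.hour < 11
            else "lunch" if timestamp.hour < 16
            else "dinner"
        )
        parts.append(
            f"Context: the meal was eaten around {timestamp:%H:%M} ({meal})."
        )
    if food_items:
        parts.append("Context: the plate contains " + ", ".join(food_items) + ".")

    # Optional reasoning modifier appended to the enriched prompt
    # (hypothetical phrasings, named after the modifiers in the abstract).
    modifiers = {
        "cot": "Think step by step before giving the final numbers.",
        "scale_hint": "Use visible objects (fork, plate) to judge portion sizes.",
        "expert_persona": "Answer as a registered dietitian.",
    }
    if modifier:
        parts.append(modifiers[modifier])

    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        gps=(40.4406, -79.9959),
        timestamp=datetime(2025, 3, 14, 12, 30),
        food_items=["grilled chicken", "brown rice", "steamed broccoli"],
        modifier="cot",
    )
    print(prompt)
```

The same template could be passed to any of the benchmarked LMMs together with the meal image; only the string-building step is sketched here.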