FoodAgent: A Multi-modal Mixture of Experts Reasoning Agent for Divide-and-Conquer Food Nutrition Estimation
Confirmation: I have read and agree with the IEEE BSN 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Nutrition estimation, Large Language Model, Reasoning Agent, Mixture of Experts, Retrieval Augmented Generation
Abstract: Estimating nutrition from food images remains a challenging task, particularly for complex, multi-component dishes. While computer vision methods are effective at recognizing food elements, they typically treat entire meals as monolithic inputs and cannot decompose visual scenes into individual components. Large language models (LLMs), in contrast, offer strong identification and qualitative reasoning capabilities but struggle with quantitative estimation, especially when assessing the volume and mass of individual elements. In this work, we propose FoodAgent, a multi-modal Mixture-of-Experts (MoE) reasoning framework that improves nutrition estimation through a divide-and-conquer strategy. By decomposing dishes into distinct food components, FoodAgent dynamically routes each element to one of three specialized expert modules: (1) monocular volume estimation for nutritionally important and visually clear elements, (2) Retrieval-Augmented Generation (RAG) for important but visually ambiguous elements, and (3) direct LLM inference for minor components. This conditional expert selection aligns the estimation strategy with the visual and semantic characteristics of each food element, significantly reducing cumulative errors. Experiments show that our element-wise, MoE-driven approach outperforms holistic methods, especially in real-world dietary scenarios involving diverse and complex meals.
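The three-way routing described in the abstract can be sketched as a simple conditional dispatch. This is an illustrative sketch only; the names (`FoodElement`, `route_element`) and the boolean routing criteria are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class FoodElement:
    """A single component of a decomposed dish (hypothetical representation)."""
    name: str
    nutritionally_important: bool  # contributes substantially to the meal's nutrition
    visually_clear: bool           # geometry visible enough for monocular volume estimation

def route_element(element: FoodElement) -> str:
    """Select one of the three expert modules for a food element."""
    if element.nutritionally_important and element.visually_clear:
        # Expert 1: estimate volume directly from the monocular image
        return "monocular_volume_estimation"
    if element.nutritionally_important:
        # Expert 2: look up typical portions via retrieval-augmented generation
        return "retrieval_augmented_generation"
    # Expert 3: minor components go to direct LLM inference
    return "direct_llm_inference"
```

Routing each element independently, rather than estimating the whole plate at once, is what keeps errors from one component from compounding across the meal.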
Track: 13. General sensing and systems
Tracked Changes: pdf
NominateReviewer: Pengfei Zhang, Yutong Song
Submission Number: 105