FoodAgent: A Multi-modal Mixture of Experts Reasoning Agent for Divide-and-Conquer Food Nutrition Estimation
Confirmation: I have read and agree with the IEEE BSN 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Nutrition estimation, Large Language Model, Reasoning Agent, Mixture of Experts, Retrieval Augmented Generation
Abstract: Estimating nutrition from food images remains a challenging task, particularly for complex, multi-component dishes. While computer vision methods are effective at recognizing food elements, they typically treat entire meals as monolithic inputs and cannot decompose visual scenes into individual components. Large language models (LLMs), in contrast, offer strong identification and qualitative reasoning capabilities but struggle with quantitative estimation, especially when assessing the volume and mass of individual elements. In this work, we propose FoodAgent, a multi-modal Mixture-of-Experts (MoE) reasoning framework that improves nutrition estimation through a divide-and-conquer strategy. By decomposing dishes into distinct food components, FoodAgent dynamically routes each element to one of three specialized expert modules: (1) monocular volume estimation for nutritionally important and visually clear elements, (2) Retrieval-Augmented Generation (RAG) for important but visually ambiguous elements, and (3) direct LLM inference for minor components. This conditional expert selection aligns the estimation strategy with the visual and semantic characteristics of each food element, significantly reducing cumulative errors. Experiments show that our element-wise, MoE-driven approach outperforms holistic methods, especially in real-world dietary scenarios involving diverse and complex meals.
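The three-way routing described in the abstract can be sketched as a simple conditional dispatch. This is an illustrative sketch only; the names (`FoodElement`, `route_element`) and the boolean routing criteria are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class FoodElement:
    """A single component of a decomposed dish (hypothetical representation)."""
    name: str
    nutritionally_important: bool  # contributes substantially to the meal's nutrition
    visually_clear: bool           # geometry visible enough for monocular volume estimation

def route_element(element: FoodElement) -> str:
    """Select one of the three expert modules for a food element."""
    if element.nutritionally_important and element.visually_clear:
        # Expert 1: estimate volume directly from the monocular image
        return "monocular_volume_estimation"
    if element.nutritionally_important:
        # Expert 2: look up typical portions via retrieval-augmented generation
        return "retrieval_augmented_generation"
    # Expert 3: minor components go to direct LLM inference
    return "direct_llm_inference"
```

Routing each element independently, rather than estimating the whole plate at once, is what keeps errors from one component from compounding across the meal.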
Track: 13. General sensing and systems
Tracked Changes: pdf
NominateReviewer: Pengfei Zhang, Yutong Song
Submission Number: 105