Spatial-Aware Multi-Modal Information Fusion for Food Nutrition Estimation

Dongjian Yu, Weiqing Min, Xin Jin, Qian Jiang, Shuqiang Jiang

Published: 27 Oct 2025, Last Modified: 04 Nov 2025. License: CC BY-SA 4.0
Abstract: Food nutrition assessment plays a crucial role in maintaining health, preventing disease, and promoting scientific dietary habits. However, existing nutrition assessment methods often fail to fully consider the relationships between tasks, which limits their overall performance. Specifically, these methods face three major challenges: (1) task conflicts, where different tasks compete during joint optimization and degrade overall performance; (2) varying training difficulties across tasks, which cause imbalanced learning and weaken model generalization; and (3) the small scale and complex distribution of available datasets, which limit the robustness of the learned representations. To address these issues, we propose a novel method that reduces interference between tasks, dynamically focuses on more challenging tasks, and incorporates 3D spatial awareness to enrich multi-modal feature representations. First, we decouple the prediction network from the backbone and introduce a Cross-Attention-based Multi-Task Head module (CAMTH), effectively mitigating task interference and fully exploiting each task's learning potential. Second, we improve the loss function so that it adaptively focuses on more challenging tasks, raising overall model performance. Third, we design a 3D Feature Extraction Module (3D-FEM) and a Multi-Modal Feature Fusion module (MMFF), enabling the model to fully exploit the spatial information of food and strengthen its multi-modal feature representation. We validate our method through extensive experiments on the Nutrition5K dataset, comparing it with state-of-the-art (SOTA) models. The results show that our method achieves superior performance in nutrition estimation, demonstrating its effectiveness.
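
To make the two core ideas in the abstract concrete, the sketch below shows (a) a per-task head whose learnable query cross-attends to shared backbone features, so prediction heads are decoupled from the backbone, and (b) a difficulty-weighted loss that gives harder tasks larger weights. This is a minimal PyTorch illustration under assumed shapes and module names; the actual CAMTH, 3D-FEM, and MMFF designs and the paper's loss formulation are not specified in the abstract, and the five nutrition tasks listed here are hypothetical placeholders.

```python
# Illustrative sketch only: not the authors' implementation.
import torch
import torch.nn as nn

class CrossAttentionTaskHead(nn.Module):
    """One head per task: a learnable task query cross-attends to the shared
    fused tokens, so each task reads from the common representation without
    sharing its prediction parameters (the decoupling idea)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.regressor = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) fused multi-modal features from the backbone
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)               # (B, 1, dim)
        return self.regressor(out.squeeze(1)).squeeze(-1)   # (B,)

def difficulty_weighted_loss(preds, targets):
    """Toy adaptive weighting: tasks with larger current loss receive larger
    weights, so harder tasks dominate the gradient (one plausible reading of
    'adaptively focus on more challenging tasks')."""
    losses = torch.stack([nn.functional.l1_loss(p, t) for p, t in zip(preds, targets)])
    weights = torch.softmax(losses.detach(), dim=0)  # no gradient through weights
    return (weights * losses).sum()

if __name__ == "__main__":
    # Hypothetical setup: five nutrition targets (e.g. calories, mass, fat,
    # carbohydrates, protein) predicted from fused image/depth features.
    B, N, D = 2, 49, 256
    tokens = torch.randn(B, N, D)                    # stand-in for fused features
    heads = nn.ModuleList(CrossAttentionTaskHead(D) for _ in range(5))
    preds = [head(tokens) for head in heads]
    targets = [torch.rand(B) for _ in range(5)]
    loss = difficulty_weighted_loss(preds, targets)
    loss.backward()
    print(float(loss))
```

In this sketch the softmax over detached per-task losses is just one simple way to up-weight harder tasks each step; the paper's actual loss design may differ.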