Abstract: Nutrition plays a vital role in overall health and well-being. Building on a highly accurate nutrient estimation model, we develop a tool that displays nutritional values from food images, thereby reducing the labor intensity of dietary assessment. We propose a method that uses depth data alongside RGB images and incorporates an open-vocabulary segmentation process that separates food from non-food instances, coupled with a two-stage self-attention Transformer decoder. Our model outperforms the current state-of-the-art method, achieving an average percentage MAE of 17.2% on Nutrition5k, an RGB-D food image dataset annotated with calories, mass, and three macronutrients. Our study also examines the significance of the food and background regions for calorie, mass, and nutrient estimation. We analyze the impact of non-food regions on each estimation task, with results suggesting that background information is crucial for calorie, mass, and carbohydrate estimation but less essential for protein and fat estimation. The qualitative results further show that the model attends to regions with a high corresponding nutritional value. Implementation code and pre-trained models are provided at https://github.com/Oatsty/nutrition5k.
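To make the high-level description concrete, below is a minimal sketch of one plausible reading of the two-stage self-attention Transformer decoder. It is not the authors' released implementation: the module names, dimensions, and the staging of food-only versus full-image attention are illustrative assumptions; consult the linked repository for the actual code.

```python
# Hypothetical sketch of a two-stage Transformer decoder for nutrient
# regression. All names and design choices here are assumptions for
# illustration, not the authors' implementation.
import torch
import torch.nn as nn

class TwoStageNutrientDecoder(nn.Module):
    """Maps fused RGB-D tokens to five nutritional targets
    (calories, mass, fat, carbohydrate, protein)."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_targets: int = 5):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        # Assumed staging: stage 1 attends only to food-region tokens,
        # stage 2 refines the queries against all tokens (incl. background).
        self.stage1 = nn.TransformerDecoder(layer, num_layers=2)
        self.stage2 = nn.TransformerDecoder(layer, num_layers=2)
        # One learnable query per nutritional target.
        self.queries = nn.Parameter(torch.randn(num_targets, dim))
        self.head = nn.Linear(dim, 1)

    def forward(self, food_tokens: torch.Tensor, all_tokens: torch.Tensor) -> torch.Tensor:
        # food_tokens: (B, N_food, dim) -- tokens inside the open-vocabulary
        # food-instance masks; all_tokens: (B, N, dim) -- full image, RGB-D fused.
        q = self.queries.unsqueeze(0).expand(food_tokens.size(0), -1, -1)
        q = self.stage1(q, food_tokens)  # stage 1: food-only context
        q = self.stage2(q, all_tokens)   # stage 2: full-image context
        return self.head(q).squeeze(-1)  # (B, num_targets)

if __name__ == "__main__":
    decoder = TwoStageNutrientDecoder()
    food = torch.randn(2, 100, 256)    # dummy food-region tokens
    scene = torch.randn(2, 196, 256)   # dummy full-image tokens
    print(decoder(food, scene).shape)  # torch.Size([2, 5])
```

Under this reading, giving the second stage access to background tokens is consistent with the abstract's finding that non-food regions matter for calorie, mass, and carbohydrate estimation.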