Abstract: Large vision-language models (VLMs) excel at many visual tasks but struggle with weight estimation, which hinders 3D perception and embodied intelligence. To address the lack of large-scale weight datasets, we present FoodWeight1.4M, derived from real-world supermarket scenarios. It contains 1.4 million high-quality images across 1,550 food categories, with weights precisely measured and rigorously filtered, making it the first large-scale weight estimation dataset. We tested the weight estimation performance of current VLMs and found it unsatisfactory; it can be significantly improved by instruction tuning on FoodWeight1.4M. Moreover, we propose two strategies, Category-Guided and Reference Calibration, to enhance weight estimation without fine-tuning. Experiments confirm their effectiveness in improving multi-modal weight perception. Furthermore, experimental results show that pre-training on FoodWeight1.4M benefits other food analysis tasks. Our dataset will be publicly available soon.