Abstract: Food recognition has recently gained increasing attention in the multimedia community due to its various applications, e.g., multimodal foodlog and personalized healthcare. Most existing methods directly extract visual features from the whole image using popular deep networks, without considering the characteristics of food images. Compared with other types of object images, food images generally exhibit no distinctive spatial arrangement or common semantic patterns, which makes it hard to capture discriminative information. In this work, we develop an Ingredient-Guided Cascaded Multi-Attention Network (IG-CMAN) for food recognition, which sequentially localizes multiple informative image regions at multiple scales, moving from category-level to ingredient-level guidance in a coarse-to-fine manner. At the first level, IG-CMAN generates an initial attentional region from a category-supervised network with a Spatial Transformer (ST). Taking this localized region as a reference, IG-CMAN combines the ST with an LSTM to sequentially discover diverse attentional regions at finer scales from an ingredient-guided sub-network at the following levels. Furthermore, we introduce a new dataset, ISIA Food-200, with 200 food categories drawn from the Wikipedia food list, about 200,000 food images, and 319 ingredients. We conducted extensive experiments on two popular food datasets and the newly proposed ISIA Food-200, verifying the effectiveness of our method. Qualitative results and visualizations further show that IG-CMAN provides explainability for the localized regions and learns regions relevant to the ingredients.
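To make the mechanism the abstract describes more concrete, the following is a minimal PyTorch-style sketch of one cascaded attention step combining a Spatial Transformer with an LSTM. It is not the authors' implementation: the module name `AttentionStep`, the localization network `loc_net`, and all dimensions are illustrative assumptions.

```python
# Sketch of one cascaded attention level: an LSTM state drives a Spatial
# Transformer that crops/rescales the feature map into an attended region,
# which would feed the next, finer level. All names and sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStep(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        # feat_dim is assumed to equal the CNN feature-map channel count.
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        # Localization net: regress 6 affine parameters from the LSTM state.
        self.loc_net = nn.Linear(hidden_dim, 6)
        # Start from the identity transform, so attention begins with the
        # whole image and is refined coarse-to-fine during training.
        self.loc_net.weight.data.zero_()
        self.loc_net.bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, feat_map, h, c):
        # feat_map: (B, C, H, W) CNN features; h, c: LSTM hidden/cell state.
        pooled = F.adaptive_avg_pool2d(feat_map, 1).flatten(1)  # (B, C)
        h, c = self.lstm(pooled, (h, c))
        theta = self.loc_net(h).view(-1, 2, 3)  # affine transform per image
        grid = F.affine_grid(theta, feat_map.size(), align_corners=False)
        region = F.grid_sample(feat_map, grid, align_corners=False)
        return region, h, c  # attended region for the next attention level
```

In this reading, the category-supervised first level would produce the initial region, and later steps would be trained under ingredient supervision, matching the coarse-to-fine progression described above.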