Abstract: Pose variation and subtle differences in appearance are key challenges in fine-grained classification. While deep networks have markedly improved general recognition, many approaches to fine-grained recognition rely on anchoring networks to parts for better accuracy. Identifying parts to find correspondence discounts pose variation so that features can be tuned to appearance. To this end, previous methods have examined how to find parts and extract pose-normalized features. These methods have generally separated fine-grained recognition into stages that first localize parts using hand-engineered, coarsely localized proposal features and then separately learn deep descriptors centered on the inferred part positions. We unify these steps in an end-to-end trainable network, supervised by keypoint locations and class labels, that localizes parts with a fully convolutional network to focus the learning of feature representations for the fine-grained classification task. Experiments on the popular CUB200 dataset show that our method is state-of-the-art and suggest a continuing role for strong supervision.
CMT Id: 334
Conflicts: eecs.berkeley.edu, snapchat.com