Abstract: Medical image-based diagnosis often requires classifying images at the sub-class level, which is essentially a fine-grained visual classification (FGVC) problem. Surprisingly, few prior works have approached this problem from the perspective of FGVC. Motivated by this gap, we present an FGVC method that improves classification performance for otitis media diagnosis from endoscopic tympanic membrane images. The proposed method is weakly supervised: it takes only image-level class labels as input and does not require expensive part annotations. An image-level convolutional neural network (CNN) is trained first and used to generate saliency maps. The saliency maps are then used to localize discriminative local patches, on which a second, patch-level CNN is trained. Finally, the image-level and patch-level CNNs are integrated to boost performance. Experiments on real clinical data demonstrate that the proposed method achieves promising performance.
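
The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how patches are selected from the saliency map or how the two CNNs are integrated, so the greedy peak picking in `top_patches`, the weighted averaging in `fuse`, and the parameters `patch`, `k`, and `weight` are all hypothetical choices made for illustration. The saliency map and class probabilities are assumed to come from already trained networks.

```python
import numpy as np


def top_patches(saliency, patch=4, k=2):
    """Greedily pick k square boxes (top, left, bottom, right) around saliency peaks.

    Assumption: the patch side length is no larger than either map dimension.
    """
    s = saliency.astype(float).copy()
    h, w = s.shape
    boxes = []
    for _ in range(k):
        # Locate the strongest remaining saliency response.
        r, c = np.unravel_index(np.argmax(s), s.shape)
        # Center a box on the peak, clamped so it stays inside the map.
        top = int(min(max(r - patch // 2, 0), h - patch))
        left = int(min(max(c - patch // 2, 0), w - patch))
        boxes.append((top, left, top + patch, left + patch))
        # Suppress the chosen region so the next peak lies elsewhere.
        s[top:top + patch, left:left + patch] = -np.inf
    return boxes


def fuse(image_probs, patch_probs, weight=0.5):
    """Combine image-level and mean patch-level class probabilities.

    Assumption: a simple weighted average stands in for the integration step.
    """
    image_probs = np.asarray(image_probs, dtype=float)
    patch_mean = np.mean(np.asarray(patch_probs, dtype=float), axis=0)
    fused = weight * image_probs + (1.0 - weight) * patch_mean
    return fused / fused.sum()


# Toy saliency map with two peaks (hypothetical values).
saliency = np.zeros((8, 8))
saliency[1, 1] = 1.0
saliency[6, 6] = 0.8
boxes = top_patches(saliency, patch=4, k=2)

# Hypothetical class probabilities from the image-level and patch-level CNNs.
fused = fuse([0.6, 0.4], [[0.2, 0.8], [0.4, 0.6]])
```

In this toy case the image-level prediction alone favors the first class, while the patch-level evidence shifts the fused decision to the second, which is the kind of correction the integration step is meant to provide.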