Content Combination Strategies for Image Classification (Stratégies de combinaison de contenus pour la classification d'images)

Abstract: In this thesis, we tackle the question of deep image classification, a fundamental problem for computer vision and visual understanding in general. We look into the common practice of engineering new examples to augment the dataset, and take it as an opportunity to teach neural networks to reconcile information mixed from different samples through Mixing Sample Data Augmentation, so as to better understand the problem. To this end, we study both how to combine the content of the samples in a mixed image, and what the model should predict for the mixed images. We first propose a new type of data augmentation that helps models generalize by embedding the semantic content of samples into the non-semantic context of other samples to generate in-class mixed samples. We design new neural architectures capable of generating such mixed samples, and then show that the resulting mixed inputs help train stronger classifiers in a semi-supervised setting where few labeled samples are available. In a second part, we show that input mixing can be used as an input compression method to train multiple subnetworks inside a base network from compressed inputs. Indeed, by formalizing the seminal multi-input multi-output (MIMO) framework as a mixing data augmentation and changing the underlying mixing mechanisms, we obtain strong improvements over both standard models and MIMO models. Finally, we adapt this MIMO technique to the emerging Vision Transformer (ViT) models. Our work shows that ViTs present unique challenges for MIMO training, but also that they are uniquely suited for it.
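The MIMO-as-mixing idea described above can be illustrated with a minimal NumPy sketch. Here we assume a MixUp-style linear blend as the mixing mechanism (the thesis studies several mechanisms; this is only one instance), and all function names are illustrative rather than taken from the thesis:

```python
import numpy as np

def mix_inputs(x1, x2, lam=0.5):
    # Linear (MixUp-style) blend: a single mixed image encodes two samples.
    # lam controls the share of each sample in the mixture.
    return lam * x1 + (1.0 - lam) * x2

def mimo_batch(images, labels, rng):
    # Pair each sample with a randomly shuffled partner, then compress the
    # pair into one mixed input. In the MIMO framework, each subnetwork of
    # the base network is trained to recover one of the two labels from
    # this single mixed input.
    perm = rng.permutation(len(images))
    mixed = mix_inputs(images, images[perm])
    return mixed, (labels, labels[perm])

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))       # toy batch of 8 RGB images
labels = rng.integers(0, 10, size=8)      # toy class labels
mixed, (y_a, y_b) = mimo_batch(images, labels, rng)
```

A training step would then feed `mixed` through the shared base network and apply one classification loss per output head, one against `y_a` and one against `y_b`.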