Keywords: compositional zero-shot learning, visual product, multi-task learning
TL;DR: The naive visual-only baseline when trained properly can surpass state-of-the-art performance on compositional zero-shot learning
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize compositions of objects and states in images, and generalize to the unseen compositions of objects and states. Recent works tackled this problem effectively by using side information (e.g., word embeddings) together with either consistency constraints or specific network designs modeling the relationships between objects, states, compositions, and visual features. In this work, we take a step back, and we revisit the simplest baseline for this task, i.e., Visual Product (VisProd). VisProd considers CZSL as a multi-task problem, predicting objects and states separately. Despite its appealing simplicity, this baseline showed low performance in early CZSL studies. Here we identify the two main reasons behind such unimpressive initial results: network capacity and bias on the seen classes. We show that simple modifications to the object and state predictors allow the model to achieve either comparable or superior results w.r.t. the recent state of the art in both the open-world and closed-world CZSL settings on three different benchmarks.