Learning What Matters by Imagining What Doesn't

Siqi Li

Published: 30 Mar 2026, Last Modified: 06 May 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: A central obstacle in generalized category discovery (GCD) is that deep networks latch onto any statistical regularity that separates the data, not necessarily the one that corresponds to object identity. Images are rich composites in which category-relevant structure coexists with incidental variation in viewpoint, occlusion, and scene layout. Standard discovery pipelines absorb this entire mélange into a single feature vector, granting equal representational standing to the object and its backdrop. We challenge this practice by positing a causal mediation framework in which foreground semantics transmit the true discovery signal while presentation variables operate as background confounders. The technical core of our approach is a two-phase factorization pipeline: an initial unsupervised phase carves unlabeled images into a shared vocabulary of semantic atoms under consistency pressures that discourage fragmentation, and a subsequent refinement phase tunes these atoms for reconstruction fidelity without shifting their learned semantics. Given the resulting foreground-background separation, we operationalize counterfactual reasoning by editing out background contexts and perturbing geometric parameters in the original images, generating intervened views in which only the causal content persists unchanged. These views serve as an auxiliary curriculum that penalizes reliance on non-causal features during clustering. We evaluate the proposed module by retrofitting it into multiple representative GCD architectures, observing consistent performance gains on established benchmarks ranging from coarse- to fine-grained. The results suggest that even a modest causal structuring of the representation space yields outsized benefits when the task demands generalization to categories never seen during supervised training.
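The idea of intervened views can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify how backgrounds are edited, which geometric parameters are perturbed, or how the penalty is defined. The sketch assumes a foreground mask is already available, stands in a constant fill for background editing and a horizontal flip for the geometric perturbation, and uses a mean squared feature gap as the penalty. All function names (`intervene`, `foreground_mean`, `global_mean`, `invariance_penalty`) are hypothetical.

```python
import random

def intervene(image, fg_mask, rng, fills=(0.0, 0.5, 1.0)):
    """Build one intervened view of a grayscale image (list of rows).

    Assumption: background editing is a constant fill and the geometric
    perturbation is a random horizontal flip; both are illustrative stand-ins.
    """
    fill = rng.choice(fills)
    # Keep foreground pixels, overwrite background pixels with the fill value.
    view = [[px if m else fill for px, m in zip(row, mrow)]
            for row, mrow in zip(image, fg_mask)]
    mask = [list(mrow) for mrow in fg_mask]
    if rng.random() < 0.5:
        # Flip image and mask together so they stay aligned.
        view = [row[::-1] for row in view]
        mask = [mrow[::-1] for mrow in mask]
    return view, mask

def foreground_mean(image, fg_mask):
    """Toy feature that reads only foreground pixels (mask must be non-empty)."""
    vals = [px for row, mrow in zip(image, fg_mask)
            for px, m in zip(row, mrow) if m]
    return sum(vals) / len(vals)

def global_mean(image, fg_mask):
    """Toy feature that mixes foreground and background; the mask is ignored."""
    vals = [px for row in image for px in row]
    return sum(vals) / len(vals)

def invariance_penalty(feature_fn, image, fg_mask, views):
    """Mean squared gap between the original's feature and each view's feature."""
    ref = feature_fn(image, fg_mask)
    return sum((feature_fn(v, m) - ref) ** 2 for v, m in views) / len(views)

rng = random.Random(0)
image = [[0.9, 0.1], [0.8, 0.2]]   # left column: object, right column: backdrop
fg_mask = [[1, 0], [1, 0]]
views = [intervene(image, fg_mask, rng) for _ in range(8)]

print("foreground-only feature penalty:",
      invariance_penalty(foreground_mean, image, fg_mask, views))
print("background-mixing feature penalty:",
      invariance_penalty(global_mean, image, fg_mask, views))
```

In this toy setting the foreground-only feature is unaffected by the interventions, while the feature that averages over the whole image shifts whenever the background is replaced, which is the kind of dependence the auxiliary views are meant to penalize.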