Data Diversity for Compositional Generalization

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. Readers: Everyone. License: CC BY 4.0
Keywords: diversity, data-centric AI, compositionality
Abstract: Human cognition excels at understanding complex concepts by combining simpler, learned elements, enabling efficient learning and generalization to novel scenarios. Recent work suggests that machine learning models may exhibit a similar capability, generalizing to novel scenarios by first acquiring fundamental components and then recombining them. Data serves as the driving force behind this process, and the diversity of training data plays a crucial role in shaping a model's ability to generalize. In this work, we introduce a framework that disentangles the multifaceted notion of diversity and formalize its impact on model performance and generalization ability from different perspectives. Through both theoretical analysis and empirical validation, we demonstrate that increasing diversity without a principled strategy does not necessarily lead to optimal generalization ability. Instead, a deeper understanding of data diversity is required. Building on this insight, we propose a high-level guideline for dataset design and preparation that facilitates more efficient learning and enables improved generalization to unseen compositions.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 7086