Organizing Unstructured Image Collections using Natural Language

Published: 01 Jan 2024, Last Modified: 16 May 2025, Venue: CoRR 2024, License: CC BY-SA 4.0
Abstract: Organizing unstructured visual data into semantic clusters is a key challenge in computer vision. Traditional deep clustering approaches focus on a single partition of data, while multiple clustering (MC) methods address this limitation by uncovering distinct clustering solutions. The rise of large language models (LLMs) and multimodal LLMs has enhanced MC by allowing users to specify clustering criteria in text. However, expecting users to manually define such criteria for large datasets before understanding the data is impractical. In this work, we introduce the task of Open-ended Semantic Multiple Clustering, which aims to automatically discover clustering criteria from large, unstructured image collections, uncovering interpretable substructures without requiring human input. Our framework, X-Cluster: eXploratory Clustering, uses text as a proxy to concurrently reason over large image collections, discover partitioning criteria expressed in natural language, and reveal semantic substructures. To evaluate X-Cluster, we introduce the COCO-4c and Food-4c benchmarks, each containing four grouping criteria and ground-truth annotations. We apply X-Cluster to various real-world applications, such as discovering biases and analyzing social media image popularity, demonstrating its utility as a practical tool for organizing large unstructured image collections and revealing novel insights. We will open-source our code and benchmarks for reproducibility and future research.