# Disentangled Information Quantification for Dataset Construction in Data-Centric AI 

> **DGC** is a model-agnostic dataset construction framework that selects high-utility subsets by maximizing **semantic coverage** in a disentangled latent space. It integrates:
> - **OF-SQAE** (Orthogonal Factorization Soft-Quantization Autoencoder) for stable, interpretable semantic factors  
> - **IAIQ** (Invariant Attribute Information Quantification) for per-sample informativeness  
> - A **coverage-driven greedy selector** to choose representative samples under a fixed budget

<img src="./imgs/DCG-cropped.pdf" alt="DGC paper overview figure" style="zoom:200%;" />

---

## Table of Contents
- [Abstract](#abstract)
- [Repository Structure](#repository-structure)

---

## Abstract

The rise of large-scale models has placed greater emphasis on Data-Centric AI (DCAI), which aims to build high-quality, scalable, and sustainable data assets. However, prevailing dataset construction methods rely on active learning with model-dependent criteria (e.g., uncertainty or confidence), typically anchored to a specific evaluation model. This coupling ties sample selection to architecture- and training-specific behavior, undermining the transferability and reusability of the resulting datasets.

We introduce **Disentangled Generalizable Construction (DGC)**, a model-agnostic data selection method. Under a fixed sample budget, DGC maximizes **semantic coverage** by selecting subsets that faithfully represent a dataset’s key semantic factors, thereby preserving utility across models and training settings. Three modules support it: the **Orthogonal Factorization Soft-Quantization Autoencoder (OF-SQAE)** yields stable, interpretable semantic factors; **Invariant Attribute Information Quantification (IAIQ)** quantifies per-sample informativeness in the latent space; and a **coverage-driven greedy selector** chooses representative samples. Experiments on natural-image datasets, evaluated across diverse model architectures, show strong results, indicating the generality and robustness of DGC.
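
To make the idea of coverage-driven selection under a fixed budget more concrete, here is a minimal sketch of a standard greedy max-coverage loop. It is not the selector implemented in this repository. It assumes each sample has already been reduced to one discrete code per latent factor (a hypothetical stand-in for quantized OF-SQAE factors), and it measures coverage simply as the number of distinct `(factor, code)` pairs hit by the subset. The function name `greedy_coverage_select` and the toy codes are illustrative only, and the sketch does not use IAIQ scores.

```python
from typing import List, Sequence, Set, Tuple


def greedy_coverage_select(
    sample_codes: Sequence[Sequence[int]], budget: int
) -> Tuple[List[int], Set[Tuple[int, int]]]:
    """Greedily pick up to `budget` sample indices that cover the most
    distinct (factor_index, code) pairs.

    Assumption: sample_codes[i][k] is a discrete code of sample i on factor k.
    """
    # Represent each sample as the set of (factor_index, code) pairs it covers.
    pair_sets = [set(enumerate(codes)) for codes in sample_codes]
    covered: Set[Tuple[int, int]] = set()
    selected: List[int] = []
    remaining = set(range(len(pair_sets)))

    while remaining and len(selected) < budget:
        # Marginal gain = number of pairs not covered yet.
        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=lambda i: len(pair_sets[i] - covered))
        if not pair_sets[best] - covered:
            break  # no remaining sample adds new coverage
        selected.append(best)
        covered |= pair_sets[best]
        remaining.remove(best)

    return selected, covered


# Toy data: 5 samples, 2 factors, 2 possible codes per factor (hypothetical).
codes = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0)]
selected, covered = greedy_coverage_select(codes, budget=2)
print(selected, len(covered))
```

In this sketch the loop stops early once no remaining sample adds a new pair, so it may return fewer samples than the budget allows. A real selector could instead keep filling the budget using a secondary criterion such as per-sample informativeness.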

---

## Repository Structure

    source_code/
    ├─ disentangle/
    │  ├─ main.py                    # Train OF-SQAE (disentangled latent factors)
    │  └─ preciseDisentangled.py     # Dimension traversal + IAIQ estimation
    │
    ├─ eval_information/
    │  ├─ verifyImformation.py       # Quantify information of selected subsets (IAIQ on chosen data)
    │  └─ InformationEval.py         # IAIQ–performance correlation/evaluation
    │
    ├─ data_select/
    │  └─ train_main.py              # Generalization eval across backbones using selected subsets
    │
    └─ dataset_select/               # <-- Place constructed subset files here (CSV/JSON/NPY)
    imgs/
    └─ DCG-cropped.pdf               # Paper overview figure (PDF)

> Note: The filename `verifyImformation.py` is intentional (matches the original codebase).

