# Datasets

This folder documents the datasets and concept definitions used in the experiments across four modalities: image, tabular, text, and multimodal data. Each modality has a matching loader in this package.

## Experimental protocol (summary)

For each run, we vary the number of random examples N from 10 to 300. At each N, we train s=50 separate CAVs, each on N examples sampled with replacement from a large pool (1,000 samples for images; up to 50,000 for text). We then compute the trace of the covariance matrix of the 50 CAVs, which is the sum of their per-dimension variances. We repeat the experiment r=10 times and report the mean and standard deviation across runs.
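
The protocol above can be sketched as follows. This is a minimal illustration, not the package's implementation: it uses only the Python standard library, substitutes a normalized difference of means for a trained linear classifier, and assumes that only the random set is resampled while the concept set stays fixed. The names `mean_difference_cav`, `cav_trace_variance`, and `repeat_runs` are hypothetical.

```python
import random
from statistics import fmean, stdev, variance


def mean_difference_cav(concept, randoms):
    """Unit vector pointing from the random-set mean to the concept-set mean.

    Assumption: a stand-in for the normal of a trained linear classifier.
    """
    dim = len(concept[0])
    diff = [
        fmean([x[d] for x in concept]) - fmean([x[d] for x in randoms])
        for d in range(dim)
    ]
    norm = sum(v * v for v in diff) ** 0.5
    return [v / norm for v in diff]


def cav_trace_variance(concept, random_pool, n, s=50, rng=None):
    """Trace of the covariance matrix of s CAVs, each using n random examples."""
    rng = rng or random.Random(0)
    cavs = []
    for _ in range(s):
        # Sample n random examples with replacement from the pool.
        randoms = rng.choices(random_pool, k=n)
        cavs.append(mean_difference_cav(concept, randoms))
    dim = len(cavs[0])
    # The trace of a covariance matrix is the sum of per-coordinate variances.
    return sum(variance([c[d] for c in cavs]) for d in range(dim))


def repeat_runs(concept, random_pool, n, r=10, s=50, seed=0):
    """Mean and standard deviation of the trace over r independent runs."""
    traces = [
        cav_trace_variance(concept, random_pool, n, s, random.Random(seed + i))
        for i in range(r)
    ]
    return fmean(traces), stdev(traces)
```

Calling `repeat_runs` for each N in the sweep, on fixed activation pools, would give one mean and standard deviation per N.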

## Image data

We follow the standard TCAV setup: ImageNet for target classes with concept definitions sourced from Broden. The main experiments use a pre-trained ResNet-50 (layers `layer2` and `layer3`), with additional runs on GoogLeNet, MobileNetV3, and ViT-B/16 showing comparable variance scaling.

Place ImageNet targets, Broden concepts, and random sets under:

- `TCAV_Images/target`
- `TCAV_Images/concepts`
- `TCAV_Images/random`
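
Before running the image experiments, it may help to confirm that the folders listed above exist. The helper below is a small sketch using `pathlib`; the name `missing_subdirs` is hypothetical, and the check covers only the three top-level folders, not their contents.

```python
from pathlib import Path

# Subfolders listed above; the contents of each are not checked here.
EXPECTED_SUBDIRS = ("target", "concepts", "random")


def missing_subdirs(root):
    """Return the expected subfolders that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

For example, `missing_subdirs("TCAV_Images")` returns an empty list once all three folders are in place.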

## Tabular data

We adapt TCAV to the UCI Adult income dataset to show the method applies beyond vision. Concepts are defined directly from the `sex` attribute: `male` and `female`. The model is a two-layer feed-forward network trained to predict whether income exceeds $50,000, and we extract CAVs from both hidden layers.
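
Defining concepts from an attribute amounts to partitioning the rows by that attribute's value. The sketch below assumes the rows have already been loaded as dictionaries keyed by column name (for instance with `csv.DictReader`) and that raw values may carry stray whitespace or capitalization; the function name `split_by_attribute` is hypothetical and is not the package's loader.

```python
from collections import defaultdict


def split_by_attribute(rows, attribute):
    """Group rows (dicts) by one attribute, ignoring case and padding."""
    groups = defaultdict(list)
    for row in rows:
        # Normalize values such as " Male" and "Male" to the same key.
        key = row[attribute].strip().lower()
        groups[key].append(row)
    return dict(groups)
```

With the `sex` attribute, the resulting groups play the role of the `male` and `female` concept sets.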

## Text data

We use the IMDB sentiment dataset with a pre-trained text classifier (as in the Captum notebooks). Concepts are defined by hand-picked sets of positive, negative, and neutral adjectives. We extract activations from the convolutional layers `convs.1` and `convs.2`.

The text data is provided in this repo under `TCAV_Text`:

- `TCAV_Text/target`
- `TCAV_Text/concepts`
- `TCAV_Text/random`

## Multimodal data

We evaluate vision-language models using CLIP. We pair zebra images with the prompt "a photo of a zebra" and use image-text similarity as the scalar output. Concepts are defined by image sets as in the vision experiments, and we extract CAVs at layers 4, 8, and 12.
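
As a rough illustration of the scalar output, the sketch below computes the cosine similarity between an image embedding and a text embedding, which is how CLIP compares the two. It assumes the embeddings have already been produced by the model; whether a learned temperature is applied before the score is used is not stated here, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Because this score is a single number per image-prompt pair, it can stand in for the class logit that TCAV differentiates in the vision-only setting.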

## Extending datasets and concepts

You can add more target classes or concepts by extending the folder structure above. For downloading or organizing concept datasets, see Kim et al.'s TCAV implementation on GitHub.

## Data availability

We provide the text data. Image data must be downloaded from ImageNet and Broden and placed in the corresponding folders under `TCAV_Images`.
