README: Demographic Representation Analysis
===========================================

This package contains scripts for reproducing the demographic estimation and
over-/under-representation analyses described in the paper. All scripts are
self-contained and anonymized for the review process.

----------------------------------------------------------------------
Requirements
----------------------------------------------------------------------

- Python 3.9 or higher
- pandas
- tqdm
- openai  (for LLM-based demographic estimation)

Install with pip if needed:

    pip install pandas tqdm openai

----------------------------------------------------------------------
Files
----------------------------------------------------------------------

1. generate_demographic_estimates.py
   - Reads a .txt file of prompts (one per line).
   - Queries a language model (e.g., GPT-4o) to estimate demographic
     distributions (gender and race) expected for each prompt.
   - Output: CSV file with one row per prompt containing percentage estimates
     for each demographic category.

2. over_underrepresentation_calculation.py
   - Takes as input:
       (a) CSV of LLM demographic estimates (output of step 1)
       (b) CSV of demographic proportions observed in the generated images
           (aggregated from FairFace or a comparable classifier).
   - Merges the two sources and computes under-/over-representation metrics
     for each demographic category.
   - Output: CSV file with aligned results.
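
For orientation, the sketch below shows one simple way such a metric could be
computed for a single prompt. This README does not state the exact formula used
by the script, so the signed difference (observed minus expected, in percentage
points) is an assumption, and the function name `representation_gap` and the
category names are hypothetical.

```python
def representation_gap(expected, observed):
    """Return observed minus expected for each category, in percentage points.

    Positive values mean the category is over-represented relative to the
    expectation; negative values mean it is under-represented.
    (Assumed definition; the script may use a different formula.)
    """
    mismatched = set(expected) ^ set(observed)
    if mismatched:
        raise ValueError(f"categories do not match: {sorted(mismatched)}")
    return {cat: observed[cat] - expected[cat] for cat in expected}


# Hypothetical percentages for one prompt.
expected = {"female": 50.0, "male": 50.0}   # LLM estimate
observed = {"female": 30.0, "male": 70.0}   # classifier aggregate
gap = representation_gap(expected, observed)
print(gap)  # {'female': -20.0, 'male': 20.0}
```

A signed difference keeps the direction of the gap visible. A ratio would be an
alternative, but it is undefined when the expected share is zero.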

----------------------------------------------------------------------
Usage
----------------------------------------------------------------------

Step 1: Generate demographic expectations with the LLM

    python generate_demographic_estimates.py \
        --input_txt prompts.txt \
        --output_csv llm_demographics.csv \
        --delay 1.0

Notes:
- `prompts.txt` should contain one prompt per line.
- `--delay` sets the pause between API calls in seconds (default: 1.0).
- Requires a valid OpenAI API key in the `OPENAI_API_KEY` environment
  variable:
      export OPENAI_API_KEY=your_key_here
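
The sketch below illustrates how a one-prompt-per-line file and a fixed delay
between calls might fit together. It is not the script's actual code: the
function names are hypothetical, the API call is replaced by a stand-in
`fake_estimate` so the example runs offline, and numbering `prompt_index` from
0 is an assumption.

```python
import time


def read_prompts(path):
    """Read one prompt per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def estimate_all(prompts, estimate_fn, delay=1.0, sleep=time.sleep):
    """Call estimate_fn once per prompt, pausing `delay` seconds between calls."""
    rows = []
    for index, prompt in enumerate(prompts):
        if index > 0:
            sleep(delay)  # pause before every call except the first
        row = {"prompt_index": index, "prompt": prompt}
        row.update(estimate_fn(prompt))
        rows.append(row)
    return rows


def fake_estimate(prompt):
    """Stand-in for the LLM query; returns fixed, made-up percentages."""
    return {"female": 50.0, "male": 50.0}


rows = estimate_all(["a doctor", "a nurse"], fake_estimate, delay=0.0)
print(rows[1]["prompt_index"], rows[1]["prompt"])  # 1 a nurse
```

Passing the estimator and the sleep function as arguments keeps the loop
testable without network access or real waiting.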

Step 2: Compute under-/over-representation

    python over_underrepresentation_calculation.py \
        --llm_csv llm_demographics.csv \
        --fairface_csv fairface_aggregated.csv \
        --output_csv under_over.csv

Notes:
- The FairFace CSV must contain demographic proportions for each generated
  image or prompt, aggregated to the same demographic categories as the LLM
  estimates.
- The script aligns both sources by `prompt_index`, so each CSV needs a
  `prompt_index` column.
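
Before running this step, it can help to check that both files cover the same
prompts. The following is a minimal sketch using pandas, assuming both CSVs
carry a `prompt_index` column. The `female` column and the values are made up
for illustration, and the script itself may merge differently.

```python
import pandas as pd

# Small in-memory stand-ins for the two CSV files.
llm = pd.DataFrame({"prompt_index": [0, 1, 2], "female": [50.0, 60.0, 40.0]})
fairface = pd.DataFrame({"prompt_index": [0, 1, 3], "female": [30.0, 65.0, 45.0]})

# An outer merge with indicator=True marks rows found in only one source.
merged = llm.merge(
    fairface,
    on="prompt_index",
    how="outer",
    suffixes=("_llm", "_fairface"),
    indicator=True,
)

# Prompts missing from either file.
unmatched = merged.loc[merged["_merge"] != "both", "prompt_index"].tolist()

# Rows present in both files, ready for metric computation.
aligned = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(sorted(unmatched))
print(aligned)
```

With real data, the two frames would come from `pd.read_csv` on the files
passed to `--llm_csv` and `--fairface_csv`.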

----------------------------------------------------------------------
Outputs
----------------------------------------------------------------------

- llm_demographics.csv       (LLM estimates of expected demographics)
- under_over.csv             (Final under-/over-representation metrics)

----------------------------------------------------------------------
