# CFGNN: Causal Fair Graph Neural Networks

This repository contains the implementation of Causal Fair Graph Neural Networks (CFGNN), an approach that uses causal interventions to learn fair representations in graph-based machine learning models.

## Installation

### Conda Environment Setup
We provide a conda environment file `env.yml` that contains all the necessary dependencies. To set up the environment:

```bash
# Create and activate conda environment
conda env create -f env.yml
conda activate CFGNN
```

## Usage

The training pipeline consists of three main steps. Run them in order, since the CVAE step loads pre-trained GNN weights from `weights/gnn.pt`:

1. Train the base GNN model:
```bash
python model_training/train_GNN.py
```

2. Train the conditional VAE:
```bash
python model_training/train_CVAE.py
```

3. Train the fair classifier:
```bash
python model_training/train_cd.py
```
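
If you prefer a single entry point, the three stages can be chained from a small driver script. The following is a minimal sketch that runs them in the order listed above and stops at the first failure; `pipeline_commands` and `run_pipeline` are hypothetical helpers and are not part of this repository.

```python
import subprocess
import sys

# Training stages in the order given above.
STAGES = [
    "model_training/train_GNN.py",
    "model_training/train_CVAE.py",
    "model_training/train_cd.py",
]


def pipeline_commands(extra_args=()):
    """Build one command per stage, forwarding any shared CLI arguments."""
    return [[sys.executable, script, *extra_args] for script in STAGES]


def run_pipeline(extra_args=()):
    """Run each stage in order; check=True raises on the first failing stage."""
    for cmd in pipeline_commands(extra_args):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(["--dataset", "credit"])` from the repository root would pass the same dataset flag to every stage.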

For baseline comparisons, use:
```bash
python model_training/train_fairgnn.py  # FairGNN baseline
python model_training/train_gear.py     # GEAR baseline
```

## Project Structure

```
├── args.py                 # Command line arguments
├── causaldifference.py     # Causal difference implementation
├── intervention.py         # Intervention mechanisms
├── riskdifference.py       # Risk difference metrics
├── baselines/              # Baseline implementations
│   ├── FairGNN.py
│   └── GEAR.py
├── datasets/               # Dataset loaders
│   ├── adult.py
│   ├── credit.py
│   └── german_real.py
├── model_training/         # Training scripts
│   ├── train_cd.py
│   ├── train_CVAE.py
│   └── train_GNN.py
├── models/                 # Model architectures
│   └── cvae.py
└── util/                   # Utility functions
    ├── data_util.py
    ├── general_util.py
    └── mapping_function.py
```

## Implementation Details

### Model Architecture

#### Conditional VAE (CVAE)
The CVAE architecture consists of:
- An encoder network that maps input features to a latent space
- A decoder network that reconstructs the input from the latent representation
- A Graph Convolutional Network (GCN) for processing sensitive attributes
- A latent space whose dimensionality is set with `--latent_dim` (default 16)
- Hidden layers of sizes [64, 32] with ReLU activation

#### Causal Intervention
The framework implements causal intervention through the `Intervention` class, which performs counterfactual reasoning on sensitive attributes. This allows the model to understand and mitigate the causal effect of sensitive features on predictions.

The intervention mechanism:
1. Generates counterfactual samples by manipulating sensitive attributes
2. Measures the causal effect through differences in model predictions
3. Uses this information to enforce fairness constraints during training
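
The three steps above can be illustrated with a framework-free toy example. This is only a sketch: `toy_predict`, the feature layout, and the helper names are hypothetical stand-ins, and the repository's `Intervention` class operates on graph data and may differ substantially.

```python
def intervene(features, sensitive_index, value):
    """Return a copy of `features` with the sensitive attribute forced to `value`."""
    counterfactual = list(features)
    counterfactual[sensitive_index] = value
    return counterfactual


def causal_effect(predict, samples, sensitive_index):
    """Average difference in predictions between do(S=1) and do(S=0)."""
    diffs = [
        predict(intervene(x, sensitive_index, 1))
        - predict(intervene(x, sensitive_index, 0))
        for x in samples
    ]
    return sum(diffs) / len(diffs)


def toy_predict(x):
    """Hypothetical scorer that leaks the sensitive attribute stored at index 0."""
    return 0.5 * x[0] + 0.25 * x[1]


samples = [[0, 1.0], [1, 2.0], [0, 0.0]]
effect = causal_effect(toy_predict, samples, sensitive_index=0)
penalty = abs(effect)  # one possible fairness penalty to add to the training loss
print(effect)  # prints 0.5
```

A perfectly fair predictor in this sense would give an effect of zero, because flipping the sensitive attribute would leave every prediction unchanged.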

### Training Configuration

#### CVAE Training
- Optimizer: Adam (lr=1e-2)
- Learning rate scheduler: CosineAnnealingLR
- Loss function: Combination of reconstruction loss (MSE) and KL divergence
- Training visualization: Loss curves saved as `loss_training.png`
- Pre-trained GNN weights loaded from `weights/gnn.pt`
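
For reference, the combined objective listed above has the usual VAE form. The sketch below computes it for a single sample in plain Python, assuming a diagonal Gaussian posterior parameterized by `mu` and `log_var` and an unweighted sum of the two terms; the weighting and reduction actually used in `train_CVAE.py` may differ.

```python
import math


def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))


def cvae_loss(x, x_hat, mu, log_var):
    """Reconstruction loss plus KL divergence (equal weights assumed)."""
    return mse(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

The KL term is zero when the posterior equals the standard normal prior (`mu = 0`, `log_var = 0`) and grows as the encoder moves away from it.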

#### Fair Classifier Training
- Optimizer: Adam (lr=1e-2)
- Learning rate scheduler: CosineAnnealingLR
- Loss function: BCE + λ * fairness_loss, where λ is set with `--lf`
- Best model saved based on validation accuracy
- Training epochs: 100
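
Both training stages use `CosineAnnealingLR`. When stepped once per epoch from the start of training, PyTorch's scheduler follows a closed-form cosine curve, which the sketch below evaluates in plain Python. The values `t_max=100` and `eta_min=0.0` are assumptions for illustration; this README states only the initial rate of 1e-2 and, for the classifier, 100 epochs.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-2, t_max=100, eta_min=0.0):
    """Learning rate at `epoch` under cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# Start, midpoint, and end of the schedule.
for epoch in (0, 50, 100):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at `base_lr`, passes through roughly half of it at the midpoint, and decays to `eta_min` at `t_max`.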

### Configuration Options

The framework is configured through command-line arguments, which are defined in `args.py`:

#### Model Architecture
- `--latent_dim` (default=16): Dimension of CVAE latent space
- `--hidden_channels` (default=8): GNN hidden layer dimensions
- `--num_layers` (default=1): Number of GNN layers
- `--A` (default=5): Dimension of graph attention vectors
- `--input_size` (default=31): Dimension of input features

#### Training Parameters
- `--cvae_num_epoch` (default=800): Number of CVAE training epochs
- `--lf`: Fairness loss weight (λ)
- `--seed` (default=110): Random seed for training
- `--gtseed` (default=117): Random seed for ground truth GNN
- `--protect` (default=1): Index of disadvantaged group

#### Dataset Configuration
- `--dataset` (default='credit'): Name of the dataset
- `--root` (default='data/credit'): Directory of data root
- `--filename`: Name of data file
- `--mapping_function` (default='LinearMapping'): Mapping function from features to labels
- `--k` (default=101): k-neighbors for KNN graph construction

#### Synthetic Data Generation
- `--gen_A` (default=1): Dimension of A in synthetic dataset
- `--scale` (default=0.1): Noise scale for generated data
- `--coff_A` (default=1.0): Coefficient for attention mechanism

Example usage:
```bash
python model_training/train_CVAE.py --dataset credit --latent_dim 32 --cvae_num_epoch 1000 --lf 0.1
```

You can also specify configurations through a config file:
```bash
python model_training/train_CVAE.py --configs path/to/config.json
```
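
The format of the config file is not documented here. The sketch below shows one plausible behavior, assuming the JSON file holds a flat mapping from argument names to values that replace the defaults while explicit command-line flags still take precedence. The parser is a hypothetical stand-in for `args.py` and covers only a few of the options listed above.

```python
import argparse
import json


def build_parser():
    """Hypothetical subset of the options, with the defaults listed above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--configs", default=None, help="Path to a JSON config file")
    parser.add_argument("--dataset", default="credit")
    parser.add_argument("--latent_dim", type=int, default=16)
    parser.add_argument("--cvae_num_epoch", type=int, default=800)
    return parser


def apply_config(parser, config):
    """Use config values as new defaults; CLI flags still override them."""
    parser.set_defaults(**config)
    return parser


def parse_args(argv=None):
    parser = build_parser()
    # First pass only to discover --configs; unknown flags are ignored here.
    known, _ = parser.parse_known_args(argv)
    if known.configs:
        with open(known.configs) as f:
            apply_config(parser, json.load(f))
    return parser.parse_args(argv)
```

Under these assumptions, a file containing `{"dataset": "adult", "latent_dim": 32}` would change both defaults, and adding `--dataset credit` on the command line would override the first one.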

### Fair Classifier
The fair classifier is trained using a combination of classification loss and fairness constraints:
- Classification Loss: Binary Cross Entropy
- Fairness Loss: Causal Effect Difference
- Combined Loss: L = L_ce + λ * L_fair
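
A small numeric sketch of this objective in plain Python. The fairness term is passed in as a precomputed number because its exact definition is presumably implemented in `causaldifference.py`; the function names and the example value of λ are illustrative assumptions.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean BCE over predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def combined_loss(probs, labels, fairness_loss, lam):
    """L = L_ce + lambda * L_fair."""
    return binary_cross_entropy(probs, labels) + lam * fairness_loss


# Example: two uninformative predictions and a fairness penalty of 2.0.
print(combined_loss([0.5, 0.5], [1, 0], fairness_loss=2.0, lam=0.1))
```

Setting `lam=0` recovers plain classification training, while larger values trade accuracy for a smaller fairness penalty.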

### Evaluation Metrics
- Risk Difference (RD)
- Causal Difference (CD)
- Classification Accuracy
- Group Fairness Metrics
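
As an illustration of the first metric, risk difference is commonly defined as the gap in positive prediction rates between the two sensitive groups. The sketch below uses that common definition in plain Python; `riskdifference.py` may use a different sign convention or estimator, and the function name and the `protected` parameter are hypothetical.

```python
def risk_difference(predictions, sensitive, protected=1):
    """P(y_hat = 1 | S != protected) - P(y_hat = 1 | S == protected)."""
    prot = [p for p, s in zip(predictions, sensitive) if s == protected]
    rest = [p for p, s in zip(predictions, sensitive) if s != protected]
    if not prot or not rest:
        raise ValueError("both groups must be non-empty")
    return sum(rest) / len(rest) - sum(prot) / len(prot)


# Binary predictions and group membership for six toy samples.
preds = [1, 1, 0, 1, 0, 0]
sens = [0, 0, 0, 1, 1, 1]
print(risk_difference(preds, sens))
```

A value of zero means both groups receive positive predictions at the same rate; with this sign convention, a positive value means the protected group is predicted positive less often.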

## Datasets

The framework supports multiple datasets:

### Credit Default Dataset (Default)
- Features:
  * Numerical: Payment history, bill amounts, payment amounts, age
  * Categorical: Education (one-hot), marriage status (one-hot)
  * Sensitive Attribute: Gender (SEX)
  * Target Variable: Default payment next month
- Preprocessing:
  * Categorical features: One-hot encoded
  * Numerical features: Standardized using StandardScaler
  * Graph Construction: KNN graph with k=101 neighbors
  * Train/Test Split: 80%/20% random split
  * Data Location: `data/credit/raw/UCI_Credit_Card.csv`
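
The standardization and graph-construction steps can be sketched without any dependencies. This is a brute-force illustration on a tiny toy matrix with `k=2`; the repository uses `StandardScaler` and k=101, and the distance metric and edge direction are not stated here, so Euclidean distance and directed edges are assumptions.

```python
import math


def standardize(rows):
    """Column-wise zero mean and unit variance (population std, like StandardScaler)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Constant columns get a scale of 1.0 to avoid division by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_edges(rows, k):
    """Directed edges (i, j) from each node i to its k nearest neighbours."""
    edges = []
    for i, a in enumerate(rows):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(rows) if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


features = [[0.0, 10.0], [1.0, 10.0], [2.0, 10.0], [10.0, 10.0]]
x = standardize(features)
edges = knn_edges(x, k=2)
print(edges)
```

Each node contributes exactly `k` outgoing edges, so the toy graph has 8 edges. For datasets the size of the credit data, a spatial index or a library routine would be preferable to this quadratic loop.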

### Adult Income Dataset
- Binary classification task for income prediction
- Features include education, occupation, age, etc.
- Sensitive attribute: Gender
- Target: Income >50K or ≤50K

### German Credit Dataset
- Credit risk assessment task
- Features include credit history, employment, housing, etc.
- Sensitive attribute: Age
- Target: Good/Bad credit risk

Preprocessing for each dataset covers:
- Proper handling of categorical and numerical features
- Feature standardization for consistent scales
- Graph structure creation using KNN
- Fair representation of sensitive attributes
- Removal of isolated nodes

## License

This project is licensed under the GNU General Public License v3.0 - see below for details:

```
Copyright (C) 2024

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
```

For the full license text, please see the [GNU GPL v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html) website.
