# VFedCD

## Abstract
Causal discovery seeks to identify causal relationships among data attributes, typically represented as causal graphs with attributes as vertices and causal relationships as directed edges. In many real-world applications, however, data are vertically partitioned across multiple parties, with each party holding only a subset of the complete attribute set. In these scenarios, aggregating all attributes at a single party is generally prohibited due to privacy concerns, rendering centralized approaches to causal discovery infeasible.

To address this challenge, we propose a vertical federated learning framework for causal discovery, named VFedCD, that enables multiple parties to collaboratively infer causal relationships among their attributes in a distributed manner, without sharing their raw data. Specifically, each party trains an encoder that transforms its local attributes into a set of features, each predicting a certain attribute in the complete attribute set. These features are then transmitted to the parties that hold the corresponding target attributes. Meanwhile, each party also trains a decoder that aggregates the features received from all parties to predict its local attributes. For a given party, all potential effects of its local attributes are identified by examining the parameters of its encoder. To better capture inter-party causal mechanisms, we redesign the conventional architecture and design a Secure Dispatch Protocol (SDP) adapted to it. The SDP combines semi-homomorphic encryption with secret sharing to ensure secure feature interaction and gradient propagation. Additionally, we develop a Centralized Topology Validator (CTV) that aggregates local subgraphs from the parties and enforces global acyclicity constraints, thereby preventing cyclic or overly dense graphs.

Experiments on synthetic and real-world datasets show that VFedCD achieves causal discovery accuracy comparable to centralized methods while providing privacy guarantees, validating its effectiveness in vertical federated scenarios.
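
To make the role of the CTV more concrete, the following is a minimal sketch of what "aggregating local subgraphs and checking global acyclicity" can look like. It is an illustration only, not the repository's implementation: the function names `merge_subgraphs` and `is_acyclic`, the edge-list representation, and the use of Kahn's algorithm are all assumptions made for this example, and the actual validator may enforce acyclicity differently (for instance through a constraint during training).

```python
from collections import deque


def merge_subgraphs(subgraphs):
    """Union the directed edges reported by every party (hypothetical format:
    each party sends a list of (cause, effect) attribute-name pairs)."""
    edges = set()
    for party_edges in subgraphs:
        edges.update(party_edges)
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: a directed graph is acyclic exactly when every vertex
    can be removed in topological order."""
    nodes = {v for edge in edges for v in edge}
    indegree = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    # Start from vertices that have no incoming edge.
    queue = deque(v for v in nodes if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for child in children[v]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Any vertex that could not be removed lies on, or behind, a cycle.
    return removed == len(nodes)


# Illustrative attribute names; party A holds x1, x2 and party B holds y1, y2.
party_a = [("x1", "x2"), ("x2", "y1")]
party_b = [("y1", "y2")]
global_edges = merge_subgraphs([party_a, party_b])
print(is_acyclic(global_edges))
```

Each local subgraph can be acyclic on its own while the union is not, which is why a check of this kind has to run on the merged graph rather than per party.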

## Project Structure

VFedCD/  
├── share_dataset/               `Dataset storage directory`  
│   └── README.md                `Documentation for dataset organization`  
├── src/                         `Core source code`  
│   ├── main_pipeline.py         `Entrypoint for the main pipeline`  
│   ├── configs/                 `Task configuration files (JSON format)`  
│   ├── dataset/                 `Dataset loading and preprocessing modules`  
│   ├── evaluates/               `Evaluation algorithms and metrics`  
│   ├── exp_result/              `Output directory for experiment results`  
│   ├── framework/               `Core federated learning framework components`  
│   ├── load/                    `Module for loading configurations, data, models, etc.`  
│   ├── log/                     `Output directory for log files`  
│   ├── models/                  `Model definitions (e.g., neural networks)`  
│   ├── party/                   `Definitions for participating parties (clients/servers)`  
│   ├── third_party/             `Third-party tools/libraries (e.g., legacy code, utilities)`  
│   ├── utils/                   `Reusable utility functions (e.g., I/O, math helpers)`  
│   └── centralized/             `Implementations of centralized baseline methods`  
└── requirements.txt             `Python environment dependencies`  


**Key Workflow of `main_pipeline.py`**:  
The main pipeline sequentially:  
1. Loads task configurations from `configs/` using modules in `load/`;  
2. Initializes datasets from `dataset/`;  
3. Prepares participating parties (clients/servers) from `party/`;  
4. Loads model architectures from `models/`;  
5. Executes the core algorithm via `evaluates/`;  
6. Saves results to `exp_result/` and logs to `log/`.  


## Quick Start

### 1. Environment Setup
First, create and activate a Python environment:
```bash
conda create -n VFedCD python=3.8
conda activate VFedCD
pip install --upgrade pip
cd VFedCD  # Navigate to the project root
pip install -r requirements.txt
```
### 2. Run VFedCD
#### Step 1: Prepare Configuration Files     
All task configurations are stored as JSON files in `src/configs/`. For details on the configuration format (e.g., hyperparameters, dataset paths), refer to `src/configs/README.md`.
#### Step 2: Execute the Pipeline    
Run the main pipeline with your configuration:
```bash
cd src
python main_pipeline.py --configs my_config  # Replace "my_config" with your config name
```
For a quick test, use the minimal demo configuration:
```bash
python main_pipeline.py --configs demo  # Uses "src/configs/demo.json"
```
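
Before launching a run, it can be handy to confirm that a configuration name resolves to a readable JSON file. The helper below is a small sketch based only on the naming convention shown above (`--configs demo` maps to `src/configs/demo.json`); the function name `resolve_config` is hypothetical, it is not part of the repository, and it makes no assumption about which keys a configuration contains.

```python
import json
from pathlib import Path


def resolve_config(name, config_dir="src/configs"):
    """Map a --configs name such as "demo" to its JSON file and parse it.

    `config_dir` defaults to the path relative to the project root; pass
    "configs" instead when calling this from inside `src/`.
    """
    path = Path(config_dir) / f"{name}.json"
    if not path.is_file():
        raise FileNotFoundError(f"No configuration file at {path}")
    with path.open(encoding="utf-8") as fh:
        return path, json.load(fh)
```

For example, `resolve_config("demo")` run from the project root returns the path to `src/configs/demo.json` together with its parsed contents, and raises `FileNotFoundError` if the name is misspelled.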
#### Step 3: Run Centralized Baselines  
To execute centralized methods (for benchmarking), navigate to the centralized directory from the project root (if you are still in `src/` from Step 2, use `cd centralized` instead) and run:
```bash
cd src/centralized
python benchmark.py --seed 0 --model all --force True --dataset 4
```
Parameters:
- `--seed`: Random seed for reproducibility (e.g., `0`).
- `--model`: Models to run (use `all` to execute all predefined models).
- `--force`: Overwrite existing results if `True`.
- `--dataset`: Dataset name (`4` in the example above).
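
To repeat the baseline run over several seeds, the command above can be wrapped in a short script. This is a sketch under stated assumptions: it uses only the four flags documented here, the helper names `benchmark_command` and `run_sweep` are hypothetical, and the default working directory assumes the script is launched from the project root.

```python
import subprocess
import sys


def benchmark_command(seed, dataset, model="all", force=True):
    """Build the argument list for one benchmark.py run."""
    return [
        sys.executable, "benchmark.py",
        "--seed", str(seed),
        "--model", model,
        "--force", str(force),
        "--dataset", str(dataset),
    ]


def run_sweep(seeds, dataset, cwd="src/centralized"):
    """Run benchmark.py once per seed, stopping at the first failing run."""
    for seed in seeds:
        subprocess.run(benchmark_command(seed, dataset), cwd=cwd, check=True)
```

Calling `run_sweep(seeds=range(3), dataset=4)` would then execute the baseline command for seeds 0, 1, and 2 in turn.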
