# ECAM: Enhancing Causal Reasoning in Foundation Models with Endogenous Causal Attention Mechanism - Experimental Code

This repository contains the experimental code for the paper "ECAM: Enhancing Causal Reasoning in Foundation Models with Endogenous Causal Attention Mechanism". It includes the implementation of the ECAM model, scripts for data generation and processing, experiment execution, and results visualization.

## Project Overview

The ECAM model aims to improve the causal reasoning capabilities of foundation models by incorporating an endogenous causal attention mechanism. This mechanism learns a local causal graph from the input context and uses it to modulate the standard attention scores, thereby guiding the model to focus on causally relevant information. The experiments evaluate ECAM's performance on causal discovery, intervention effect estimation, and counterfactual reasoning compared to various baselines.
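
As a rough illustration of the idea, the sketch below biases raw attention scores with a soft adjacency matrix before the softmax. The additive log-bias form, the name `causal_attention_weights`, and the convention that `graph[i][j]` holds the strength of the edge "token j affects token i" are assumptions made for this example; the modulation actually used in `src/ecam.py` may differ.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores, graph, eps=1e-9):
    """Bias raw attention scores with a soft causal graph (hypothetical form).

    scores[i][j]: raw attention score of query i on key j.
    graph[i][j]:  assumed edge strength in [0, 1] for "j causally affects i".
    Adding log(graph + eps) down-weights keys with weak causal support.
    """
    weights = []
    for score_row, graph_row in zip(scores, graph):
        biased = [s + math.log(g + eps) for s, g in zip(score_row, graph_row)]
        weights.append(softmax(biased))
    return weights

# Uniform raw scores: without the graph, every key would get weight 1/3.
scores = [[0.0, 0.0, 0.0]]
graph = [[1.0, 1.0, 0.0]]  # the third token has no causal edge into the query
weights = causal_attention_weights(scores, graph)
print(weights)
```

With this toy input, the attention mass is split between the first two keys and the third key receives almost none.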

## Directory Structure

```
/home/ubuntu/ecam_project/
├── src/                  # Source code for the ECAM model components
│   ├── ecam.py           # Main ECAM module implementation (integrates GraphLearner, Intervention)
│   ├── graph_learner.py  # Causal graph learning component (basic implementation)
│   ├── intervention.py   # Intervention mechanism component (basic implementation)
│   └── counterfactual.py # Counterfactual mechanism component (placeholder)
├── scripts/              # Scripts for data processing, experiments, and visualization
│   ├── generate_synthetic_data.py
│   ├── process_tuebingen_data.py
│   ├── download_glue.py
│   ├── download_clutrr.py
│   ├── download_vqa.py     # (Note: VQA download encountered issues)
│   ├── download_gqa.py
│   ├── run_causal_discovery_exp.py
│   ├── run_intervention_exp.py
│   ├── run_counterfactual_exp.py
│   ├── run_glue_exp.py     # (Note: Encountered memory issues during training)
│   └── visualize_results.py
├── data/                 # Datasets (raw and processed)
│   ├── synthetic/          # Generated synthetic data
│   ├── real_world/       # Downloaded real-world datasets (Tübingen, GLUE, CLUTRR, GQA)
│   └── processed/          # Processed datasets (e.g., Tübingen splits)
├── results/              # Experiment results, plots, and tables
│   ├── causal_discovery/
│   ├── intervention/
│   ├── counterfactual/
│   ├── glue/               # (Note: Incomplete due to memory issues)
│   ├── plots/              # Generated visualizations
│   └── tables/             # Generated summary tables
├── libs/                 # External libraries (e.g., notears)
│   └── notears/
└── README.md             # This file
```

## Dependencies

**Python Libraries:**

Install the required Python libraries using pip:
```bash
pip3 install torch torchvision torchaudio transformers networkx causal-learn scikit-learn matplotlib seaborn pandas datasets evaluate
pip3 install 'accelerate>=0.26.0'
```

**NOTEARS Library:**

The NOTEARS library (used as a causal discovery baseline) must be installed from the locally cloned repository:
```bash
pip3 install --user /home/ubuntu/ecam_project/libs/notears
```

**Other Dependencies:**

Graphviz is required by `causal-learn` for visualization (though not strictly necessary for the core algorithms run here):
```bash
sudo apt-get update && sudo apt-get install -y graphviz
```

## Data Preparation

Scripts in the `scripts/` directory handle data preparation:

1.  **Synthetic Data:** Run `scripts/generate_synthetic_data.py`. Data is generated from the specified graph structures (Erdős-Rényi, ER; scale-free, SF) and structural causal model (SCM) types (linear).
2.  **Tübingen Cause-Effect Pairs:** Downloaded from Zenodo and processed using `scripts/process_tuebingen_data.py`.
3.  **GLUE (MNLI, RTE, QNLI):** Downloaded using `scripts/download_glue.py`.
4.  **CLUTRR:** Downloaded using `scripts/download_clutrr.py`.
5.  **GQA:** Downloaded using `scripts/download_gqa.py`.
6.  **VQA v2.0:** Attempts to download failed (`scripts/download_vqa.py`). GQA was used as an alternative.

Prepared datasets are stored under `/home/ubuntu/ecam_project/data/`.
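
To make the synthetic setting in step 1 concrete, here is a minimal sketch of sampling an ER DAG and drawing data from a linear SCM with Gaussian noise. It is not the code in `scripts/generate_synthetic_data.py`; the function names, the edge-weight range, and the unit-variance noise are assumptions chosen for illustration.

```python
import random

def sample_er_dag(num_nodes, edge_prob, rng):
    """Sample an Erdős-Rényi DAG: edge i -> j is only allowed for i < j,
    so the node order 0..n-1 is a valid topological order."""
    return [[1 if i < j and rng.random() < edge_prob else 0
             for j in range(num_nodes)]
            for i in range(num_nodes)]

def sample_linear_scm(adj, num_samples, rng, weight_range=(0.5, 2.0)):
    """Draw samples from X_j = sum_i W[i][j] * X_i + N_j with N_j ~ N(0, 1)."""
    n = len(adj)
    # Random edge weights with random sign (the range is an assumption).
    weights = [[adj[i][j] * rng.uniform(*weight_range) * rng.choice((-1, 1))
                for j in range(n)]
               for i in range(n)]
    data = []
    for _ in range(num_samples):
        x = [0.0] * n
        for j in range(n):  # all parents of j have smaller indices
            x[j] = sum(weights[i][j] * x[i] for i in range(j)) + rng.gauss(0.0, 1.0)
        data.append(x)
    return weights, data

rng = random.Random(0)  # fixed seed for reproducibility
adj = sample_er_dag(num_nodes=5, edge_prob=0.4, rng=rng)
weights, data = sample_linear_scm(adj, num_samples=100, rng=rng)
print(len(data), "samples of dimension", len(data[0]))
```

Restricting edges to `i < j` guarantees acyclicity by construction, which avoids a separate DAG check on the ground-truth graph.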

## Model Implementation

The core ECAM model and its components (`GraphLearner`, `InterventionModule`) are implemented in the `src/` directory. The `ecam.py` module integrates these components. The `CounterfactualModule` remains a placeholder.

## Running Experiments

Experiment scripts are located in the `scripts/` directory. They can be run using `python3`:

*   `scripts/run_causal_discovery_exp.py`: Runs causal discovery experiments on synthetic and Tübingen data.
    *   *Status:* PC, NOTEARS, and ECAM (basic GraphLearner) ran. GES failed due to an 'Unknown function' error in the `causal-learn` library integration.
*   `scripts/run_intervention_exp.py`: Runs intervention effect estimation experiments on synthetic data.
    *   *Status:* Ran successfully, comparing a regression baseline (using the true graph) with ECAM (using its learned graph plus regression).
*   `scripts/run_counterfactual_exp.py`: Runs counterfactual reasoning experiments on synthetic data.
    *   *Status:* Calculates true counterfactuals. ECAM estimation was implemented using the Abduction-Action-Prediction framework with the learned graph, but often failed because the learned graph contained cycles, resulting in NaN MSE.
*   `scripts/run_glue_exp.py`: Attempts to run experiments on GLUE subsets.
    *   *Status:* Terminated, likely due to memory/resource constraints.

Results are saved in corresponding subdirectories under `/home/ubuntu/ecam_project/results/`.
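
The counterfactual failure noted above (cyclic learned graphs leading to NaN MSE) can be guarded against by checking for a topological order before estimation. The sketch below is one possible way to do this for a linear additive-noise SCM; it is not the code in `scripts/run_counterfactual_exp.py`, and the function names and the weighted-adjacency convention (`weights[i][j] != 0` means i -> j) are assumptions.

```python
from collections import deque

def topological_order(weights):
    """Kahn's algorithm on a weighted adjacency matrix.
    Returns a node order, or None if the graph contains a cycle."""
    n = len(weights)
    indegree = [sum(1 for i in range(n) if weights[i][j] != 0) for j in range(n)]
    queue = deque(j for j in range(n) if indegree[j] == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in range(n):
            if weights[i][j] != 0:
                indegree[j] -= 1
                if indegree[j] == 0:
                    queue.append(j)
    return order if len(order) == n else None

def counterfactual(weights, observed, target, value):
    """Abduction-Action-Prediction for x_j = sum_i w_ij * x_i + n_j."""
    order = topological_order(weights)
    if order is None:
        return None  # cyclic graph: report failure instead of producing NaN
    n = len(weights)
    # Abduction: recover the noise terms from the observed sample.
    noise = [observed[j] - sum(weights[i][j] * observed[i] for i in range(n))
             for j in range(n)]
    # Action and prediction: fix the target, recompute the rest in causal order.
    x_cf = [0.0] * n
    for j in order:
        if j == target:
            x_cf[j] = value
        else:
            x_cf[j] = sum(weights[i][j] * x_cf[i] for i in range(n)) + noise[j]
    return x_cf

# Chain 0 -> 1 -> 2 with edge weights 2.0 and 0.5.
chain = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]
observed = [1.0, 2.5, 1.0]  # implied noise terms: 1.0, 0.5, -0.25
result = counterfactual(chain, observed, target=0, value=3.0)
cyclic = [[0.0, 1.0], [1.0, 0.0]]
print(result, counterfactual(cyclic, [0.0, 0.0], target=0, value=1.0))
```

Returning `None` for cyclic graphs lets the caller count and skip such cases explicitly rather than averaging NaN values into the MSE.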

## Results and Visualization

The `scripts/visualize_results.py` script reads the raw result CSV files and generates plots and summary tables, saving them to `/home/ubuntu/ecam_project/results/plots/` and `/home/ubuntu/ecam_project/results/tables/` respectively.

Generated outputs include:
*   Causal discovery performance plots (SHD, F1) comparing models on synthetic data.
*   Causal discovery summary table comparing models on Tübingen data.
*   Intervention estimation MSE distribution plot (boxplot) comparing Regression and ECAM.
*   Intervention estimation scatter plots (Predicted vs. True) for Regression and ECAM.
*   Counterfactual reasoning summary table (statistics of true values).
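
For reference, the two causal discovery metrics named above can be computed from binary adjacency matrices as sketched below. SHD conventions vary; this example counts a missing, extra, or reversed edge as one error each, which is an assumption and may not match the convention used in the experiment scripts. The function names are likewise illustrative.

```python
def shd(true_adj, pred_adj):
    """Structural Hamming distance between two directed graphs.
    Each node pair whose edge status differs (missing, extra, or
    reversed edge) contributes one error. Self-loops are ignored."""
    n = len(true_adj)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (true_adj[i][j], true_adj[j][i])
            pred_pair = (pred_adj[i][j], pred_adj[j][i])
            if true_pair != pred_pair:
                distance += 1
    return distance

def edge_f1(true_adj, pred_adj):
    """F1 score over directed edges (a reversed edge is not a true positive)."""
    n = len(true_adj)
    tp = sum(1 for i in range(n) for j in range(n)
             if true_adj[i][j] and pred_adj[i][j])
    n_pred = sum(1 for row in pred_adj for v in row if v)
    n_true = sum(1 for row in true_adj for v in row if v)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_true
    return 2 * precision * recall / (precision + recall)

# True graph: 0 -> 1 -> 2. Prediction: 0 -> 1, 2 -> 1 (reversed), 0 -> 2 (extra).
true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
pred_adj = [[0, 1, 1], [0, 0, 0], [0, 1, 0]]
print(shd(true_adj, pred_adj), edge_f1(true_adj, pred_adj))
```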

## Usage

1.  Set up the environment and install dependencies as described above.
2.  Run the data preparation scripts (or ensure data exists in the `data/` directory).
3.  Run the desired experiment scripts from the `scripts/` directory.
4.  Run the `scripts/visualize_results.py` script to generate plots and tables from the results.
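
The steps above could be chained with a small driver like the following sketch. The script names come from the directory listing; running them without arguments, in this order, and from the project root are assumptions, as are the helper names `build_commands` and `run_pipeline`. The dataset download scripts and `run_glue_exp.py` are left out here.

```python
import subprocess

# Script names are taken from the directory listing; the order and the
# absence of command-line arguments are assumptions.
PIPELINE = [
    "scripts/generate_synthetic_data.py",
    "scripts/process_tuebingen_data.py",
    "scripts/run_causal_discovery_exp.py",
    "scripts/run_intervention_exp.py",
    "scripts/run_counterfactual_exp.py",
    "scripts/visualize_results.py",
]

def build_commands(scripts=PIPELINE, python="python3"):
    """Turn script paths into argument lists suitable for subprocess.run."""
    return [[python, script] for script in scripts]

def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step

run_pipeline(dry_run=True)
```

Call `run_pipeline(dry_run=False)` from the project root to actually execute the scripts.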

*Disclaimer: The ECAM model implementation is a basic version.*

