# CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction

## Overview

CausalStock addresses two main challenges in news-driven multi-stock movement prediction: discovering unidirectional causal relations among stocks and extracting effective information from noisy news data. It integrates a lag-dependent temporal causal discovery mechanism and a denoised news encoder leveraging large language models (LLMs) to predict stock movements accurately.

## Data

All data files required for running the experiments are stored in the `data` directory. This includes both raw and processed datasets necessary for training and evaluation.

## Environment Dependencies

### Poetry

We use Poetry to manage the project dependencies, which are specified in the `pyproject.toml` file. To install Poetry, run:

```sh
curl -sSL https://install.python-poetry.org | python3 -
```

To install the environment, run:

```sh
poetry install
```

This creates a virtual environment. Activate it with `poetry shell`, or run a single command inside it with `poetry run {command}`. You can also activate and use it like any other Python virtual environment.

## Running Instructions

### Training the Model

To train the CausalStock model on the provided dataset, use the following command:

```sh
python -m causalstock.run_experiment stock_news_dataset --model_type causalstock_spline --model_config configs/causalstock/true_graph_causalstock_spline.json -dc configs/dataset_config_causal_dataset.json -c -te
```

### Generating Data

To generate synthetic data for experiments, use the following command:

```sh
python -m causalstock.data_generation.generate_synthetic_data
```

### Evaluating the Model

After training, evaluate causal discovery and (C)ATE estimation performance with the command below, which is the same invocation used for training:

```sh
python -m causalstock.run_experiment stock_news_dataset --model_type causalstock_spline --model_config configs/causalstock/true_graph_causalstock_spline.json -dc configs/dataset_config_causal_dataset.json -c -te
```

## Model Description

CausalStock uses an additive noise structural equation model (ANM-SEM) to capture functional relationships among variables and exogenous noise, simultaneously learning a variational distribution over causal graphs. It combines a denoised news encoder and a lag-dependent temporal causal discovery module to predict stock movements.
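
To make the additive noise idea concrete, here is a minimal sketch of a toy two-variable, lag-1 additive noise model. It is not the CausalStock implementation: the function name `simulate_anm`, the coefficients, and the noise scale are all hypothetical, chosen only to show that each variable is a function of its lagged parents plus exogenous noise.

```python
import math
import random


def simulate_anm(n_steps, seed=0):
    """Simulate a toy lag-1 additive noise model with a single edge A -> B.

    Illustrative only: the functional forms and constants are assumptions,
    not values taken from CausalStock.
    """
    rng = random.Random(seed)
    x_a, x_b = [0.0], [0.0]
    for _ in range(n_steps):
        # Exogenous noise terms, independent of the parents.
        z_a = rng.gauss(0.0, 0.1)
        z_b = rng.gauss(0.0, 0.1)
        # A depends only on its own past value.
        next_a = 0.5 * x_a[-1] + z_a
        # B depends on the lagged value of A and on its own past value.
        next_b = math.tanh(x_a[-1]) + 0.2 * x_b[-1] + z_b
        x_a.append(next_a)
        x_b.append(next_b)
    return x_a, x_b


if __name__ == "__main__":
    series_a, series_b = simulate_anm(5)
    print(series_a)
    print(series_b)
```

In this toy setup the causal relation is unidirectional: the past of A influences B, but B never influences A.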

### Key Components

- **Market Information Encoder (MIE)**: Encodes news text and price features.
- **Lag-dependent Temporal Causal Discovery (Lag-dependent TCD)**: Discovers causal relationships among stocks.
- **Functional Causal Model (FCM)**: Predicts future stock movements based on the discovered causal graph.

## Hyperparameters

### Model Configurations

- `lambda_dag`: Coefficient for the prior term that encourages the learnt graph to be a DAG.
- `lambda_sparse`: Coefficient for the prior term that encourages similarity to the prior matrix W_0, or sparsity if the prior matrix is empty.
- `tau_gumbel`: Temperature for the Gumbel-softmax trick.
- `spline_bins`: Number of bins for spline flow base distribution.
- `layers_imputer`: Number and size of hidden layers for the imputer neural network.
- `var_dist_A_mode`: Variational distribution for the adjacency matrix.
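
As an illustration of what a DAG-enforcing term weighted by `lambda_dag` can look like, the sketch below computes the widely used NOTEARS-style acyclicity measure h(A) = tr(exp(A ∘ A)) - d, which is zero exactly when the weighted graph A has no cycles. This README does not state which penalty CausalStock uses, so treat this as an assumption. The helper names `matmul` and `dag_penalty` are hypothetical, and the matrix exponential is approximated with a truncated power series to keep the example dependency-free.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def dag_penalty(adj, n_terms=20):
    """Approximate h(A) = tr(exp(A * A)) - d, where * is elementwise.

    Uses tr(exp(M)) = sum_k tr(M^k) / k!, truncated after n_terms terms.
    Assumption: this is a generic acyclicity measure, not necessarily
    the exact term used in CausalStock.
    """
    d = len(adj)
    squared = [[adj[i][j] ** 2 for j in range(d)] for i in range(d)]
    # Start from M^0, the identity matrix.
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace_sum = 0.0
    factorial = 1.0
    for k in range(n_terms):
        if k > 0:
            power = matmul(power, squared)  # power is now M^k
            factorial *= k                  # factorial is now k!
        trace_sum += sum(power[i][i] for i in range(d)) / factorial
    return trace_sum - d


if __name__ == "__main__":
    chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # acyclic: 0 -> 1 -> 2
    cycle = [[0, 1], [1, 0]]                   # cyclic: 0 <-> 1
    print(dag_penalty(chain))
    print(dag_penalty(cycle))
```

The acyclic chain gives a penalty of zero, while the two-node cycle gives a strictly positive value, so multiplying this quantity by a positive coefficient pushes the optimizer away from cyclic graphs.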

### Training Parameters

- `learning_rate`: Learning rate for the initial steps of augmented Lagrangian optimization.
- `batch_size`: Size of the training batches.
- `standardize_data_mean`: Whether to center the data.
- `standardize_data_std`: Whether to standardize the data to have unit variance.
- `rho`: Initial rho for the augmented Lagrangian procedure.
- `alpha`: Initial alpha for the augmented Lagrangian procedure.
- `safety_rho`: Maximum allowed value for rho.
- `safety_alpha`: Maximum allowed value for alpha.
- `max_steps_auglag`: Maximum number of augmented Lagrangian steps.
- `max_p_train_dropout`: Maximum probability for data dropout during training for regularization.
- `anneal_entropy`: Method for annealing the entropy term in the ELBO.
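
To show how `rho`, `alpha`, `safety_rho`, and `safety_alpha` typically interact, here is a minimal sketch of a generic augmented Lagrangian schedule for a constraint value `h` (for example, a DAG penalty that should reach zero). It is not taken from the CausalStock code: the function names, `progress_rate`, and `rho_factor` are hypothetical, and treating the safety values as hard caps is an assumption based on the descriptions above.

```python
def auglag_penalty(h, rho, alpha):
    """Penalty added to the training loss: alpha * h + 0.5 * rho * h^2."""
    return alpha * h + 0.5 * rho * h * h


def update_auglag(h, h_prev, rho, alpha, safety_rho, safety_alpha,
                  progress_rate=0.65, rho_factor=10.0):
    """Update (rho, alpha) after one augmented Lagrangian step.

    Assumed generic rule, not the repository's exact schedule:
    - if the constraint did not shrink enough, increase rho;
    - otherwise, take a dual ascent step on alpha.
    Both values are capped by their safety limits.
    """
    if h > progress_rate * h_prev:
        rho = min(rho * rho_factor, safety_rho)
    else:
        alpha = min(alpha + rho * h, safety_alpha)
    return rho, alpha


if __name__ == "__main__":
    rho, alpha = 1.0, 0.0
    h_prev = 1.0
    # Hypothetical constraint values observed after successive steps.
    for h in [0.9, 0.5, 0.45, 0.1]:
        rho, alpha = update_auglag(h, h_prev, rho, alpha, 1e13, 1e13)
        print(h, rho, alpha, auglag_penalty(h, rho, alpha))
        h_prev = h
```

Under this rule `max_steps_auglag` would bound the number of such updates, while `learning_rate` governs the inner optimization that runs between them.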

## Reproducing Results

To reproduce the results from our paper, first generate all required data as described above, then run:

```sh
python -m causalstock.run_experiment stock_news_dataset --model_type causalstock_spline --model_config configs/causalstock/true_graph_causalstock_spline.json -dc configs/dataset_config_causal_dataset.json -c -te
```