# A Set-Sequence Model for Time Series
![Model Architecture](./assets/set_seq_model_arch.jpg)
# About
In many financial prediction problems, the behavior of individual units (such as loans, bonds, or stocks) is influenced by observable unit-level factors and macroeconomic variables, as well as by latent cross-sectional effects. Traditional approaches attempt to capture these latent effects via handcrafted summary features.
We propose a Set-Sequence model that eliminates the need for handcrafted features. The Set model first learns a shared cross-sectional summary at each period. The Sequence model then ingests the summary-augmented time series for each unit independently to predict its outcome. Both components are learned jointly over arbitrary sets sampled during training.
This repository contains code for the Set-Sequence model and baselines.
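
The data flow described above can be sketched in a few lines of NumPy. This is an illustration only, not the repository's implementation: the function name, the single linear embedding, and the mean pooling over units are all assumptions, chosen because they give a simple permutation-invariant summary.

```python
import numpy as np

def augment_with_set_summary(x, w_embed, w_out):
    """Append a shared cross-sectional summary to every unit's time series.

    x: array of shape (num_units, num_timesteps, num_covariates).
    w_embed, w_out: hypothetical weight matrices (learned in a real model).
    """
    # Embed each unit at each period (one linear map + tanh; an assumption).
    h = np.tanh(x @ w_embed)                 # (units, time, hidden)
    # Pool over units within each period; the mean ignores unit ordering.
    pooled = h.mean(axis=0)                  # (time, hidden)
    summary = pooled @ w_out                 # (time, summary_dim)
    # Give every unit the same per-period summary and concatenate it.
    tiled = np.broadcast_to(summary, (x.shape[0],) + summary.shape)
    return np.concatenate([x, tiled], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 12, 3))              # 5 units, 12 periods, 3 covariates
w_embed = rng.normal(size=(3, 8))
w_out = rng.normal(size=(8, 2))
x_aug = augment_with_set_summary(x, w_embed, w_out)
print(x_aug.shape)  # (5, 12, 5)
```

Each unit's augmented series could then be passed to any per-unit sequence model. Because the summary at a given period depends only on that period's cross-section, the number of units can vary between calls.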
## Paths
Set the `BASE_PATH` environment variable to the path of the repository in your environment.
To find the path, go to the root directory of the repository and run `pwd`, then export the result as `BASE_PATH`:
```
export BASE_PATH="/path/to/set-seq-repo" 
```

Do not include a trailing slash, and keep the double quotes " " around the path.
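
A quick way to catch a misconfigured `BASE_PATH` before training is a small check like the one below. The helper `check_base_path` is hypothetical and not part of the repository; it only tests the two conditions stated above (the variable is set and has no trailing slash).

```python
import os

def check_base_path(value):
    """Return a list of problems found with a BASE_PATH value (empty if none)."""
    if not value:
        return ["BASE_PATH is not set"]
    problems = []
    # A bare "/" is left alone; any longer path must not end with a slash.
    if len(value) > 1 and value.endswith("/"):
        problems.append("BASE_PATH has a trailing slash")
    return problems

# Report problems for the current shell environment, if any.
for problem in check_base_path(os.environ.get("BASE_PATH", "")):
    print("warning:", problem)
```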
## Setup

### Conda Environment

Create the environment with Conda:
```
conda env create -f environment.yml
```
Alternatively, create it with Conda and pip:
```
conda create -n set-seq python=3.9
conda activate set-seq
pip install -r requirements.txt
```

## Weights and Biases Logging

Experiments are logged to [Weights & Biases](https://wandb.ai/site) (wandb).
You will need an account to view the logged performance metrics.

## Quickstart
The following command trains the Set-Sequence model on the synthetic data task with 1000 units, without logging to wandb:
```
python -m train experiment=timeseries/synthetics/synthetics_set_seq.yaml ++wandb.mode=disabled
```
To see the metrics on wandb, remove the `++wandb.mode=disabled` flag.

For testing purposes, the code can be run on CPU with the following command:
```
python -m train experiment=timeseries/synthetics/synthetics_set_seq.yaml ++wandb.mode=disabled ++trainer.accelerator=cpu
```
## Synthetic Data Experiments

### 1000 units, comparison with other sequence models
Running the commands below generates the results for Table 1 in the paper.
The Set-Sequence model outperforms all other models,
with an AUC of around 0.8, compared with 0.76 for the second-best model.


Set-Sequence Model (Ours):
```
python -m train experiment=timeseries/synthetics/synthetics_set_seq.yaml
```

MHA (per unit):
```
python -m train experiment=timeseries/synthetics/synthetics_mha.yaml
```

Hyena (per unit):
```
python -m train experiment=timeseries/synthetics/synthetics_hyena.yaml
```

S4 (per unit):
```
python -m train experiment=timeseries/synthetics/synthetics_s4.yaml
```

Long Conv (per unit):
```
python -m train experiment=timeseries/synthetics/synthetics_lc.yaml
```

H3 (per unit):
```
python -m train experiment=timeseries/synthetics/synthetics_h3.yaml
```

MHA (joint):
```
python -m train experiment=timeseries/synthetics/synthetics_mha_joint.yaml
```

Hyena (joint):
```
python -m train experiment=timeseries/synthetics/synthetics_hyena_joint.yaml
```

S4 (joint):
```
python -m train experiment=timeseries/synthetics/synthetics_s4_joint.yaml
```

Long Conv (joint):
```
python -m train experiment=timeseries/synthetics/synthetics_lc_joint.yaml
```

H3 (joint):
```
python -m train experiment=timeseries/synthetics/synthetics_h3_joint.yaml
```
The test set metrics are directly logged to wandb.

### Training with variable number of units
```
python -m train experiment=timeseries/synthetics/synthetics_set_seq_variable_input.yaml
```

### Model ablations
To obtain the results of the Set-Sequence model ablations (Table 2 in the paper), run the following commands:
```
chmod +x model_ablations_synthetic_data.sh
./model_ablations_synthetic_data.sh
```

### Interpretability
To visualize the set summary and compute its correlation with the true latent factors, run
the following command:

```
python scripts/notebooks/true_loss_level/plot_set_var.py
```
In the script, set the task to synthetic and specify the model checkpoint in the `get_model_config` function.
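
As a rough illustration of the correlation being computed (this is not the code in `plot_set_var.py`; the function name and array layouts are assumptions), the Pearson correlation between each summary dimension and each latent factor over time can be obtained as follows:

```python
import numpy as np

def summary_factor_correlation(summary, factors):
    """Pearson correlation between summary dimensions and latent factors.

    summary: (num_timesteps, summary_dim); factors: (num_timesteps, num_factors).
    Returns an array of shape (summary_dim, num_factors).
    Assumes no column is constant (otherwise the norm below is zero).
    """
    s = summary - summary.mean(axis=0)
    f = factors - factors.mean(axis=0)
    cov = s.T @ f
    norms = np.outer(np.linalg.norm(s, axis=0), np.linalg.norm(f, axis=0))
    return cov / norms

# Toy data: the first summary dimension tracks the factor, the second is noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 1))
summary = np.hstack([2.0 * factors + 0.1 * rng.normal(size=(200, 1)),
                     rng.normal(size=(200, 1))])
corr = summary_factor_correlation(summary, factors)
print(corr.shape)  # (2, 1)
```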
## Equities Portfolio Prediction Experiments

### Dataset Setup
To run the equities experiments, you will need CRSP data, which can be downloaded from
[CRSP](https://www.crsp.org/indexes/) given the appropriate license. If you have a different data source for equities data,
or want to use a different feature set,
you will need to modify the dataloader in `src/dataloaders/dataloader_equities.py`.

Before running the training scripts, lightly preprocess the raw data as described below.

First, set the environment variable `EQUITIES_DATA_PATH` to the directory containing the equities data.
```
export EQUITIES_DATA_PATH="/path/to/equities/data"
```
The raw CRSP data is assumed to be stored as `daily_price_data.npz` in the `EQUITIES_DATA_PATH` directory.
Include the quotes " " around the path.

Then run the preprocessing script:

```
python scripts/notebooks/equities/process_raw_equities_data.py
```
The file path used by the script is the path to the NPZ file containing the raw, unprocessed equities data from CRSP.

It's assumed that daily_price_data.npz contains a dictionary with the following keys:

- **compustat_tensor**: np.ndarray of shape (dates, permnos, accounting_vars), (745, 36669, 31)
- **accounting_vars**: List of accounting variables (31,)
- **crsp_tensor**: np.ndarray of shape (dates, permnos, monthly_price_vars), (745, 36669, 10)
- **monthly_price_vars**: List of monthly price variables (10,)
- **dates**: List of months (745,)
- **ff_3f_daily**: np.ndarray of shape (daily_dates, ff_3f), (15628, 3)
- **rf_daily**: np.ndarray of shape (daily_dates, ret), (15628, 1)
- **daily_crsp_tensor**: np.ndarray of shape (daily_dates, permnos, daily_price_vars), (15628, 36669, 5)
- **daily_price_vars**: List of daily price variables (5,)
- **daily_dates**: np.ndarray of shape (15628,)
- **compustat_yr_tensor_filled**: np.ndarray of shape (dates, permnos, yr_accounting_vars), (745, 36669, 16)
- **compustat_yr_tensor**: np.ndarray of shape (dates, permnos, yr_accounting_vars), (745, 36669, 16)
- **yr_accounting_vars**: List of yearly accounting variables (16,)
- **ff_monthly_data**: np.ndarray of shape (dates, ff), (745, 3)
- **rf_monthly_rate**: np.ndarray of shape (dates, rf_rate), (745, 1)
- **returns**: np.ndarray of shape (dates, permnos), (745, 36669)
- **permnos**: np.ndarray of shape (permnos,), (36669,)

Here, `permnos` holds the unique identifier of each asset.
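
Before preprocessing, it can help to confirm that the archive has the expected layout. The sketch below assumes the keys listed above are top-level entries of the `.npz` archive; the helper `inspect_equities_npz` is hypothetical and not part of the repository. It reads only the small index arrays, so the large tensors are never loaded into memory.

```python
import os
import numpy as np

EXPECTED_KEYS = [
    "compustat_tensor", "accounting_vars", "crsp_tensor", "monthly_price_vars",
    "dates", "ff_3f_daily", "rf_daily", "daily_crsp_tensor", "daily_price_vars",
    "daily_dates", "compustat_yr_tensor_filled", "compustat_yr_tensor",
    "yr_accounting_vars", "ff_monthly_data", "rf_monthly_rate", "returns",
    "permnos",
]

def inspect_equities_npz(path):
    """Return (missing_keys, sizes) for the raw equities archive at `path`."""
    # If the index arrays were saved as object arrays, np.load would need
    # allow_pickle=True; only enable that for files you trust.
    with np.load(path) as data:
        missing = [key for key in EXPECTED_KEYS if key not in data.files]
        sizes = {}
        for key in ("dates", "permnos", "daily_dates"):
            if key in data.files:
                sizes[key] = len(data[key])
    return missing, sizes

# Inspect the archive if it exists in the configured data directory.
npz_path = os.path.join(os.environ.get("EQUITIES_DATA_PATH", "."),
                        "daily_price_data.npz")
if os.path.exists(npz_path):
    missing, sizes = inspect_equities_npz(npz_path)
    print("missing keys:", missing)
    print("sizes:", sizes)
```

For the data described above, the sizes should come out as 745 dates, 36669 permnos, and 15628 daily dates.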

The training script creates the majority of the features from the processed data.

Three features are not created by the training script; create them by running:

```
python src/dataloaders/features_equities.py
```

### Rolling training of Set-Sequence Model and Baselines
When training the first model, you can set `dataset.save_data` to True to avoid recomputing the dataset for each subsequent model.

Set-Sequence Model (Ours):
```
chmod +x rolling_train_equities_set_seq.sh
./rolling_train_equities_set_seq.sh
```

MHA:
```
chmod +x rolling_train_equities_mha.sh
./rolling_train_equities_mha.sh
```

Hyena:
```
chmod +x rolling_train_equities_hyena.sh
./rolling_train_equities_hyena.sh
```

S4:
```
chmod +x rolling_train_equities_s4.sh
./rolling_train_equities_s4.sh
```

Long Conv:
```
chmod +x rolling_train_equities_lc.sh
./rolling_train_equities_lc.sh
```

H3:
```
chmod +x rolling_train_equities_h3.sh
./rolling_train_equities_h3.sh
```

### Evaluation of trained models
In the commands below, replace `YYYY-MM-DD` with the date on which the model was trained.

2002 - 2016 Set-Sequence Model:
```
python eval_equities.py --model_name set_MLP --date YYYY-MM-DD --experiment_yaml equities/equities_set_seq --use_seed --seed_numbers 1,2,3,4,5 --end_year 2016
```

2002 - 2021 Set-Sequence Model:
```
python eval_equities.py --model_name set_MLP --date YYYY-MM-DD --experiment_yaml equities/equities_set_seq --use_seed --seed_numbers 1,2,3,4,5
```

2002 - 2021 Long Conv:
``` 
python eval_equities.py --model_name long-conv --experiment_yaml equities/equities_lc --date YYYY-MM-DD --use_seed --seed_numbers 1,2,3,4,5
```

2002 - 2021 S4:
```
python eval_equities.py --model_name s4_simple --experiment_yaml equities/equities_s4 --date YYYY-MM-DD --use_seed --seed_numbers 1,2,3,4,5
```

2002 - 2021 H3:
```
python eval_equities.py --model_name h3 --experiment_yaml equities/equities_h3 --date YYYY-MM-DD --use_seed --seed_numbers 1,2,3,4,5
```

2002 - 2021 MHA:
```
python eval_equities.py --model_name mha --experiment_yaml equities/equities_mha --date YYYY-MM-DD --use_seed --seed_numbers 1,2,3,4,5
```

2002 - 2021 Hyena:
```
python eval_equities.py --model_name hyena --experiment_yaml equities/equities_hyena --date YYYY-MM-DD --use_seed --seed_numbers 1,2,3,4,5
```
### Rolling training with transaction costs + Set-Sequence Model
Set-Sequence Model (Ours):
```
chmod +x rolling_train_equities_set_seq_transaction_cost.sh
./rolling_train_equities_set_seq_transaction_cost.sh
```

MHA:
```
chmod +x rolling_train_equities_mha_transaction_cost.sh
./rolling_train_equities_mha_transaction_cost.sh
```

Hyena:
```
chmod +x rolling_train_equities_hyena_transaction_cost.sh
./rolling_train_equities_hyena_transaction_cost.sh
```

S4:
```
chmod +x rolling_train_equities_s4_transaction_cost.sh
./rolling_train_equities_s4_transaction_cost.sh
```

Long Conv:
```
chmod +x rolling_train_equities_lc_transaction_cost.sh
./rolling_train_equities_lc_transaction_cost.sh
```

H3:
```
chmod +x rolling_train_equities_h3_transaction_cost.sh
./rolling_train_equities_h3_transaction_cost.sh
```

## Mortgage Risk Prediction

### Dataset Setup
To run the mortgage experiments, you will need a subscription to CoreLogic (recently renamed to Cotality) to download the
loan-level data; see [Cotality](https://www.cotality.com/products/market-intelligence).
Details on downloading and preprocessing the data are in the [CoreLogic README](scripts/notebooks/raw_data_processing_corelogic/README.md).

To run the experiments, you will need to set the environment variable CORELOGIC_DATA_PATH to the directory containing the
data files.

```
export CORELOGIC_DATA_PATH="/path/to/corelogic/data"
```
Include the quotes " " around the path.
### Experiments
Train the Set-Sequence model on CoreLogic data (Table 6 in the paper):
```
python -m train experiment=timeseries/corelogic/cl_set-seq.yaml
```
Train baselines:
```
python -m train experiment=timeseries/corelogic/cl_mha.yaml
python -m train experiment=timeseries/corelogic/cl_hyena.yaml
python -m train experiment=timeseries/corelogic/cl_s4.yaml
python -m train experiment=timeseries/corelogic/cl_lc.yaml
python -m train experiment=timeseries/corelogic/cl_h3.yaml
python -m train experiment=timeseries/corelogic/cl_linear.yaml
python -m train experiment=timeseries/corelogic/cl_5l_nn.yaml
```
### Rolling Training
To assess the robustness of the Set-Sequence model over time, we perform rolling training experiments
(see Figures 15 and 16 in the paper).

```
python -m train experiment=timeseries/corelogic/cl_rolling_retrain_set_seq.yaml
python -m train experiment=timeseries/corelogic/cl_rolling_retrain_nn.yaml
python -m train experiment=timeseries/corelogic/cl_rolling_retrain_linear.yaml
python -m train experiment=timeseries/corelogic/cl_rolling_retrain_gated_selection.yaml
```

## Customized Datasets
The Set-Sequence model can be applied to a wide range of datasets with
exchangeable structure. To add a new dataset, create a new dataloader; see,
for instance, the equities dataloader in `src/dataloaders/dataloader_equities.py`.
The `__getitem__` function of the dataloader should return a tensor of covariates
X with shape (num_covariates, num_units, num_timesteps) and a tensor of targets
Y with shape (num_units, num_timesteps, output_dim), where output_dim is the dimension
of the target variable, for instance 1 for regression tasks and num_classes for classification tasks.
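
A minimal sketch of a dataset with this interface is shown below. It is not taken from the repository: the class name and constructor arguments are hypothetical, the data is random placeholder data, and the actual dataloaders may return additional items. In a PyTorch setup, the class would typically subclass `torch.utils.data.Dataset`.

```python
import numpy as np

class ToyExchangeableDataset:
    """Map-style dataset returning (X, Y) with the shapes described above."""

    def __init__(self, num_samples=4, num_covariates=3, num_units=10,
                 num_timesteps=24, output_dim=1, seed=0):
        self.num_samples = num_samples
        self.shape_x = (num_covariates, num_units, num_timesteps)
        self.shape_y = (num_units, num_timesteps, output_dim)
        self.seed = seed

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        # Random placeholders; a real dataset would load covariates and targets.
        rng = np.random.default_rng(self.seed + idx)
        x = rng.normal(size=self.shape_x).astype(np.float32)
        y = rng.normal(size=self.shape_y).astype(np.float32)
        return x, y

dataset = ToyExchangeableDataset()
x, y = dataset[0]
print(x.shape, y.shape)  # (3, 10, 24) (10, 24, 1)
```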
