# Overview of code usage for the accompanying paper titled "EEG-Language Pretraining for Highly Label-Efficient Pathology Detection"

Here we share all the critical code used for preprocessing and analysis after downloading the anonymized data.
The code uses Python 3.10.9. Pretraining was performed with PyTorch 1.12.1 and used <24 GB of GPU RAM.

**Installation**
`conda create -n elm python=3.10.9`
`conda activate elm`
`pip install -r requirements.txt`
To match the versions used in the project:
`conda install pytorch==1.12.1 torchvision torchaudio cudatoolkit=11.3 numpy=1.23.5 -c pytorch`


**EEG Preprocessing**

Step 1. Read in the .edf EEG files and apply the EEG preprocessing:
`preprocess_TUEG` function from utils/preprocess_TUEG.py

Step 2. The preprocessed files are then aggregated into an .h5 file for more efficient data loading:
`TUEG_to_h5_epochs` function from utils/preprocess_TUEG.py
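As a loose illustration of why the crops are aggregated into a single .h5 file, the sketch below writes preprocessed EEG crops into one HDF5 dataset and then reads back a single crop without loading the whole array. The dataset name and shapes here are hypothetical; the actual layout is defined by `TUEG_to_h5_epochs` in utils/preprocess_TUEG.py.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical shapes for illustration: 50 crops of 21 channels x 1280
# samples; the real layout is produced by TUEG_to_h5_epochs.
rng = np.random.default_rng(0)
crops = rng.standard_normal((50, 21, 1280)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "tueg_epochs_demo.h5")
with h5py.File(path, "w") as f:
    # One contiguous, chunked dataset allows fast random access per crop.
    f.create_dataset("epochs", data=crops, chunks=(1, 21, 1280))

with h5py.File(path, "r") as f:
    crop = f["epochs"][7]  # reads only this crop from disk
print(crop.shape)  # (21, 1280)
```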



**Text Preprocessing**
Clinical reports are preprocessed online via the `preprocess_report` function from utils/ELM_utils.py,
which is called by the PyTorch dataset (datasets/datasets.py) during training.
One analysis instead uses LLM summaries of the reports. These are prepared using the `LLM_summarization` function from utils/ELM_utils.py.
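For intuition, here is a minimal sketch of the kind of online report cleaning such a function might perform; the specific steps below (lowercasing, stripping a boilerplate section header, whitespace normalization) are illustrative assumptions, not the actual implementation of `preprocess_report` in utils/ELM_utils.py.

```python
import re

def preprocess_report_sketch(report: str) -> str:
    """Illustrative report cleaning; the real steps live in
    utils/ELM_utils.py (preprocess_report)."""
    text = report.lower()
    # Drop a hypothetical boilerplate header such as "clinical history:".
    text = re.sub(r"^\s*clinical history:\s*", "", text)
    # Collapse runs of whitespace and newlines into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess_report_sketch("CLINICAL HISTORY:\n  62 year old  patient"))
# "62 year old patient"
```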



**Pretraining** (GPU)
EEG-Language pretraining is run as follows:
`torchrun --nproc_per_node=1 --master_port 12000 run_DL.py -f elm_mil_pretrain.yaml`
where -f takes the filename of a .yaml configuration file. Configuration files are provided under /configs.

Pretraining requires:
- The preprocessed EEG .h5 file
- Text reports in a .json file
- A `*_indices.npy` file: a 1D NumPy array specifying which EEG crops should be sampled, corresponding to the desired train or test splits.
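The `*_indices.npy` split file can be produced with plain NumPy. The split below is made up for illustration, but the format (a 1D integer array of crop indices) matches the description above.

```python
import os
import tempfile

import numpy as np

# Suppose the .h5 file holds 1000 EEG crops and the first 800 belong to
# the training split (a hypothetical split, purely for illustration).
train_indices = np.arange(800, dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), "train_indices.npy")
np.save(path, train_indices)

loaded = np.load(path)
print(loaded.shape, loaded.dtype)  # (800,) int64
```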



**Linear probing** (CPU)
Given embeddings created by run_DL.py, linear probes are run as follows:
`python run_ML.py -f linear_probe.yaml`
