# Missing Data

**TL;DR** See the [Example](#example) section for basic usage and the [Add new model](#add-new-model) section for adding a new model.

## Installation

```bash
pip install -r requirements.txt
```

TODO: Add dataset initialization

## Usage

### Train model

```bash
python scripts/train.py \
    -f, --config-file CONFIG_FILE \
    [-o, --output-dir OUTPUT_DIR] \
    [--dev] \
    [additional options]
```

You can override any value in the config file with a `--<config-key> <value>` option.
For example, `--train.early_stopping 10` sets `early_stopping` to 10.
Similarly, `--train.learning_rate 0.1` sets `learning_rate` to 0.1; `-lr 0.1` does the same, because `-lr` is registered as an alias of `--train.learning_rate` in `scripts/train.py`.
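
As a rough illustration of how dotted overrides like these can map onto a nested config, here is a minimal sketch. It is not the code in `scripts/train.py`: the `apply_overrides` function, the `ALIASES` table, and the type-casting rule are all assumptions made for this example.

```python
import copy

# Hypothetical alias table; the real aliases are registered in scripts/train.py.
ALIASES = {"-lr": "--train.learning_rate"}


def _cast(raw, old):
    """Cast the command-line string `raw` to the type of the existing value."""
    if isinstance(old, bool):
        return raw.lower() in ("1", "true", "yes")
    if isinstance(old, (int, float)):
        return type(old)(raw)
    return raw  # strings and null values are kept as plain strings


def apply_overrides(config, argv, aliases=ALIASES):
    """Return a copy of `config` with `--section.key value` pairs applied."""
    config = copy.deepcopy(config)
    for flag, raw in zip(argv[::2], argv[1::2]):
        flag = aliases.get(flag, flag)
        if not flag.startswith("--"):
            raise ValueError(f"unknown option: {flag}")
        *parents, leaf = flag[2:].split(".")
        node = config
        for key in parents:
            node = node[key]  # KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {flag[2:]}")
        node[leaf] = _cast(raw, node[leaf])
    return config


base = {"train": {"learning_rate": 0.001, "early_stopping": 30}, "model": {"n_units": 32}}
new = apply_overrides(base, ["-lr", "0.005", "--model.n_units", "64"])
print(new["train"]["learning_rate"], new["model"]["n_units"])  # 0.005 64
```

Rejecting keys that are not already in the config is a design choice of this sketch; it catches typos such as `--train.learning_rat` early.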

**NOTE**: The outputs are actually saved in the `outs/_/<date>-<time>-<id>/` directory.
          The `--output-dir` option only creates a link to that directory.
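
The relationship between the real output directory and `--output-dir` can be pictured with the sketch below. It assumes the link is a symbolic link placed inside the requested output directory, and that the run name is a `%Y%m%d-%H%M%S` timestamp plus a short random id. Neither detail is confirmed by this README, and `make_run_dir` is a hypothetical helper.

```python
import tempfile
import uuid
from datetime import datetime
from pathlib import Path


def make_run_dir(root, output_dir, dev=False):
    """Create the real run directory under `<root>/_/` and link it from `output_dir`."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")  # assumed <date>-<time> format
    run_name = f"{stamp}-{uuid.uuid4().hex[:6]}"      # assumed <id> format
    if dev:
        run_name = "dev-" + run_name                  # prefix added for development runs
    real_dir = Path(root) / "_" / run_name
    real_dir.mkdir(parents=True)

    # The requested output directory only holds a link to the real directory.
    link = Path(output_dir) / run_name
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(real_dir.resolve(), target_is_directory=True)
    return real_dir, link


# Demo inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    outs = Path(tmp) / "outs"
    real_dir, link = make_run_dir(outs, outs / "physionet2012" / "GRU" / "test1")
    print(link.relative_to(tmp), "->", real_dir.relative_to(tmp))
```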

#### See help

```bash
python scripts/train.py -f CONFIG_FILE --help
```

#### Example

- Basic case:

    ```bash
    python scripts/train.py -f configs/physionet2012/GRU/base.yaml
    ```

  - Use config file `configs/physionet2012/GRU/base.yaml` to train a GRU model.
  - Save outputs under `outs/physionet2012/GRU/` (the default output directory defined in `scripts/train.py`).

- Advanced case:

    ```bash
    python scripts/train.py \
        -f configs/physionet2012/GRU/base.yaml \
        -o outs/physionet2012/GRU/test1 \
        -lr 0.005 \
        --model.n_units 64 \
        --dev
    ```

  - Use config file `configs/physionet2012/GRU/base.yaml` to train a GRU model.
  - Save outputs under `outs/physionet2012/GRU/test1/`.
  - Set learning rate `0.005`.
  - Change the model argument `n_units` to `64`.
  - Mark this as a development run.

## Add new model

1. Make a new file `missing/models/<new-model>.py` and implement the model.
   The model's `__init__` function should take `output_activation` and `output_dims` as its first two arguments,
   and the model should have a `get_config(self)` method that returns a dictionary of the model's configuration.
   For example,

    ```python
    # missing/models/new_model.py
    from tensorflow import keras  # assumes the Keras bundled with TensorFlow


    class NewModel(keras.Model):
        def __init__(self, output_activation, output_dims, n_layers, n_hidden):
            # Record the constructor arguments so get_config() can return them.
            self._config = {k: v for k, v in locals().items() if k not in ["self", "__class__"]}
            super().__init__()

            # do initialization here

        def get_config(self):
            return self._config

        def call(self, inputs):
            ...
    ```

2. Add the model to `missing/models/__init__.py`. For example,

    ```python
    from .new_model import NewModel  # Add NewModel from missing/models/new_model.py
    ```

3. Make a config file under the `configs/` directory.

    1. The file name and location are not restricted, but the file should use the following format:

        ```yaml
        # configs/physionet2012/NewModel/base.yaml
        # (but it also can be something like 'configs/new_model.yaml'.)

        model:
          name: NewModel
          # ==============================
          # Add the model arguments here
          n_layers: 5
          n_hidden: 64
          # ==============================
        dataset:
          name: physionet2012
          balance: true
          loss: binary_crossentropy
          output_dims: 1
          output_activation: sigmoid
          metrics:
            auprc: metrics.auprc
            auroc: metrics.auroc
            brier: metrics.brier
            ece: metrics.ece
            logloss: metrics.logloss
            accuracy: metrics.accuracy
        train:
          seed: null
          max_epochs: 1000
          batch_size: 256
          learning_rate: 0.001
          warmup_steps: 0
          early_stopping: 30
          monitor_quantity: auprc
          direction_of_improvement: max
        test:
          seed: null
          ensemble_size: 30
        ```

        However, this is verbose, so we recommend splitting the config into partial files. (See below.)

    2. You can use predefined partial config files to reduce the verbosity.

        First, make the basic partial config file for the model.

        ```yaml
        # configs/models/new_model.yaml

        name: NewModel
        # ==============================
        # Add the model arguments here
        n_layers: 5
        n_hidden: 64
        # ==============================
        ```

        Second, include the partial config files in the main config file.

        ```yaml
        # configs/physionet2012/NewModel/base.yaml

        model:   !include configs/models/new_model.yaml
        dataset: !include configs/physionet2012/dataset.yaml  # predefined dataset configs
        train:   !include configs/physionet2012/train.yaml    # predefined train configs
        test:    !include configs/physionet2012/test.yaml     # predefined test configs
        ```

        Then, you can train the model by using the following command:

        ```bash
        python scripts/train.py -f configs/physionet2012/NewModel/base.yaml
        # you can see the outputs under `outs/physionet2012/NewModel/<date>-<time>-<id>/`
        ```

    3. For a model under development, the following config format may be more convenient.

        ```yaml
        # configs/physionet2012/NewModel/test1.yaml

        model:
          name: NewModel
          # ==============================
          # Add the model arguments here
          n_layers: 7
          n_hidden: 256
          # ==============================
        dataset: !include configs/physionet2012/dataset.yaml  # predefined dataset configs
        train:   !include configs/physionet2012/train.yaml    # predefined train configs
        test:    !include configs/physionet2012/test.yaml     # predefined test configs
        ```

        Then, train the model by using the following command:

        ```bash
        python scripts/train.py \
            -f configs/physionet2012/NewModel/test1.yaml \
            -o outs/physionet2012/NewModel/test1  # explicitly set the output directory
        # outputs under `outs/physionet2012/NewModel/test1/<date>-<time>-<id>/`
        ```

        You can also test other settings with additional options. For example,

        ```bash
        # -o explicitly sets the output directory
        python scripts/train.py \
            -f configs/physionet2012/NewModel/test1.yaml \
            -o outs/physionet2012/NewModel/test2 \
            --model.n_layers 3 \
            --model.n_hidden 32
        # outputs under `outs/physionet2012/NewModel/test2/<date>-<time>-<id>/`
        ```

        Marking a run as a development run can also be useful for debugging.
        The `--dev` option adds the `dev-` prefix to the output directory.

        ```bash
        # -o explicitly sets the output directory
        python scripts/train.py \
            -f configs/physionet2012/NewModel/test1.yaml \
            -o outs/physionet2012/NewModel/test3 \
            --model.n_layers 9 \
            --dev  # mark the train as a development run
        # outputs under `outs/physionet2012/NewModel/test3/dev-<date>-<time>-<id>/`
        ```

4. Commit the changes.
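
The steps above suggest that the training script looks up the class named by `model.name` and passes `output_activation` and `output_dims` from the dataset section as the first two arguments. A minimal sketch of that wiring follows. It uses a plain stand-in class instead of `keras.Model` so it runs without TensorFlow, and the `MODELS` registry and `build_model` helper are hypothetical names, not part of this repository.

```python
class NewModel:
    """Stand-in for the Keras model from step 1 (no TensorFlow required)."""

    def __init__(self, output_activation, output_dims, n_layers, n_hidden):
        self._config = {
            "output_activation": output_activation,
            "output_dims": output_dims,
            "n_layers": n_layers,
            "n_hidden": n_hidden,
        }

    def get_config(self):
        return self._config


# Hypothetical registry; in the repository the classes are exported
# from missing/models/__init__.py instead.
MODELS = {"NewModel": NewModel}


def build_model(config, registry=MODELS):
    """Instantiate the model described by a parsed config dictionary."""
    model_args = dict(config["model"])       # copy so the config is not mutated
    cls = registry[model_args.pop("name")]   # `name` selects the class
    dataset = config["dataset"]
    # The first two positional arguments come from the dataset section.
    return cls(dataset["output_activation"], dataset["output_dims"], **model_args)


config = {
    "model": {"name": "NewModel", "n_layers": 5, "n_hidden": 64},
    "dataset": {"name": "physionet2012", "output_dims": 1, "output_activation": "sigmoid"},
}
print(build_model(config).get_config())
```

Because every remaining key under `model` is forwarded as a keyword argument, a misspelled key in the config would surface as a `TypeError` from `__init__` in this sketch.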

<br>
<br>
<br>

## Appendix

### Dataset specs

#### physionet2012

```python
# data
TensorSpec(shape=(8,),       dtype=tf.float32)  # demographics(8)
TensorSpec(shape=(None,),    dtype=tf.float32)  # time
TensorSpec(shape=(None, 37), dtype=tf.float32)  # vitals(37)
TensorSpec(shape=(None, 37), dtype=tf.bool)     # measurement mask
TensorSpec(shape=(),         dtype=tf.int32)    # length

# label
TensorSpec(shape=(),         dtype=tf.int64)    # in_hospital_death (binary-label)
```

#### physionet2019

```python
# data
TensorSpec(shape=(4,),       dtype=tf.float32)  # demographics(4)
TensorSpec(shape=(None,),    dtype=tf.float32)  # time
TensorSpec(shape=(None, 34), dtype=tf.float32)  # lab_measurements(26) + vitals(8)
TensorSpec(shape=(None, 34), dtype=tf.bool)     # measurement mask
TensorSpec(shape=(),         dtype=tf.int32)    # length

# label
TensorSpec(shape=(None, 1),  dtype=tf.int32)    # sepsis (per-time binary-label)
```

#### mimic3_mortality

```python
# data
TensorSpec(shape=(1,),       dtype=tf.float32)  # demographics(1)
TensorSpec(shape=(None,),    dtype=tf.float32)  # time
TensorSpec(shape=(None, 16), dtype=tf.float32)  # interventions(5) + lab_measurements(4) + vitals(7)
TensorSpec(shape=(None, 16), dtype=tf.bool)     # measurement mask
TensorSpec(shape=(),         dtype=tf.int32)    # length

# label
TensorSpec(shape=(),         dtype=tf.int64)    # in_hospital_mortality (binary-label)
```

#### mimic3_phenotyping

```python
# data
TensorSpec(shape=(1,),       dtype=tf.float32)  # demographics(1)
TensorSpec(shape=(None,),    dtype=tf.float32)  # time
TensorSpec(shape=(None, 16), dtype=tf.float32)  # interventions(5) + lab_measurements(4) + vitals(7)
TensorSpec(shape=(None, 16), dtype=tf.bool)     # measurement mask
TensorSpec(shape=(),         dtype=tf.int32)    # length

# label
TensorSpec(shape=(25,),      dtype=tf.int32)    # phenotype (multi-label)
```
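
All four datasets share the same five-field input layout: demographics, time, values, measurement mask, and length. The sketch below shows one way irregular records could be arranged into that layout using plain Python lists. It assumes that the mask is `True` where a value was observed, that missing entries are filled with `0.0`, and that `length` is the number of time steps. These conventions are guesses for illustration, and `to_sample` is a hypothetical helper.

```python
def to_sample(demographics, records, n_features):
    """Arrange irregular records as (demographics, time, values, mask, length).

    `records` is a list of (time, {feature_index: value}) pairs; features
    missing from a record's dictionary are treated as unobserved.
    """
    records = sorted(records, key=lambda r: r[0])  # order by time
    time, values, mask = [], [], []
    for t, observed in records:
        row = [0.0] * n_features     # assumed fill value for missing entries
        seen = [False] * n_features  # assumed: True means "observed"
        for idx, value in observed.items():
            row[idx] = float(value)
            seen[idx] = True
        time.append(float(t))
        values.append(row)
        mask.append(seen)
    return list(demographics), time, values, mask, len(time)


# A physionet2012-shaped sample: 8 demographics and 37 features per time step.
demo, time, values, mask, length = to_sample(
    demographics=[0.0] * 8,
    records=[(1.5, {3: 98.6}), (0.0, {0: 72, 36: 1.2})],
    n_features=37,
)
print(length, time, sum(mask[0]), sum(mask[1]))  # 2 [0.0, 1.5] 2 1
```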
