# "Boosting Phonocardiogram Classification Performance with Function Generated Data"

**Accepted at ML4H 2025 @ San Diego**

## Overview

This repository provides a framework for synthetic phonocardiogram (PCG) data generation and classification. It implements multiple deep learning architectures that classify PCG signals into four classes: Normal, Aortic Regurgitation (AR), Aortic Stenosis (AS), and Mitral Regurgitation (MR).

The framework supports:
- **Synthetic data generation** using function-based approaches
- **Multiple model architectures**: Transformers (Standard, Causal), ResNet, EfficientNet, LSTM, GRU
- **Multi-seed experimental framework** for reproducible results
- **Hyperparameter optimization** with Optuna
- **Data augmentation techniques**: masking, shifting, stretching, scaling
- **Comprehensive evaluation** with demographic analysis

## Project Structure

```
SynPCGdev/
├── config.yaml                    # Main configuration file
├── Dockerfile                     # Docker container configuration
├── invoke_container.sh           # Container management script
├── src/
│   ├── clf/                      # Classification experiments
│   │   ├── train.py              # Training entry point
│   │   ├── evaluate.py           # Evaluation entry point
│   │   ├── codes/
│   │   │   ├── run_train.py      # Training pipeline
│   │   │   ├── run_eval.py       # Evaluation pipeline
│   │   │   ├── trainer.py        # Trainer class
│   │   │   ├── evaluator.py      # Evaluator class
│   │   │   └── data/             # Data loading modules
│   │   └── resources/            # Experiment configurations
│   └── common/                   # Shared utilities
│       ├── model/                # Model architectures
│       ├── base_trainer.py       # Base training infrastructure
│       └── utils.py              # Utility functions
└── requirements.txt              # Python dependencies
```

## Installation

### Prerequisites

- Docker
- NVIDIA GPU with CUDA support
- `yq` for YAML parsing (install via `brew install yq` or `snap install yq`)

### Container Setup

1. **Configure paths** in `config.yaml`:
   ```yaml
   path:
     base_dir: .  # Repository root (default)
     dataset_dir: dataset  # Dataset location
   ```

2. **Build the Docker image**:
   ```bash
   ./invoke_container.sh build
   ```

   This builds an image based on `nvcr.io/nvidia/pytorch:24.01-py3` with Python 3.8.6 installed via pyenv.

3. **Start the container**:
   ```bash
   ./invoke_container.sh restart
   ```

   The container will mount the data path specified in `config.yaml` and provide an interactive shell.

## Usage

### Training

Navigate to the classification directory and run training:

```bash
cd src/clf
python train.py --exp <EXPERIMENT_ID> --device cuda:0 [--multirun] [--debug]
```

**Arguments**:
- `--exp`: Experiment ID (selects the config file `resources/exp{XX}s/exp{XXXX}.yaml`)
- `--device`: GPU device (default: `cuda:0`)
- `--multirun`: Enable multi-seed training (runs across all seeds defined in config)
- `--debug`: Enable debug mode (uses limited data and fast iterations)

**Example**:
```bash
python train.py --exp 1 --device cuda:0
```
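
The flags above map naturally onto `argparse`. The following is a minimal sketch, not the actual contents of `train.py`: the function name `build_parser`, the integer type for `--exp`, and making `--exp` required are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags of train.py."""
    parser = argparse.ArgumentParser(description="PCG classification training (sketch)")
    # Assumption: --exp is required and parsed as an integer.
    parser.add_argument("--exp", type=int, required=True, help="experiment ID")
    parser.add_argument("--device", default="cuda:0", help="device string, e.g. cuda:0")
    parser.add_argument("--multirun", action="store_true", help="train once per configured seed")
    parser.add_argument("--debug", action="store_true", help="limited data, fast iterations")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--exp", "1", "--multirun"])
print(args.exp, args.device, args.multirun, args.debug)  # prints: 1 cuda:0 True False
```

Both boolean flags default to `False`, so omitting them gives a single-seed, full-data run.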

**Training Pipeline**:
1. Random seed initialization for reproducibility
2. Result directory preparation with timestamps
3. Model initialization and optional pretrained weight loading
4. Dataloader preparation with augmentation
5. Loss function configuration (supports class weighting)
6. Training execution with validation monitoring
7. Model checkpointing based on validation performance
8. Early stopping, or Optuna pruning when running a hyperparameter search
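
Steps 6 to 8 can be illustrated with a patience-based early stopping loop. This is a sketch under assumptions, not the repository's `Trainer`: the class name `EarlyStopping`, the helper `run`, and the choice of a monitored metric where higher is better (for example validation F1) are hypothetical. Patience is counted in validation checks, so with `eval_every > 1` it spans more epochs.

```python
class EarlyStopping:
    """Patience-based early stopping on a metric where higher is better."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best = float("-inf")
        self.bad_checks = 0  # validation checks since the last improvement

    def step(self, value: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if value > self.best:
            self.best = value
            self.bad_checks = 0  # improvement: a checkpoint would be saved here
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def run(val_scores, patience):
    """Replay a list of validation scores; return (last epoch run, best score)."""
    stopper = EarlyStopping(patience)
    for epoch, score in enumerate(val_scores, start=1):
        if stopper.step(score):
            return epoch, stopper.best
    return len(val_scores), stopper.best


# Toy scores for illustration only, not results from this project.
print(run([0.5, 0.6, 0.58, 0.59, 0.57, 0.7], patience=3))  # prints: (5, 0.6)
```

In the toy run the later score of 0.7 is never reached, which is the usual trade-off of a small patience value.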

**Output**: Results are saved to `results/train/exp{XXXX}/multirun/train/seed{XXXX}/` containing:
- `net.pth`: Best model checkpoint
- `params.pkl`: Training parameters
- `log.csv`: Training and validation metrics per epoch

### Evaluation

Run evaluation on trained models:

```bash
cd src/clf
python evaluate.py --exp <EXPERIMENT_ID> --device cuda:0 [--multirun] [--debug]
```

**Arguments**: Same as for `train.py`.

**Example**:
```bash
# Evaluate on single seed
python evaluate.py --exp 1 --device cuda:0

# Evaluate across all seeds
python evaluate.py --exp 1 --device cuda:0 --multirun
```

**Evaluation Pipeline**:
1. Locates trained models from training output directories
2. Loads model weights and configuration
3. Runs evaluation on validation and test sets
4. Computes comprehensive metrics (F1, Precision, Recall, AUROC, AUPRC)
5. Generates classification reports and confusion matrices
6. Aggregates results across multiple seeds
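
To make steps 4 and 5 concrete, the sketch below derives per-class precision, recall, and F1 from a confusion matrix using only the standard library. The function names are hypothetical, macro averaging is an assumption (the averaging mode used by the evaluator is not stated here), and AUROC and AUPRC are omitted because they need predicted scores rather than hard labels.

```python
CLASSES = ["Normal", "AR", "AS", "MR"]


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out.append((precision, recall, f1))
    return out


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    metrics = per_class_metrics(cm)
    return sum(f1 for _, _, f1 in metrics) / len(metrics)


# Toy labels for illustration only (indices follow CLASSES).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]
cm = confusion_matrix(y_true, y_pred, len(CLASSES))
for name, (p, r, f) in zip(CLASSES, per_class_metrics(cm)):
    print(f"{name:>6}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1 = {macro_f1(cm):.3f}")
```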

**Output**: Results are saved to `results/train/exp{XXXX}/multirun/eval/` containing:
- `ResultTableMultiSeed.csv`: Aggregated metrics across all seeds
- `report.txt`: Detailed classification reports for each seed
- `demo_result.csv`: Per-sample results with demographics (if enabled)
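
One plausible way to aggregate metrics across seeds is a mean and sample standard deviation per metric. The sketch below assumes that form of aggregation; the actual layout of `ResultTableMultiSeed.csv` is not documented here, and the function name `aggregate_seeds` and the numbers are purely illustrative.

```python
import statistics


def aggregate_seeds(per_seed, columns):
    """per_seed maps seed -> {metric: value}; returns metric -> (mean, std)."""
    summary = {}
    for col in columns:
        values = [per_seed[seed][col] for seed in sorted(per_seed)]
        mean = statistics.mean(values)
        # Sample standard deviation needs at least two seeds.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[col] = (mean, std)
    return summary


# Made-up values for illustration only, not results from this project.
per_seed = {
    0: {"f1score": 0.80, "AUROC": 0.90},
    1: {"f1score": 0.82, "AUROC": 0.92},
    2: {"f1score": 0.84, "AUROC": 0.94},
}
summary = aggregate_seeds(per_seed, ["f1score", "AUROC"])
for metric, (mean, std) in summary.items():
    print(f"{metric}: {mean:.3f} +/- {std:.3f}")
```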

## Configuration

Experiments are configured through YAML files: a repository-wide `config.yaml` and one file per experiment under `resources/`.

### Path Configuration

**Important**: Configure these paths in `config.yaml` before running experiments:

```yaml
path:
  # Base directory for all data and results (default: repository root)
  # Modify this to your preferred location
  base_dir: .  # Current directory (repository root)

  # Dataset directory (relative to base_dir)
  # Place your processed datasets here
  dataset_dir: dataset
```

**Directory Structure**:
```
SynPCGdev/              # Repository root (base_dir)
├── dataset/            # Datasets (dataset_dir)
│   ├── buet/          # BMD-HS processed data
│   └── ...
├── dataset_syn/        # Synthetic datasets
├── experiment_v01/     # Experiment results
└── ...
```

**Notes**:
- Paths can be absolute or relative; with the default `base_dir: .`, relative paths resolve against the repository root
- `base_dir`: Location for datasets, experiments, and synthetic data
- `dataset_dir`: Subdirectory for processed datasets (relative to base_dir)
- Results are saved to `{base_dir}/experiment_v01/exp{XX}s/exp{XXXX}/`
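
The resolution rules in these notes can be sketched with `pathlib`. The function name `resolve_paths` and the returned keys are hypothetical, and the code is an illustration of the rules rather than the repository's own loader.

```python
from pathlib import Path


def resolve_paths(repo_root, base_dir=".", dataset_dir="dataset"):
    """Resolve base_dir against the repository root and dataset_dir against base_dir."""
    repo_root = Path(repo_root)
    base = Path(base_dir)
    if not base.is_absolute():
        base = repo_root / base  # relative base_dir is taken from the repository root
    return {
        "base_dir": base,
        "dataset_dir": base / dataset_dir,    # joining an absolute dataset_dir keeps it as-is
        "dataset_syn": base / "dataset_syn",  # synthetic datasets
        "experiment": base / "experiment_v01",
    }


# Default configuration: everything lives under the repository root.
for name, path in resolve_paths("/repo").items():
    print(f"{name}: {path}")

# Absolute base_dir: data and results move out of the repository.
print(resolve_paths("/repo", base_dir="/data/pcg")["dataset_dir"])
```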

### Main Configuration (`config.yaml`)

```yaml
experiment:
  clf_exp01:
    seed:
      singleseed: 0
      multiseed: [0, 1, 2, 3, 4]  # Multiple seeds for reproducibility
    result_cols:  # Metrics columns in result CSV
      - f1score
      - Recall
      - Precision
      - AUROC
      - AUPRC

dataset:
  path: /path/to/data  # Mounted in Docker container
```
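
The `seed` block suggests how `--multirun` chooses its seeds. A minimal sketch, assuming the flag simply switches between `singleseed` and `multiseed`; `select_seeds` is a hypothetical helper, and the dict literal stands in for the parsed YAML so the example has no dependencies.

```python
def select_seeds(seed_cfg, multirun):
    """Return the list of seeds to run for the given mode."""
    if multirun:
        return list(seed_cfg["multiseed"])
    return [seed_cfg["singleseed"]]


# Stand-in for the parsed config.yaml shown above.
config = {
    "experiment": {
        "clf_exp01": {
            "seed": {"singleseed": 0, "multiseed": [0, 1, 2, 3, 4]},
        }
    }
}
seed_cfg = config["experiment"]["clf_exp01"]["seed"]

print(select_seeds(seed_cfg, multirun=False))  # prints: [0]
print(select_seeds(seed_cfg, multirun=True))   # prints: [0, 1, 2, 3, 4]
```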

### Experiment Configuration (`resources/exp{XX}s/exp{XXXX}.yaml`)

Example configuration structure:

```yaml
# Model architecture
modelname: transformer_base  # Options: resnet18, effnet_b0, lstm, gru, transformer_base, etc.

# Data settings
dataset: pcg_dataset_name
data_lim: null  # Limit training data (null = use all)
val_data_lim: null

# Training hyperparameters
epochs: 100
batch_size: 32
learning_rate: 0.0001
patience: 20  # Early stopping patience
eval_every: 1  # Validate every N epochs

# Class weighting for imbalance
class_weight: balanced  # Options: balanced, auto, manual-{value}

# Data augmentation
augmentation:
  masking: true
  shifting: true
  stretching: true
  scaling: true

# Model-specific parameters
emb_dim: 128
depth: 4
num_heads: 8
mlp_dim: 512

# Pretrained weights (optional)
finetune_target: /path/to/pretrained/model
freeze: false  # Freeze pretrained weights
```

## Supported Models

### Transformer-based
- `transformer_base`: Standard Transformer encoder
- `causal_transformer`: Causal Transformer for sequential modeling

### Convolutional Neural Networks
- `resnet18`, `resnet34`, `resnet50`: ResNet architectures
- `effnet_b0` through `effnet_b7`: EfficientNet models

### Recurrent Neural Networks
- `lstm`: Long Short-Term Memory networks
- `gru`: Gated Recurrent Units

All models support:
- Pretrained weight loading
- Weight freezing for transfer learning
- Custom embedding dimensions and depths
- Multi-GPU training with DataParallel

## Data Augmentation

The framework supports multiple augmentation techniques for PCG signals:

- **Masking**: Random time-domain masking for robust feature learning
- **Shifting**: Temporal shifting to handle phase variations
- **Stretching**: Time-scale modification for tempo variations
- **Scaling**: Amplitude scaling for intensity variations

Enable or disable each augmentation under the `augmentation` key of the experiment configuration file.
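
The four augmentations can be sketched on a plain list of samples. This is an illustration under assumptions, not the repository's implementation: the function names, the parameter ranges, the circular shift, and the crop-or-pad step after stretching are all hypothetical choices, and a real pipeline would more likely operate on NumPy arrays or tensors.

```python
import math
import random


def mask(signal, max_width, rng):
    """Zero out one random contiguous window of up to max_width samples."""
    width = rng.randint(0, min(max_width, len(signal)))
    start = rng.randint(0, len(signal) - width)
    return signal[:start] + [0.0] * width + signal[start + width:]


def shift(signal, max_shift, rng):
    """Circularly shift by a random offset in [-max_shift, max_shift]."""
    if not signal:
        return []
    k = rng.randint(-max_shift, max_shift) % len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def stretch(signal, low, high, rng):
    """Resample by a random factor (linear interpolation), then crop or zero-pad."""
    n = len(signal)
    if n < 2:
        return list(signal)
    new_len = max(2, round(n * rng.uniform(low, high)))
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)  # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    out = out[:n]                          # crop if longer than the input
    return out + [0.0] * (n - len(out))    # zero-pad if shorter


def scale(signal, low, high, rng):
    """Multiply every sample by one random gain in [low, high]."""
    gain = rng.uniform(low, high)
    return [gain * x for x in signal]


def augment(signal, flags, rng):
    """Apply each enabled augmentation in a fixed order (ranges are assumptions)."""
    out = list(signal)
    if flags.get("masking"):
        out = mask(out, 20, rng)
    if flags.get("shifting"):
        out = shift(out, 10, rng)
    if flags.get("stretching"):
        out = stretch(out, 0.9, 1.1, rng)
    if flags.get("scaling"):
        out = scale(out, 0.8, 1.2, rng)
    return out


rng = random.Random(0)  # seeded generator keeps the augmentation reproducible
pcg = [math.sin(2 * math.pi * i / 50) for i in range(200)]  # toy stand-in for a PCG
flags = {"masking": True, "shifting": True, "stretching": True, "scaling": True}
augmented = augment(pcg, flags, rng)
print(len(pcg), len(augmented))
```

Every function preserves the input length, so augmented signals can be batched with unaugmented ones.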

## Citation

If you use this code in your research, please cite:

```bibtex
@inproceedings{synpcgdev2025,
  title={Boosting Phonocardiogram Classification Performance with Function Generated Data},
  author={[Authors]},
  booktitle={Machine Learning for Health (ML4H) 2025},
  year={2025},
  address={San Diego, CA}
}
```

*(Full bibtex information will be updated after publication)*

## License

[License information to be added]

## Contact

For questions or issues, please open an issue on GitHub or contact the authors.
