# BMD-HS Dataset Preparation

This directory contains scripts for processing the BMD-HS (BUET Multi-disease Heart Sound) dataset.

## Dataset Information

- **Repository**: [sani002/BMD-HS-Dataset](https://github.com/sani002/BMD-HS-Dataset)
- **Description**: Heart sound recordings from patients with various valvular heart diseases
- **Format**: WAV audio files
- **Labels**: AS (Aortic Stenosis), AR (Aortic Regurgitation), MR (Mitral Regurgitation), MS (Mitral Stenosis), N (Normal)

## Quick Start

### Step 1: Download the Dataset Manually

1. Visit the BMD-HS repository: https://github.com/sani002/BMD-HS-Dataset
2. Download all `.wav` files from the repository
3. Create a directory:
   ```bash
   mkdir -p /path/to/data/bmdhs
   ```
4. Place all downloaded `.wav` files directly in `/path/to/data/bmdhs/`

**Expected files**: The directory should contain approximately 865 WAV files with names like:
- `MD_001_sup_Mit.wav`
- `MD_001_sup_Tri.wav`
- `MR_002_sup_Mit.wav`
- etc.
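
A quick way to sanity-check the download is to count the `.wav` files and flag names that do not fit the pattern. The sketch below is illustrative only: the pattern `<prefix>_<number>_<posture>_<site>.wav` is inferred from the example names above and the position codes in the File Mapping table, and `parse_name` and `summarize` are hypothetical helpers, not part of the provided scripts.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the example filenames in this README:
# <prefix>_<number>_<posture>_<auscultation site>.wav
FILENAME_RE = re.compile(
    r"^(?P<prefix>[A-Za-z]+)_(?P<number>\d+)"
    r"_(?P<posture>sup|sit)_(?P<site>Mit|Tri|Pul|Aor)\.wav$"
)


def parse_name(filename):
    """Return the parts of a filename as a dict, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None


def summarize(download_dir):
    """Count the .wav files in download_dir and list names that do not match."""
    names = sorted(p.name for p in Path(download_dir).glob("*.wav"))
    unexpected = [name for name in names if parse_name(name) is None]
    return len(names), unexpected
```

Calling `summarize("/path/to/data/bmdhs")` returns the file count and any unexpected names; a count far from 865 suggests an incomplete download.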

### Step 2: Process and Organize Files

Once you have downloaded the files, run the processing script:

```bash
python src/prep/dataset/buet.py \
    --download_dir /path/to/data/bmdhs \
    --processed_dir /path/to/data/bmdhs_processed
```

**What this does**:
- Reads `file_link_table.csv` to get file mapping information
- Copies files from `bmdhs/` to `bmdhs_processed/` with standardized naming
- Generates `metadata.csv` with all dataset information
- Verifies file integrity
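
For readers who want to see the shape of the copy-and-rename step, here is a minimal sketch. It is not the code in `buet.py` or `process_bmdhs.py`; it assumes that the `raw` and `rename` columns of `file_link_table.csv` hold bare filenames including the `.wav` extension, and `copy_with_mapping` is a hypothetical name.

```python
import csv
import shutil
from pathlib import Path


def copy_with_mapping(download_dir, processed_dir, csv_file, skip_existing=True):
    """Copy each file listed in the mapping table to its standardized name.

    Assumes the table has `raw` (source name) and `rename` (target name)
    columns that both include the file extension.
    """
    src_root, dst_root = Path(download_dir), Path(processed_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    with open(csv_file, newline="") as handle:
        for row in csv.DictReader(handle):
            src = src_root / row["raw"]
            dst = dst_root / row["rename"]
            if not src.is_file():
                missing.append(row["raw"])  # listed in the table but not downloaded
                continue
            if skip_existing and dst.exists():
                continue  # leave files from an earlier run untouched
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied.append(dst.name)
    return copied, missing
```

Returning the `missing` list rather than raising lets a caller report every absent file in one pass.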

## Alternative: Run `process_bmdhs.py` Directly

If you prefer more control, you can use `process_bmdhs.py` directly:

```bash
python src/prep/dataset/process_bmdhs.py \
    --download_dir /path/to/data/bmdhs \
    --processed_dir /path/to/data/bmdhs_processed \
    --csv_file src/prep/dataset/file_link_table.csv \
    --skip_existing \
    --verify
```

**Arguments:**
- `--download_dir`: Directory containing the downloaded `.wav` files (the files must sit directly in this directory, not in subdirectories)
- `--processed_dir`: Directory where processed files will be saved
- `--csv_file`: Path to `file_link_table.csv` (default: `file_link_table.csv` in the same directory as the script)
- `--skip_existing`: Skip files whose destination file already exists
- `--verify`: Verify file integrity after copying

## Output Structure

After processing, your data directory will have the following structure:

```
bmdhs/
├── MD_001_sup_Mit.wav
├── MD_001_sup_Tri.wav
├── ...
└── (original filenames from GitHub)

bmdhs_processed/
├── 00001.wav
├── 00002.wav
├── 00003.wav
├── ...
└── metadata.csv
```

## File Mapping

The `file_link_table.csv` file (already included in this repository) contains the mapping between original and processed filenames, along with metadata:

| Column | Description |
|--------|-------------|
| `raw` | Original filename from the dataset |
| `rename` | New standardized filename |
| `patient_id` | Unique patient identifier |
| `sitting_position` | Patient posture during recording: `sup` (supine) or `sit` (sitting) |
| `auscultation_position` | Auscultation site: `Mit` (Mitral), `Tri` (Tricuspid), `Pul` (Pulmonary), `Aor` (Aortic) |
| `AS`, `AR`, `MR`, `MS`, `N` | Disease labels (binary) |
| `Age`, `Gender`, `Smoker`, `Lives` | Patient demographics |
| `is_multi_vhd` | Multiple valvular heart diseases flag |
| `comb_vhd_name` | Combined disease name |
| `split_0` to `split_4` | Train/validation assignment for each of the 5 cross-validation folds |
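
As an illustration of how the split columns might be consumed, the sketch below groups processed filenames by whatever value appears in a given `split_k` column. The exact cell values (for example `train` and `val`) are not documented here, so the code makes no assumption about them; `files_by_split` is a hypothetical helper.

```python
import csv
from collections import defaultdict


def files_by_split(csv_file, fold):
    """Group the `rename` filenames by the value found in column split_<fold>."""
    column = f"split_{fold}"
    groups = defaultdict(list)
    with open(csv_file, newline="") as handle:
        for row in csv.DictReader(handle):
            groups[row[column]].append(row["rename"])
    return dict(groups)
```

Inspecting the keys of `files_by_split("src/prep/dataset/file_link_table.csv", 0)` shows which split labels the table actually uses.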

## Metadata File

After processing, a `metadata.csv` file is generated in the processed directory containing all the information from `file_link_table.csv` with updated filenames. This file can be used for:
- Loading data with proper labels
- Splitting data for cross-validation
- Filtering by patient demographics
- Analyzing dataset distribution
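
The following sketch shows one way to load `metadata.csv` and look at the label distribution using only the standard library. It assumes the binary label columns are stored as `0`/`1`, which should be checked against the real file; `load_rows`, `label_counts`, and `rows_with_label` are hypothetical names.

```python
import csv
from collections import Counter

LABELS = ("AS", "AR", "MR", "MS", "N")


def load_rows(metadata_csv):
    """Read metadata.csv into a list of dicts, one per recording."""
    with open(metadata_csv, newline="") as handle:
        return list(csv.DictReader(handle))


def label_counts(rows):
    """Count recordings that are positive for each label (assumes 0/1 cells)."""
    counts = Counter()
    for row in rows:
        for label in LABELS:
            if int(row[label]) == 1:
                counts[label] += 1
    return counts


def rows_with_label(rows, label):
    """Keep only the recordings that are positive for the given label."""
    return [row for row in rows if int(row[label]) == 1]
```

Because a recording can carry several valvular diseases at once, the per-label counts may sum to more than the number of recordings.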

## Scripts Overview

### `buet.py`

Main processing script that orchestrates the entire workflow.

**Features:**
- Checks if raw files are downloaded
- Processes files according to mapping table
- Generates metadata CSV
- Provides clear error messages

### `process_bmdhs.py`

Core processing script that handles file operations.

**Features:**
- File renaming and organization
- Metadata CSV generation
- File integrity verification
- Progress reporting
- Error handling

## Troubleshooting

### Missing Files

If you see "No .wav files found in download directory":

1. Verify you've downloaded the files from GitHub
2. Check that the files sit directly in `<download_dir>/`, not in a subdirectory
3. Ensure filenames match the expected format (e.g., `MD_001_sup_Mit.wav`)
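
To pinpoint which files are absent, the download directory can be compared against the mapping table. This is a sketch under the assumption that the `raw` column holds bare filenames including `.wav`; `compare_with_table` is a hypothetical helper and not part of the provided scripts.

```python
import csv
from pathlib import Path


def compare_with_table(download_dir, csv_file):
    """Return (listed but not on disk, on disk but not listed) filename lists."""
    with open(csv_file, newline="") as handle:
        expected = {row["raw"] for row in csv.DictReader(handle)}
    present = {p.name for p in Path(download_dir).glob("*.wav")}
    return sorted(expected - present), sorted(present - expected)
```

The second list is useful for spotting renamed or misplaced files that the mapping table does not know about.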

### File Size Issues

If you encounter file size mismatches during verification:

1. Re-download the affected files from GitHub
2. Check available disk space
3. Verify files were completely downloaded
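
A size comparison between each source file and its copy can be reproduced by hand as follows. This is only a sketch of the idea; how `--verify` is actually implemented is not shown in this README, and `size_mismatches` is a hypothetical helper.

```python
from pathlib import Path


def size_mismatches(pairs):
    """Return (source name, copy name) for copies that are missing or differ in size.

    `pairs` is an iterable of (source_path, copied_path) tuples.
    """
    bad = []
    for src, dst in pairs:
        src, dst = Path(src), Path(dst)
        if not dst.is_file() or src.stat().st_size != dst.stat().st_size:
            bad.append((src.name, dst.name))
    return bad
```

Matching sizes do not prove the contents are identical; comparing checksums (for example with `hashlib`) is a stricter check if corruption is suspected.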

### Processing Errors

If some files fail to process:

1. Check the console output for the names of the failing files and the detailed error messages
2. Verify those files exist in the download directory
3. Check file permissions

## Dataset Statistics

- **Total files**: ~865 WAV files
- **Patients**: ~108 patients
- **Recording positions**: 2 (supine, sitting)
- **Auscultation locations**: 4 (Mitral, Tricuspid, Pulmonary, Aortic)
- **Cross-validation folds**: 5

## Example Workflow

```bash
# 1. Create directory
mkdir -p /data/bmdhs

# 2. Download files from GitHub (manual step)
# Visit: https://github.com/sani002/BMD-HS-Dataset
# Download all .wav files directly to /data/bmdhs/

# 3. Verify files are downloaded
ls /data/bmdhs/*.wav | wc -l
# Should show ~865 files

# 4. Process the dataset
python src/prep/dataset/buet.py \
    --download_dir /data/bmdhs \
    --processed_dir /data/bmdhs_processed

# 5. Check processed files
ls /data/bmdhs_processed/*.wav | wc -l
# Should show ~865 files

# 6. View metadata
head -20 /data/bmdhs_processed/metadata.csv
```

## Citation

If you use this dataset, please cite the original paper:

```bibtex
@article{ali2024buet,
  title={BUET Multi-disease Heart Sound Dataset: A Comprehensive Auscultation Dataset for Developing Computer-Aided Diagnostic Systems},
  author={Ali, Shams Nafisa and Zahin, Afia and Shuvo, Samiul Based and Nizam, Nusrat Binta and Nuhash, Shoyad Ibn Sabur Khan and Razin, Sayeed Sajjad and Sani, SM and Rahman, Farihin and Nizam, Nawshad Binta and Azam, Farhat Binte and others},
  journal={arXiv preprint arXiv:2409.00724},
  year={2024}
}
```

## License

The BMD-HS dataset follows the license specified in the original repository: https://github.com/sani002/BMD-HS-Dataset

## Contact

For issues related to the dataset preparation scripts, please open an issue in this repository.

For questions about the original dataset, please contact the dataset authors through their GitHub repository.
