# Offline Data Usage Guide

## 📦 Sample Data

Pre-extracted sample data is provided for training without database access.

### Data Files

- **Location**: `data_dir/sample_data/sample_data.pkl`
- **Size**: ~990 MB
- **Number of samples**: 1390 (extracted from the full dataset)
- **Format**: Pickle format containing complete data samples

### Data Content

Each sample contains:
- Image data (empty placeholders for multimodal alignment)
- Text data (empty placeholders)
- Time series data (complete)
- Static feature data
- Label data (mortality_24h_48h, los_prediction_48h)

## 🚀 Usage

### Method 1: Using Offline Configuration (Recommended)

```bash
# Navigate to release directory
cd release

# Run training with offline configuration
python3 train_mortality_los_complete.py \
    --gpu 0 \
    --epochs 20 \
    --batch_size 256 \
    --max_samples 0 \
    --config exp/mimic_data/exp_mortality_24h48h_los_offline.yaml
```

### Method 2: Modify Configuration File

Edit `exp/mimic_data/exp_mortality_24h48h_los.yaml` and add:

```yaml
data:
  train_val:
    offline_data_path: data_dir/sample_data/sample_data.pkl
    db_path: null  # Set to null to use offline data
    use_data_cache: false
```

Then run normally:

```bash
python3 train_mortality_los_complete.py --gpu 0 --epochs 20
```
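
The way the `data.train_val` keys above select offline loading can be sketched in a few lines. This is only an illustration under assumptions: `uses_offline_data` is a hypothetical helper, the YAML is assumed to be already parsed into a nested dict, and the actual check inside the training script may differ.

```python
from typing import Any, Mapping


def uses_offline_data(config: Mapping[str, Any]) -> bool:
    """Return True when the config names an offline file and no database.

    Hypothetical helper: it mirrors the keys shown above, not the real loader.
    """
    train_val = config.get("data", {}).get("train_val", {})
    offline_path = train_val.get("offline_data_path")
    db_path = train_val.get("db_path")
    # Offline mode: an offline file is configured and db_path is null (None).
    return bool(offline_path) and db_path is None


# This dict corresponds to the YAML fragment above after parsing.
offline_cfg = {
    "data": {
        "train_val": {
            "offline_data_path": "data_dir/sample_data/sample_data.pkl",
            "db_path": None,
            "use_data_cache": False,
        }
    }
}
print(uses_offline_data(offline_cfg))
```

A check like this can be run before training to confirm that the edited file describes offline mode rather than a database connection.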

## 📋 Data Extraction

If you need to re-extract data or extract a different number of samples:

```bash
# Extract 5000 samples
python3 scripts/extract_sample_data.py \
    --num_samples 5000 \
    --output_dir data_dir/sample_data

# Extract a different number of samples into a separate directory
python3 scripts/extract_sample_data.py \
    --num_samples 10000 \
    --output_dir data_dir/sample_data_large
```

## ✅ Verification

Verify that offline data can be loaded correctly:

```python
from datapress.Aligned.offline_sample_dataset import OfflineSampleDataset

dataset = OfflineSampleDataset("data_dir/sample_data/sample_data.pkl")
print(f"Dataset size: {len(dataset)}")

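# Test loading a sample
sample = dataset[0]
# Print the keys if the sample is a dict; otherwise fall back to its type
keys = list(sample.keys()) if isinstance(sample, dict) else type(sample)
print(f"Sample keys: {keys}")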
```
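
For a closer look at a loaded sample, a small helper can summarize each field. The sketch below assumes a sample is a dict of arrays, lists, or scalars. `describe_sample` and the field names in `fake_sample` are made up for illustration and are not part of the project.

```python
from typing import Any, Dict


def describe_sample(sample: Any) -> Dict[str, str]:
    """Summarize each field of a sample as 'type(shape=...)' or 'type(len=...)'."""
    if not isinstance(sample, dict):
        # Non-dict samples (e.g. tuples) are summarized as a single entry.
        return {"<sample>": type(sample).__name__}
    summary = {}
    for key, value in sample.items():
        shape = getattr(value, "shape", None)  # NumPy arrays, tensors
        if shape is not None:
            detail = f"shape={tuple(shape)}"
        elif hasattr(value, "__len__"):
            detail = f"len={len(value)}"
        else:
            detail = "scalar"
        summary[str(key)] = f"{type(value).__name__}({detail})"
    return summary


# Stand-in sample with made-up field names, for illustration only.
fake_sample = {"time_series": [[0.0] * 4] * 168, "static": [1.0, 2.0], "label": 1}
for name, info in describe_sample(fake_sample).items():
    print(f"{name}: {info}")
```

To inspect real data, pass `dataset[0]` from the verification snippet in place of `fake_sample`.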

## 🔍 Data Statistics

- **Total samples**: 1390
- **Data format**: Fully compatible with original dataset
- **Labels**: Includes mortality_24h_48h and los_prediction_48h
- **Time series**: Complete time series features (168 time points)
- **Static features**: Complete static features

## ⚠️ Notes

1. **Data size**: The sample data file is ~990 MB; make sure you have sufficient disk space.
2. **Data limitation**: The sample data contains only 1390 samples; the full dataset may be larger.
3. **Performance**: Training performance with offline data should be similar to that with the database cache.
4. **Compatibility**: Offline data is fully compatible with the original data format, so no code modification is needed.
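
The disk-space note can be checked programmatically. This is a minimal sketch using the standard library. The 990 MB threshold is the approximate file size quoted above, and `has_enough_space` is a hypothetical helper name.

```python
import shutil

# ~990 MB, the approximate size of sample_data.pkl stated above.
REQUIRED_BYTES = 990 * 1024 * 1024


def has_enough_space(path: str = ".", required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


print(f"Enough free space: {has_enough_space('.')}")
```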

## 📝 File List

- `data_dir/sample_data/sample_data.pkl` - Sample data file (~990 MB)
- `data_dir/sample_data/sample_metadata.json` - Data metadata
- `exp/mimic_data/exp_mortality_24h48h_los_offline.yaml` - Offline configuration file
- `datapress/Aligned/offline_sample_dataset.py` - Offline data loader
- `scripts/extract_sample_data.py` - Data extraction script
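
A quick pre-flight check can confirm that the files listed above are in place. The sketch below assumes it is run from the project root, so that the relative paths resolve. `find_missing` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import List, Sequence

# Paths copied from the file list above, relative to the project root.
EXPECTED_FILES = (
    "data_dir/sample_data/sample_data.pkl",
    "data_dir/sample_data/sample_metadata.json",
    "exp/mimic_data/exp_mortality_24h48h_los_offline.yaml",
    "datapress/Aligned/offline_sample_dataset.py",
    "scripts/extract_sample_data.py",
)


def find_missing(root: str = ".", expected: Sequence[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected files that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]


missing = find_missing(".")
print("All files present" if not missing else f"Missing: {missing}")
```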

## 🎯 Quick Start

1. **Ensure data file exists**:
   ```bash
   ls -lh data_dir/sample_data/sample_data.pkl
   ```

2. **Run a short training job** (one epoch, small batch):
   ```bash
   python3 train_mortality_los_complete.py \
       --gpu 0 \
       --epochs 1 \
       --batch_size 32 \
       --max_samples 0
   ```
   
   The training script will automatically detect and use offline data.

3. **Verify that training starts correctly**:
   - Check that the log shows "Using offline sample data"
   - Training should start without a database connection

## 📊 Expected Results

When training with offline data, you should see:

```
Creating dataset...
Using offline sample data: data_dir/sample_data/sample_data.pkl
Loaded 1390 samples from offline data
Offline dataset size: 1390
```

Training then proceeds exactly as it does with the database.

## 🔧 Troubleshooting

### Issue 1: Data file not found

**Solution**: Ensure `data_dir/sample_data/sample_data.pkl` exists. The path is relative, so run the commands from the directory that contains `data_dir`.

```bash
ls -lh data_dir/sample_data/
```

### Issue 2: Out of memory

**Solution**: Reduce batch size or number of samples

```bash
python3 train_mortality_los_complete.py --batch_size 64 --max_samples 500
```

### Issue 3: Data loading error

**Solution**: Check that the data file is complete (not truncated or corrupted) by loading it:

```bash
python3 -c "import pickle; data = pickle.load(open('data_dir/sample_data/sample_data.pkl', 'rb')); print(f'Samples: {data[\"num_samples\"]}')"
```
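
The same check can be written as a script that reports a readable message instead of a traceback. This is a sketch under the same assumption as the one-liner above, namely that the pickle holds a dict with a `num_samples` key. `count_samples` is a hypothetical helper name.

```python
import pickle
from typing import Optional


def count_samples(path: str) -> Optional[int]:
    """Return the stored sample count, or None if the file cannot be read.

    Assumes the pickle holds a dict with a "num_samples" key.
    Only unpickle files that come from a trusted source.
    """
    try:
        with open(path, "rb") as handle:
            data = pickle.load(handle)
    except (OSError, EOFError, pickle.UnpicklingError) as exc:
        # Missing, truncated, or corrupted files end up here.
        print(f"Could not load {path}: {exc}")
        return None
    if not isinstance(data, dict) or "num_samples" not in data:
        print("File loaded, but it is not a dict with a 'num_samples' key.")
        return None
    return data["num_samples"]


print(count_samples("data_dir/sample_data/sample_data.pkl"))
```

A return value of `None` suggests the file should be downloaded or extracted again.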
