# Data Directory

This directory should contain the real-world datasets used in the paper.

## Dataset Download Instructions

| Dataset | Source | Download URL | Expected Filename |
|---------|--------|--------------|-------------------|
| HTRU2 | UCI ML Repository | https://archive.ics.uci.edu/ml/datasets/HTRU2 | `htru2.csv` |
| Credit Card Fraud | Kaggle | https://www.kaggle.com/mlg-ulb/creditcardfraud | `creditcard.csv` |
| Ionosphere | UCI ML Repository | https://archive.ics.uci.edu/ml/datasets/ionosphere | `ionosphere.csv` |
| Weekly | ISLR Package | https://www.statlearning.com/ | `weekly.csv` |

## Download Script

```bash
# HTRU2 (from UCI)
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip
unzip HTRU2.zip
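# The extracted CSV has no header row; prepend one
# (assumes the archive extracts a file named HTRU_2.csv)
# Columns: mean_ip, std_ip, ek_ip, sk_ip, mean_dm, std_dm, ek_dm, sk_dm, class
echo "mean_ip,std_ip,ek_ip,sk_ip,mean_dm,std_dm,ek_dm,sk_dm,class" | cat - HTRU_2.csv > htru2.csv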

# Credit Card (requires Kaggle account)
kaggle datasets download -d mlg-ulb/creditcardfraud
unzip creditcardfraud.zip

# Ionosphere (from UCI)
wget https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data
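# The file has no header row; prepend one (34 features + class column)
# (f1..f34 are placeholder names, not taken from the original source)
{ printf 'f%d,' $(seq 1 34); echo "class"; } | cat - ionosphere.data > ionosphere.csv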

# Weekly (from ISLR R package or website)
# Export from R: library(ISLR); write.csv(Weekly, "weekly.csv", row.names=FALSE)
```
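
The header conversion noted in the script above can also be sketched in Python. This is a minimal sketch under stated assumptions: the input names `HTRU_2.csv` and `ionosphere.data` are assumed, and the Ionosphere column names `f1` to `f34` are placeholders, since the raw file provides no names.

```python
import csv
from pathlib import Path

# Column names for HTRU2, as listed in the download script.
HTRU2_COLUMNS = ["mean_ip", "std_ip", "ek_ip", "sk_ip",
                 "mean_dm", "std_dm", "ek_dm", "sk_dm", "class"]

# Placeholder names (assumption): 34 features plus the class column.
IONOSPHERE_COLUMNS = [f"f{i}" for i in range(1, 35)] + ["class"]


def add_header(src, dst, columns):
    """Copy a header-less comma-separated file to dst with a header row."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(columns)
        for row in csv.reader(fin):
            if not row:  # skip blank lines
                continue
            if len(row) != len(columns):
                raise ValueError(
                    f"{src}: expected {len(columns)} fields, got {len(row)}"
                )
            writer.writerow(row)


if __name__ == "__main__":
    # Assumed input file names; files that are absent are skipped.
    jobs = [
        ("HTRU_2.csv", "htru2.csv", HTRU2_COLUMNS),
        ("ionosphere.data", "ionosphere.csv", IONOSPHERE_COLUMNS),
    ]
    for src, dst, cols in jobs:
        if Path(src).exists():
            add_header(src, dst, cols)
```

The field-count check is there so that a file with the wrong layout fails loudly instead of producing a misaligned CSV.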

## Expected Directory Structure

```
data/
├── README.md          (this file)
├── htru2.csv
├── creditcard.csv
├── ionosphere.csv
└── weekly.csv
```
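
A quick way to confirm the layout above is to list any expected file that is absent. The following is a minimal sketch; the helper name `missing_files` and the default directory `data` are illustrative choices, not part of the repository.

```python
from pathlib import Path

# File names from the expected directory structure.
EXPECTED_FILES = ["htru2.csv", "creditcard.csv", "ionosphere.csv", "weekly.csv"]


def missing_files(data_dir="data"):
    """Return the expected dataset files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files are present.")
```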

## MD5 Checksums (for verification)

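After downloading, generate a checksum for each file and check that its size is roughly as expected. No reference MD5 values are listed here, so the checksums are mainly useful for confirming that two copies of a file are identical:

```bash
# Generate checksums
md5sum *.csv

# Expected file sizes (approximate; may vary by download source):
# htru2.csv:      ~2MB
# creditcard.csv: ~150MB
# ionosphere.csv: ~25KB
# weekly.csv:     ~10KB
```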
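
Where `md5sum` is not available, the same check can be sketched with Python's standard `hashlib`. The function names `md5_of` and `checksum_report` are illustrative, and the report includes the size in bytes for comparison with the approximate sizes above.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_report(data_dir="."):
    """Map each *.csv file name in data_dir to (md5 hex digest, size in bytes)."""
    return {
        p.name: (md5_of(p), p.stat().st_size)
        for p in sorted(Path(data_dir).glob("*.csv"))
    }


if __name__ == "__main__":
    for name, (digest, size) in checksum_report().items():
        print(f"{digest}  {name}  ({size} bytes)")
```

Reading in chunks matters mostly for `creditcard.csv`, which is the only file large enough to make a single full read noticeable.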

## Dataset Details

### HTRU2 (High Time Resolution Universe Survey)
- **Task**: Pulsar detection from radio telescope data
- **Size**: 17,898 samples, 8 features
- **Classes**: Non-pulsar (0) vs Pulsar (1)
- **Imbalance**: 90.8% / 9.2%
- **Heavy tails**: Pulsars exhibit heavy-tailed signal characteristics

### Credit Card Fraud
- **Task**: Fraud detection in credit card transactions
- **Size**: 284,807 samples, 30 features (28 PCA + Time + Amount)
- **Classes**: Normal (0) vs Fraud (1)
- **Imbalance**: 99.83% / 0.17%
- **Heavy tails**: Fraud transactions show extreme feature values

### Ionosphere
- **Task**: Radar return classification
- **Size**: 351 samples, 34 features
- **Classes**: Bad (b) vs Good (g)
- **Imbalance**: 35.9% / 64.1%
- **Heavy tails**: Radar signals exhibit heavy-tailed amplitude distributions

### Weekly Stock Returns
- **Task**: Market direction prediction
- **Size**: 1,089 samples, 8 columns (Lag1-Lag5, Volume, Today, Direction), where Direction is the class label; the ISLR data frame also has a Year column
- **Classes**: Down vs Up
- **Imbalance**: 44.4% / 55.6%
- **Heavy tails**: Financial returns are well-known to be heavy-tailed
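
The imbalance percentages and heavy-tail remarks above can be spot-checked once the CSV files are in place. The sketch below is a rough check under assumptions, not the paper's analysis: it computes class shares from a label column and uses excess kurtosis (positive values suggest heavier tails than a Gaussian) as one simple tail indicator. The label column name depends on the file, for example `class` for a header added as in the download script, or `Direction` for `weekly.csv`.

```python
import csv
from collections import Counter


def class_shares(path, label_column):
    """Return {label: percentage of rows} for one column of a CSV with a header."""
    with open(path, newline="") as f:
        counts = Counter(row[label_column] for row in csv.DictReader(f))
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"{path}: no data rows")
    return {label: 100.0 * n / total for label, n in counts.items()}


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a Gaussian)."""
    xs = [float(v) for v in values]
    n = len(xs)
    if n == 0:
        raise ValueError("no values")
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    if m2 == 0:
        raise ValueError("constant column has undefined kurtosis")
    return m4 / (m2 * m2) - 3.0


if __name__ == "__main__":
    # Assumes htru2.csv exists with the header from the download script.
    print(class_shares("htru2.csv", "class"))
```

Excess kurtosis is only one indicator and is sensitive to outliers, so it is best read alongside other tail diagnostics.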

## Citation

If using these datasets, please cite the original sources:

```bibtex
@misc{htru2,
  author = {Lyon, R.J.},
  title = {HTRU2},
  year = {2016},
  howpublished = {UCI Machine Learning Repository}
}

@misc{creditcard,
  author = {Dal Pozzolo, A. and others},
  title = {Credit Card Fraud Detection},
  year = {2015},
  howpublished = {Kaggle}
}

@misc{ionosphere,
  author = {Sigillito, V.G. and others},
  title = {Ionosphere Data Set},
  year = {1989},
  howpublished = {UCI Machine Learning Repository}
}

@book{islr,
  author = {James, G. and Witten, D. and Hastie, T. and Tibshirani, R.},
  title = {An Introduction to Statistical Learning},
  year = {2013},
  publisher = {Springer}
}
```
