# Hierarchical Bayesian Logistic Regression (BLR) Experiments

This directory contains a Hierarchical Bayesian Logistic Regression (BLR) model for binary classification using the UCI Covertype dataset, along with convergence experiments using the Stein Variational Gradient Descent (SVGD) algorithm.

The goal is to approximate the posterior distribution of the hierarchical BLR model's parameters using SVGD and evaluate its convergence against a "ground truth" posterior generated by MCMC.

## Recent Updates

- **Parameterization**: Changed from `[β, τ]` to `[β, log(τ)]` for better numerical stability
- **Prior Parameters**: Updated default prior parameters to `α_prior = 1.0, β_prior = 0.01`
- **Likelihood Tracking**: Added per-sample negative log likelihood (NLL) tracking in SVGD
- **Evaluation Metrics**: Replaced Gaussian KL divergence with KDE-based KL divergence and MMD
- **Analysis Tools**: Added comprehensive MCMC analysis tools for likelihood and accuracy evaluation
- **Cache Management**: Python and R cache files are now automatically excluded via `.gitignore`

## Model Structure

The hierarchical Bayesian model follows the structure described in the paper (Section 5.3):

- **Global precision parameter**: τ ~ Gamma(α_prior, β_prior)
- **Regression coefficients**: β_d ~ Normal(0, 1/τ) for d = 1, ..., D, where 1/τ is the variance
- **Likelihood**: y_n ~ Bernoulli(σ(X_n^T β)) for n = 1, ..., N, where σ(z) = 1 / (1 + exp(-z)) is the inverse logit (sigmoid) function

This hierarchical structure allows the model to learn the appropriate regularization strength from the data.
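The generative process above can be sketched as a short NumPy simulation. This is an illustrative sketch only, not code from `model.py`; the function name `simulate_hierarchical_blr` and the synthetic design matrix are assumptions made for the example.

```python
import numpy as np

def simulate_hierarchical_blr(X, alpha_prior=1.0, beta_prior=0.01, seed=0):
    """Draw (tau, beta, y) from the hierarchical model for a fixed design matrix X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # NumPy's gamma sampler takes a *scale*; Gamma(alpha, beta) here uses a *rate*.
    tau = rng.gamma(shape=alpha_prior, scale=1.0 / beta_prior)
    # beta_d ~ Normal(0, 1/tau): variance 1/tau, so standard deviation tau ** -0.5.
    beta = rng.normal(loc=0.0, scale=tau ** -0.5, size=d)
    # Success probability is the sigmoid of the linear predictor (stable tanh form).
    p = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))
    y = rng.binomial(1, p)
    return tau, beta, y

# Hypothetical synthetic inputs with the same dimensionality as Covertype (54 features).
X_demo = np.random.default_rng(1).normal(size=(200, 54))
tau_demo, beta_demo, y_demo = simulate_hierarchical_blr(X_demo)
```

Note the rate-versus-scale convention: passing `beta_prior` directly as `scale` would silently sample from a very different prior.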

### Parameterization

For numerical stability, the model uses a log-transformed precision parameter:
- **Parameters**: θ = [β_1, ..., β_D, log(τ)]
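- **Transformation**: τ is recovered as exp(log(τ)), so it stays strictly positive while inference works on an unconstrained value
- **Prior for log(τ)**: Obtained from Gamma(α_prior, β_prior) by a change of variables, which adds the log-Jacobian term log(τ) to the log-density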
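Writing η = log(τ), the change of variables gives log p(η) = α log β - log Γ(α) + α η - β exp(η), where α and β stand for `α_prior` and `β_prior`. The sketch below implements this density and its derivative; the function names are hypothetical and are not taken from `model.py`.

```python
import math
import numpy as np

def log_prior_log_tau(eta, alpha_prior=1.0, beta_prior=0.01):
    """Log-density of eta = log(tau) when tau ~ Gamma(alpha_prior, rate=beta_prior)."""
    eta = np.asarray(eta, dtype=float)
    # Gamma log-density at tau = exp(eta), plus the log-Jacobian log|d tau / d eta| = eta.
    return (alpha_prior * math.log(beta_prior) - math.lgamma(alpha_prior)
            + alpha_prior * eta - beta_prior * np.exp(eta))

def grad_log_prior_log_tau(eta, alpha_prior=1.0, beta_prior=0.01):
    """Derivative of the log-density above with respect to eta."""
    return alpha_prior - beta_prior * np.exp(np.asarray(eta, dtype=float))
```

Dropping the Jacobian term would leave α - 1 in place of α in both functions, which is an easy mistake to make when porting a τ-parameterized model.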

### Default Prior Parameters

- **α_prior = 1.0**: Shape parameter for Gamma prior
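- **β_prior = 0.01**: Rate parameter for Gamma prior (updated from 0.1). The smaller rate gives a more diffuse prior on τ, with prior mean α_prior / β_prior = 100, so the regularization strength is less constrained a priori.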

## Dataset

- **Source**: UCI Covertype Dataset (binary version)
  - **Primary**: `covtype.libsvm.binary.scale.bz2` - Pre-scaled libsvm format binary dataset
  - **Fallback**: Original UCI Covertype dataset converted to binary classification
- **Features**: 54 dimensions
- **Classes**: 2 (binary classification)
- **Samples**: Full dataset used by default (libsvm format), or subset of 10,000 samples (fallback method)

The libsvm format dataset (`covtype.libsvm.binary.scale.bz2`) is the primary dataset used in the paper. This file contains pre-scaled binary classification data that matches the experimental setup described in the paper. The data is automatically loaded when available, with fallback to the original UCI Covertype dataset if the libsvm file is not found.
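For readers unfamiliar with the libsvm format, the following is a minimal, dependency-light parser sketch. It is not the loader used by the repository scripts; the function name `load_libsvm_bz2` and the label mapping (larger label becomes class 1) are assumptions. For the full dataset, scikit-learn's `load_svmlight_file`, which reads `.bz2` files directly, is likely a faster choice.

```python
import bz2
import os
import tempfile
import numpy as np

def load_libsvm_bz2(path, n_features=54):
    """Parse a bz2-compressed libsvm file into a dense feature matrix and 0/1 labels."""
    rows, labels = [], []
    with bz2.open(path, "rt") as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue
            labels.append(float(parts[0]))
            row = np.zeros(n_features)
            for item in parts[1:]:
                idx, val = item.split(":")
                row[int(idx) - 1] = float(val)  # libsvm feature indices are 1-based
            rows.append(row)
    X = np.vstack(rows)
    raw = np.asarray(labels)
    # Assumption: exactly two distinct labels; the larger one is mapped to class 1.
    y = (raw == raw.max()).astype(int)
    return X, y

# Tiny self-made file so the sketch runs without the real dataset.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.libsvm.bz2")
with bz2.open(demo_path, "wt") as fh:
    fh.write("1 1:0.5 3:-1\n2 2:0.25\n")
X_demo, y_demo = load_libsvm_bz2(demo_path, n_features=3)
```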

## Structure

### Core Files
- `blr_model.stan`: The Stan model definition for Hierarchical Bayesian Logistic Regression. It is used by `run_mcmc.py` to generate the reference posterior distribution via Hamiltonian Monte Carlo (HMC).
- `model.py`: A Python implementation of the hierarchical BLR model with log(τ) parameterization. It provides the necessary log-posterior gradient computations required by the SVGD algorithm.
- `SVGD.py`: A general-purpose implementation of the Stein Variational Gradient Descent (SVGD) algorithm, featuring an RBF kernel, AdaGrad optimizer, and likelihood tracking capabilities.

### Experiment Scripts
- `run_mcmc.py`: An execution script that uses `cmdstanpy` to run MCMC sampling on the `blr_model.stan`, producing the ground truth posterior parameters.
- `running_experiments_decay.py`: The main script for executing SVGD experiments with learning rate decay. It utilizes the model from `model.py` and the algorithm from `SVGD.py` to approximate the posterior.

### Analysis Tools
- `load_mcmc_results.py`: Utilities for loading and analyzing MCMC results from pickle files.
- `analyze_mcmc_likelihood_accuracy.py`: Comprehensive analysis of MCMC samples including likelihood calculation and prediction accuracy evaluation.
- `plot.ipynb`: A Jupyter notebook for visualizing and analyzing the experiment results, such as NLL, KSD, and KDE-KL convergence over iterations.

### Documentation
- `README.md`: This file.

## Setup

### Required Dependencies

```bash
pip install numpy pandas scikit-learn cmdstanpy tqdm matplotlib seaborn jupyter
```

**Note**: Cache files (`__pycache__/`, `.pyc`, `.Rhistory`, etc.) are automatically excluded via `.gitignore`.

### Installing CmdStan

CmdStan is a dependency for `cmdstanpy` and is required to run the MCMC sampling.

```bash
# CmdStanPy can automatically install the latest version of CmdStan
python -c "import cmdstanpy; cmdstanpy.install_cmdstan()"
```

## Usage

The workflow consists of two main steps, followed by an optional analysis step:

### 1. Generate True Posterior via MCMC

First, run MCMC using `cmdstanpy` to estimate the true posterior distribution parameters. This serves as the ground truth for evaluating SVGD.

```bash
python run_mcmc.py
```

This script performs the following:
- Loads and preprocesses the binary UCI Covertype dataset (libsvm format if available, fallback to original dataset).
- Compiles the Stan model defined in `blr_model.stan`.
- Executes MCMC sampling (default: 2000 samples, 1000 warmup, 4 chains).
- Saves the MCMC samples and calculated true posterior parameters (mean and covariance) to `mcmc_results_*.pkl`.

### 2. Execute SVGD Experiments

Once the MCMC results are available, run the SVGD experiments to approximate the posterior.

```bash
python running_experiments_decay.py
```

This script performs the following:
- Loads the true posterior parameters from the MCMC results file.
- Runs SVGD experiments for different numbers of particles (e.g., 5, 10, 20, 50).
- For each experiment, it computes NLL, KDE-KL divergence, MMD, and KSD against the true posterior.
- Saves the comprehensive results to `svgd_results_*.pkl`.

### 3. Analyze MCMC Results (Optional)

For detailed analysis of MCMC samples including likelihood and accuracy evaluation:

```bash
python analyze_mcmc_likelihood_accuracy.py
```

This script provides:
- Per-sample likelihood calculation and analysis
- Prediction accuracy evaluation on training and test data
- Comprehensive visualization of MCMC results
- Comparison with reference values
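The likelihood and accuracy calculations can be sketched as follows. This is a hedged illustration rather than the code in `analyze_mcmc_likelihood_accuracy.py`: the function name `evaluate_samples` is hypothetical, and "per-sample" is interpreted here as the NLL averaged over data points, computed separately for each posterior draw.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def evaluate_samples(beta_samples, X, y):
    """beta_samples: (S, D) posterior draws; X: (N, D) features; y: (N,) labels in {0, 1}."""
    logits = X @ beta_samples.T                          # (N, S)
    # log p(y | x, beta) is log_sigmoid(z) for y = 1 and log_sigmoid(-z) for y = 0.
    signed = np.where(y[:, None] == 1, logits, -logits)
    per_draw_nll = -log_sigmoid(signed).mean(axis=0)     # mean NLL per data point, one per draw
    # Posterior predictive probability: average the sigmoid over draws.
    p_mean = np.exp(log_sigmoid(logits)).mean(axis=1)
    accuracy = float(np.mean((p_mean > 0.5) == (y == 1)))
    return per_draw_nll, accuracy
```

A useful sanity check is that all-zero coefficients give an NLL of log(2) ≈ 0.693 per data point, the value for a classifier that always predicts probability 0.5.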

### Command Line Options

#### run_mcmc.py
```bash
# Example: Run with 2000 samples, 1000 warmup, 4 chains, and the same priors as the SVGD run
python run_mcmc.py --n_samples 2000 --n_warmup 1000 --chains 4 --random_seed 42 --alpha_prior 1.0 --beta_prior 0.01
```

#### running_experiments_decay.py
```bash
# Example: Run with a specific number of iterations and updated priors
python running_experiments_decay.py --n_iterations 10000 --alpha_prior 1.0 --beta_prior 0.01
```

## Model and Algorithm

### Hierarchical Bayesian Logistic Regression

The posterior distribution is defined in two complementary ways:
1. **Stan Model (`blr_model.stan`)**: A formal probabilistic model definition used with the NUTS sampler in Stan to generate a high-fidelity reference posterior.
2. **Python Model (`model.py`)**: A Python implementation that provides the gradient of the log-posterior probability, which is required by the SVGD algorithm.

The hierarchical structure includes:
- A global precision parameter τ with a Gamma prior
- Regression coefficients β with a Normal prior that depends on τ
- Binary classification likelihood using the logistic link function
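Under the θ = [β, log(τ)] parameterization, the unnormalized log-posterior and its gradient can be sketched as below. This is an assumption-laden illustration using the full dataset in each evaluation; `model.py` may differ in details such as minibatching, and the function names are hypothetical.

```python
import numpy as np

def log_posterior(theta, X, y, alpha_prior=1.0, beta_prior=0.01):
    """Unnormalized log-posterior for theta = [beta_1, ..., beta_D, log(tau)]."""
    beta, eta = theta[:-1], theta[-1]
    tau = np.exp(eta)
    z = X @ beta
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))         # Bernoulli with sigmoid link
    log_prior_beta = 0.5 * beta.size * eta - 0.5 * tau * beta @ beta
    log_prior_eta = alpha_prior * eta - beta_prior * tau   # includes the log-Jacobian
    return log_lik + log_prior_beta + log_prior_eta

def grad_log_posterior(theta, X, y, alpha_prior=1.0, beta_prior=0.01):
    """Gradient of log_posterior with respect to theta."""
    beta, eta = theta[:-1], theta[-1]
    tau = np.exp(eta)
    p = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))            # stable sigmoid
    grad_beta = X.T @ (y - p) - tau * beta
    grad_eta = (0.5 * beta.size - 0.5 * tau * beta @ beta
                + alpha_prior - beta_prior * tau)
    return np.append(grad_beta, grad_eta)
```

Comparing the analytic gradient against central finite differences on a small random problem is a cheap way to catch sign or Jacobian errors before running SVGD.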

### Stein Variational Gradient Descent (SVGD)

SVGD is a particle-based variational inference method that iteratively updates a set of particles to approximate the target posterior distribution. The algorithm uses:
- **Kernel**: RBF kernel with adaptive bandwidth selection
- **Optimizer**: AdaGrad with momentum (can be disabled for vanilla SGD)
- **Learning rate decay**: Configurable decay schedules for better convergence
- **Likelihood tracking**: Per-sample negative log likelihood (NLL) monitoring during training
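The components listed above can be combined into a compact SVGD loop. The sketch below is not the implementation in `SVGD.py`: the median-heuristic bandwidth, the AdaGrad-with-momentum update, and the polynomial `decay` schedule are common choices assumed here for illustration, and all function and parameter names are hypothetical.

```python
import numpy as np

def rbf_kernel(particles):
    """RBF kernel matrix and repulsion term, with a median-heuristic bandwidth."""
    diffs = particles[:, None, :] - particles[None, :, :]   # diffs[i, j] = x_i - x_j
    sq_dists = np.sum(diffs ** 2, axis=-1)
    h2 = max(np.median(sq_dists) / np.log(particles.shape[0] + 1), 1e-12)
    K = np.exp(-sq_dists / h2)
    # Sum over j of the gradient of k(x_j, x_i) with respect to x_j.
    repulsion = (2.0 / h2) * np.einsum("ij,ijd->id", K, diffs)
    return K, repulsion

def svgd_direction(particles, grad_log_p):
    """SVGD update direction: kernel-smoothed scores plus a repulsive term."""
    scores = np.stack([grad_log_p(x) for x in particles])
    K, repulsion = rbf_kernel(particles)
    return (K @ scores + repulsion) / particles.shape[0]

def run_svgd(particles, grad_log_p, n_iter=500, step_size=0.05,
             momentum=0.9, decay=0.0, eps=1e-6):
    x = particles.copy()
    hist = np.zeros_like(x)
    for t in range(n_iter):
        phi = svgd_direction(x, grad_log_p)
        # AdaGrad with momentum: running average of squared update directions.
        hist = phi ** 2 if t == 0 else momentum * hist + (1 - momentum) * phi ** 2
        lr = step_size / (1.0 + t) ** decay   # assumed schedule; decay=0 keeps lr constant
        x = x + lr * phi / (eps + np.sqrt(hist))
    return x
```

As a usage example, `run_svgd(init, lambda x: -(x - 2.0), n_iter=1000, step_size=0.2, decay=0.5)` with `init` drawn from a standard normal moves the particles toward a unit-variance Gaussian centred at 2.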

## Evaluation Metrics

The experiments evaluate SVGD convergence using several metrics:

1. **Negative Log Likelihood (NLL)**: Per-sample negative log likelihood for monitoring convergence
2. **KDE-based KL Divergence**: Non-parametric KL divergence estimation using kernel density estimation
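3. **Maximum Mean Discrepancy (MMD)**: Kernel-based two-sample discrepancy between the SVGD particles and the MCMC reference samples
4. **Kernel Stein Discrepancy (KSD)**: Kernel-based discrepancy computed from the particles and the gradient of the log-posterior, so it does not require reference samples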
5. **Eigenvalue Analysis**: Examines the spectrum of the kernel matrix to understand the algorithm's behavior
6. **Prediction Accuracy**: Classification accuracy on test data using the learned parameters
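To make the difference between MMD and KSD concrete, the following sketch computes simple V-statistic estimates of both with an RBF kernel k(x, x') = exp(-||x - x'||² / (2h²)). The fixed bandwidth `h` and the function names are assumptions for illustration; the repository's estimators may use different kernels, bandwidth rules, or unbiased variants.

```python
import numpy as np

def _sq_dists(a, b):
    return np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)

def mmd2_rbf(x, y, h=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets x and y."""
    k = lambda a, b: np.exp(-_sq_dists(a, b) / (2.0 * h ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def ksd2_rbf(x, score, h=1.0):
    """V-statistic estimate of squared KSD; score(x) returns grad log p at each row of x."""
    n, d = x.shape
    s = score(x)                                   # (n, d)
    r = x[:, None, :] - x[None, :, :]              # r[i, j] = x_i - x_j
    sq = np.sum(r ** 2, axis=-1)
    k = np.exp(-sq / (2.0 * h ** 2))
    s_r = np.einsum("id,ijd->ij", s, r)            # s(x_i) . (x_i - x_j)
    r_s = np.einsum("ijd,jd->ij", r, s)            # (x_i - x_j) . s(x_j)
    # Stein kernel: s.s' k + s.grad_x' k + grad_x k . s' + trace(grad_x grad_x' k).
    u = k * (s @ s.T + (s_r - r_s) / h ** 2 + d / h ** 2 - sq / h ** 4)
    return u.mean()
```

MMD needs two sample sets (for example particles and MCMC draws), whereas KSD needs only the particles and a score function such as the log-posterior gradient.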

## Results

Results are saved in the `results/` directory with the following naming convention:
- `mcmc_results_*.pkl`: MCMC posterior samples and parameters
- `svgd_results_n{particles}_beta{decay}_iter{iterations}.pkl`: SVGD experiment results

Each SVGD result file contains:
- Final particle positions
- NLL convergence history
- KDE-KL divergence and MMD convergence history
- KSD convergence history
- Eigenvalue analysis
- Model parameters and settings
- True posterior parameters for comparison

Each MCMC result file contains:
- MCMC samples array
- True posterior mean and precision matrix
- Sampling parameters and settings

## Notes

- The libsvm format dataset (`covtype.libsvm.binary.scale.bz2`) is the primary dataset used in the paper and provides the most accurate reproduction of the experimental setup
- The binary classification setup provides a simpler but still challenging problem for evaluating SVGD convergence
- The hierarchical structure allows the model to adapt its regularization strength to the data
- The log(τ) parameterization provides better numerical stability compared to direct τ parameterization
- Updated prior parameters (β_prior = 0.01) provide more flexible regularization
- MCMC convergence should be verified before using the results for SVGD evaluation
- The default hyperparameters are chosen to balance computational efficiency with accuracy
- The libsvm data is pre-scaled, so no additional standardization is applied
- NLL tracking provides real-time convergence monitoring during SVGD training
- KDE-based KL divergence and MMD provide more robust evaluation metrics than Gaussian approximations
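Regarding the note on verifying MCMC convergence, one common diagnostic is the split R-hat statistic, where values close to 1 suggest the chains agree. The sketch below is a plain NumPy version for a single scalar parameter, written for illustration with a hypothetical function name; in practice CmdStanPy's `fit.summary()` and `fit.diagnose()` report convergence diagnostics directly.

```python
import numpy as np

def split_rhat(draws):
    """Split R-hat for one scalar parameter; draws has shape (n_chains, n_draws)."""
    half = draws.shape[1] // 2
    # Split every chain in two so that within-chain drift also inflates the statistic.
    halves = np.concatenate([draws[:, :half], draws[:, half:2 * half]], axis=0)
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()          # mean within-chain variance
    between = n * halves.mean(axis=1).var(ddof=1)       # variance of chain means, scaled
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))
```

Applying `split_rhat` to each column of β and to log(τ) gives a quick per-parameter check before the MCMC samples are used as the reference posterior.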

## Troubleshooting

### CmdStanPy Errors
If you encounter issues with CmdStan, try reinstalling it:
```bash
python -c "import cmdstanpy; cmdstanpy.install_cmdstan(overwrite=True, compiler=True)"
```

### Memory Issues
- Reduce the number of MCMC samples (e.g., `--n_samples 500`).
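- Use a smaller subset of the data (e.g., by modifying `n_samples` in the data loading section of the scripts). This data-subset size is separate from the `--n_samples` flag, which sets the number of MCMC draws.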

### Dependency Issues
Ensure all packages are up-to-date:
```bash
pip install --upgrade numpy pandas scikit-learn cmdstanpy tqdm matplotlib seaborn jupyter
```

### Analysis Tools
For comprehensive analysis of results, use the provided analysis tools:
```bash
# Load and analyze MCMC results
python analyze_mcmc_likelihood_accuracy.py

# Use Jupyter notebook for detailed visualization
jupyter notebook plot.ipynb
``` 