# Mixed Diffusion: A Unified Framework for Denoising Single-Cell Data

A PyTorch implementation of mixed diffusion models for denoising and analyzing single-cell genomics data. The framework combines diffusion models with Gibbs sampling to denoise high-dimensional biological data, with a focus on single-cell RNA sequencing (scRNA-seq) and CITE-seq data.

## Overview

Mixed Diffusion addresses noise in single-cell genomics data by learning the underlying data distribution through a diffusion process. The model handles both synthetic and real biological datasets, and its denoised output can feed downstream analyses such as clustering and trajectory inference.

## Installation

### Prerequisites

- Python 3.8+
- PyTorch 1.9+
- CUDA-compatible GPU (recommended)

### Setup

1. Clone the repository:
```bash
git clone <repository-url>
cd mixed_diffusion
```

2. Install the package:
```bash
pip install -e .
```

3. Install additional preprocessing dependencies:
```bash
pip install -r requirements-preprocessing.txt
```

## Quick Start

### Basic Usage

Train on single-cell data:
```bash
python scripts/main.py --dataset single_cell --data_config data/CITEseq/15pca.json --from_scratch --result_dir results/pbmc --model TabularDiffusionMLP --gibbs_iterations 50
```


## Datasets

### Single-Cell Genomics
- **PBMC**: Peripheral blood mononuclear cells
- **Cortex**: Mouse cortex single-cell data
- **Pancreas**: Pancreatic islet cells
- **CITE-seq**: Combined protein and RNA measurements

Preprocessing scripts for each dataset are located in that dataset's folder. Input data must be provided in `.h5ad` (AnnData) format.

## Model Architectures

### TabularDiffusionMLP
A multi-layer perceptron for tabular single-cell data, with time embedding and residual connections.
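
To make "time embedding and residual connections" concrete, here is a rough sketch of how such an MLP is commonly assembled in PyTorch. It is not the repository's `TabularDiffusionMLP`: the class names, layer sizes, activation, and the sinusoidal embedding are assumptions chosen for illustration.

```python
import math

import torch
from torch import nn


def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps of shape (batch,) to (batch, dim) features; dim must be even."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32, device=t.device) / half
    )
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)


class ResidualBlock(nn.Module):
    """Two linear layers with a time-conditioned hidden state and a skip connection."""

    def __init__(self, hidden_dim: int, time_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.time_proj = nn.Linear(time_dim, hidden_dim)
        self.act = nn.SiLU()

    def forward(self, h: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        out = self.act(self.fc1(h) + self.time_proj(t_emb))
        return h + self.fc2(out)  # residual connection


class SketchDiffusionMLP(nn.Module):
    """Hypothetical stand-in: maps a noisy row x_t and timestep t to a same-sized output."""

    def __init__(self, data_dim: int, hidden_dim: int = 128, time_dim: int = 32, num_blocks: int = 3):
        super().__init__()
        self.time_dim = time_dim
        self.input_proj = nn.Linear(data_dim, hidden_dim)
        self.blocks = nn.ModuleList(
            [ResidualBlock(hidden_dim, time_dim) for _ in range(num_blocks)]
        )
        self.output_proj = nn.Linear(hidden_dim, data_dim)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_emb = sinusoidal_embedding(t, self.time_dim)
        h = self.input_proj(x)
        for block in self.blocks:
            h = block(h, t_emb)
        return self.output_proj(h)


# Usage: a batch of 8 cells with 15 features (e.g. 15 principal components).
model = SketchDiffusionMLP(data_dim=15)
x = torch.randn(8, 15)
t = torch.randint(0, 1000, (8,))
out = model(x, t)
print(out.shape)
```

Injecting the time embedding into every block, rather than only at the input, is one common design choice; it lets each layer adapt to the current noise level.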


## Advanced Usage

### Grid Search Optimization

Run hyperparameter optimization:
```bash
python scripts/grid_search_synthetic.py --output_dir results/grid_search
```
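
As a generic illustration of what a grid search does, the sketch below enumerates every combination of a small hyperparameter grid and picks the best-scoring one. It does not reproduce `grid_search_synthetic.py`: `learning_rate` and the `toy_score` function are hypothetical placeholders, and only `gibbs_iterations` corresponds to a flag shown in this README.

```python
from itertools import product


def expand_grid(grid):
    """Return one config dict per combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def toy_score(cfg):
    # Placeholder objective (assumption): a real search would train the model
    # with this config and return a validation metric instead.
    return -abs(cfg["learning_rate"] - 1e-3) - 0.001 * abs(cfg["gibbs_iterations"] - 50)


grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],  # hypothetical hyperparameter
    "gibbs_iterations": [10, 50, 100],
}

configs = expand_grid(grid)
best = max(configs, key=toy_score)
print(len(configs), best)
```

The number of configurations grows multiplicatively with each added hyperparameter, so keeping the grid small matters when each evaluation involves a full training run.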

### Clustering Analysis

Generate clustering metrics with R integration:
```bash
# Run denoising with data saving
python scripts/main.py --dataset pbmc --save_data --result_dir results/pbmc

# Run R clustering analysis
Rscript scripts/clustering_metrics.R results/pbmc/
```

### Visualization and Analysis

Visualize training progress:
```bash
python scripts/show_diffusion.py --result_dir results/pbmc --dataset pbmc --grid_plot --show
```

Analyze saved results:
```bash
python scripts/visualize_saved_results.py --result_dir results/pbmc
```

## Project Structure

```
mixed_diffusion/
├── src/mixed_diffusion/        # Core library
│   ├── models/                 # Model architectures
│   ├── data_loading/           # Data loading utilities
│   ├── preprocessing/          # Data preprocessing
│   ├── evaluation/             # Metrics and evaluation
│   ├── sampling.py             # Gibbs sampling implementation
│   └── visualize.py            # Visualization tools
├── scripts/                    # Executable scripts
│   ├── main.py                 # Main training/inference script
│   ├── grid_search_*.py        # Hyperparameter optimization
│   └── clustering_metrics_*.R  # R analysis scripts
├── data/                       # Dataset storage
└── results/                    # Output directory
```

## Key Scripts

- **`scripts/main.py`**: Primary entry point for training and inference
- **`scripts/grid_search_synthetic.py`**: Automated hyperparameter search
- **`scripts/analyze_grid_search_results.py`**: Analysis of optimization results
- **`scripts/clustering_metrics.R`**: R-based clustering evaluation
- **`scripts/visualize_saved_results.py`**: Post-processing visualization

## Output and Results

The framework generates:
- **Denoised data**: Clean single-cell expression matrices
- **Visualizations**: UMAP plots, comparison charts, training curves
- **Metrics**: Clustering accuracy, silhouette scores, biological metrics
- **Embeddings**: Low-dimensional representations for downstream analysis
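
For reference, the silhouette score listed among the metrics can be computed as in the sketch below. This is a plain NumPy illustration of the standard definition using Euclidean distance, not the implementation used by the repository's R scripts; the function name and the toy data are assumptions. It expects at least two clusters.

```python
import numpy as np


def silhouette_score_sketch(X, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix; fine for small inputs, memory-hungry for large ones.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Two well-separated toy clusters.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_score_sketch(points, [0, 0, 1, 1])
bad = silhouette_score_sketch(points, [0, 1, 0, 1])
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters; values near or below 0 indicate that points sit closer to another cluster than to their own.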

## Citation

If you use this code in your research, please cite:

```bibtex
@software{mixed_diffusion,
  title={Mixed Diffusion: A Unified Framework for Single-Cell Data Denoising},
  author={[Authors]},
  year={2024},
  url={https://github.com/[username]/mixed_diffusion}
}
```

## Contributing

Contributions are welcome! Please see our contributing guidelines and submit pull requests for any improvements.

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Support

For questions and issues:
- Open an issue on GitHub
- Check the documentation in the `docs/` directory
- Review example notebooks in the repository root

## Acknowledgments

This work builds upon advances in diffusion models and single-cell genomics analysis. We thank the open-source community for foundational tools including PyTorch, scanpy, and scvi-tools.