This repository contains the implementation code for MDShortcut.

## Project Structure

The codebase is organized into three main directories:

- **`src/`** - Source code implementation
  - `main.py` - Main entry point for training and inference
  - `pipeline.py` - Training, testing, and inference pipelines
  - `data.py` - Dataset loading and preprocessing utilities
  - `utils.py` - Utility functions and TensorBoard logging
  - `models/` - Neural network architectures
    - `denoisers/` - Denoising model implementations (EGNN-based)
    - `networks/` - Core network architectures
    - `modules/` - Supporting modules (schedulers, losses, guidance functions)

- **`scripts/`** - Analysis and evaluation scripts
  - `SiO2_analysis/` - Silica structure analysis tools
  - `Si_analysis/` - Silicon structure analysis tools

- **`settings/`** - Configuration files (if present)
  - YAML/JSON configuration files for different experiments

## Key Features

The MDShortcut framework supports:

- **EGNN-based Denoising Models**: Graph neural networks for atomic structure generation
- **Flexible Schedulers**: SDE- and flow-based noise schedules for diffusion processes
- **Property Conditioning**: Conditional generation based on material properties
- **Guidance Functions**: Custom guidance for charge balance and other constraints
- **Multiple Loss Functions**: L1, L2, Huber, and KL divergence losses with configurable weighting
- **Model Compilation**: PyTorch 2.0 compilation for improved performance
- **Trajectory Saving**: Full diffusion trajectories for analysis and visualization

## Setup

### Dependencies

The implementation requires:
- Python 3.8+
- PyTorch 2.0+
- ASE (Atomic Simulation Environment)
- NumPy, SciPy, PyYAML
- TensorBoard for logging
- tqdm for progress bars
- python-dotenv and requests (included in the install command below)

### Installation

```bash
# Clone the repository
git clone <repository_url>
cd MDShortcut-code

# Install dependencies
pip install torch ase numpy scipy pyyaml tensorboard tqdm python-dotenv requests
```

### Environment Variables

Set the following environment variables:

```bash
export SAVE_ROOT_DIR="cache"  # Directory for saving models, logs, and results
export NOTIFY_URL="<notification_service_url>"  # Optional: for experiment notifications
export NOTIFY_TOKEN="<notification_token>"     # Optional: authentication token
```
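
The sketch below shows one way these variables might be read in Python. It assumes that `SAVE_ROOT_DIR` falls back to `cache` when unset and that notifications are skipped unless both `NOTIFY_URL` and `NOTIFY_TOKEN` are present. The helper name `read_env_settings` and both fallback behaviors are illustrative assumptions, not taken from `src/`.

```python
import os
from pathlib import Path


def read_env_settings(env=None):
    """Collect the environment settings described above (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: default to "cache" when SAVE_ROOT_DIR is not set.
    save_root = Path(env.get("SAVE_ROOT_DIR", "cache"))
    url = env.get("NOTIFY_URL")
    token = env.get("NOTIFY_TOKEN")
    return {
        "save_root": save_root,
        "notify_url": url,
        "notify_token": token,
        # Assumption: notifications need both the URL and the token.
        "notify_enabled": bool(url and token),
    }


if __name__ == "__main__":
    print(read_env_settings())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.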

## Usage

### Command Line Interface

The main entry point accepts the following arguments:

```bash
python src/main.py -s <settings_file> -g <gpu_id> [--compile]
```

**Arguments:**
- `-s, --setting`: Path to YAML/JSON configuration file (required)
- `-g, --cuda`: GPU device index (default: 0, use -1 for CPU)
- `--compile`: Enable PyTorch model compilation for performance
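
For reference, the documented flags could be reproduced with `argparse` roughly as follows. This is a sketch of the interface described above, not the parser in `src/main.py`. The function names `build_parser` and `device_string` are hypothetical, as is the mapping of the index to a `cuda:<n>` device string.

```python
import argparse


def build_parser():
    """Hypothetical parser reproducing the documented flags."""
    parser = argparse.ArgumentParser(description="MDShortcut entry point (sketch)")
    parser.add_argument("-s", "--setting", required=True,
                        help="Path to a YAML/JSON configuration file")
    parser.add_argument("-g", "--cuda", type=int, default=0,
                        help="GPU device index; -1 selects the CPU")
    parser.add_argument("--compile", action="store_true",
                        help="Enable PyTorch model compilation")
    return parser


def device_string(cuda_index):
    """Map the -g value to a PyTorch-style device string (assumed convention)."""
    return "cpu" if cuda_index < 0 else f"cuda:{cuda_index}"


if __name__ == "__main__":
    # Parse an explicit argument list so the example runs without a real command line.
    args = build_parser().parse_args(["-s", "train.yaml", "-g", "-1"])
    print(args.setting, device_string(args.cuda), args.compile)
```

Because the parser defines no options that look like negative numbers, `argparse` accepts `-g -1` as a value rather than as an unknown flag.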

### Configuration System

Configuration files specify all aspects of training and inference:

#### Model Configuration
```yaml
model:
  name: "egnn_denoiser"
  params:
    r_cut: 5.0
    elements: ["C", "N", "O", "H"]
    properties:
      density:
        dim: 1
        d_prop_embed: 8
    # Additional EGNN parameters...
```

#### Scheduler Configuration
```yaml
scheduler:
  name: "sde"  # or "flow"
  params:
    sigma_max_pos: 1.0
    sigma_max_el: 1.0
    t_min: 0.01
    t_max: 1.0
```

#### Loss Configuration
```yaml
loss:
  params:
    norm_type: "huber"
    element_norm_type: "l2"
    position_weight: 1.0
    element_weight: 0.1
```
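
To illustrate how the two weights might enter the objective, the sketch below combines a position term and an element term into one scalar. It assumes a standard Huber loss with `delta = 1.0`, mean reduction, and a simple weighted sum; none of these details are confirmed by the configuration above. It uses plain Python lists to show the arithmetic, whereas the repository presumably operates on PyTorch tensors. All function names are hypothetical.

```python
def huber(residuals, delta=1.0):
    """Mean Huber loss: quadratic for |r| <= delta, linear beyond it."""
    total = 0.0
    for r in residuals:
        a = abs(r)
        total += 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return total / len(residuals)


def l2(residuals):
    """Mean squared error."""
    return sum(r * r for r in residuals) / len(residuals)


def l1(residuals):
    """Mean absolute error."""
    return sum(abs(r) for r in residuals) / len(residuals)


NORMS = {"huber": huber, "l2": l2, "l1": l1}


def total_loss(pos_residuals, el_residuals, params):
    """Weighted sum of the position and element terms (assumed combination rule)."""
    pos_term = NORMS[params["norm_type"]](pos_residuals)
    el_term = NORMS[params["element_norm_type"]](el_residuals)
    return params["position_weight"] * pos_term + params["element_weight"] * el_term


params = {
    "norm_type": "huber",
    "element_norm_type": "l2",
    "position_weight": 1.0,
    "element_weight": 0.1,
}
loss = total_loss([0.5, 2.0], [1.0, 3.0], params)
```

With `element_weight: 0.1`, the element term contributes one tenth as much per unit of loss as the position term.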

#### Training Configuration
```yaml
train:
  enabled: true
  data:
    dataset:
      atom_src:
        type: "extxyz"
        params:
          file: "path/to/structures.extxyz"
      property_src:
        type: "file"
        params:
          file: "path/to/properties.json"
    dataloader:
      batch_size: 32
      num_workers: 4
      shuffle: true
  params:
    lr: 1.0e-4  # decimal point included: PyYAML reads a bare 1e-4 as a string
    num_epochs: 1000
    clip_grad_norm: 1.0
```
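
A minimal sketch of loading such a file and reading nested values is shown below. It assumes the format is chosen by file extension and that the sections are plain nested mappings, as in the examples above. The helpers `load_settings` and `get_nested` are hypothetical and not part of `src/`.

```python
import json
from pathlib import Path


def load_settings(path):
    """Load a YAML or JSON settings file, chosen by file extension."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML, imported lazily so JSON-only use needs no extra package
        return yaml.safe_load(text)
    if suffix == ".json":
        return json.loads(text)
    raise ValueError(f"Unsupported settings format: {path.suffix}")


def get_nested(config, dotted_key, default=None):
    """Look up a value such as 'train.data.dataloader.batch_size'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

For example, `get_nested(cfg, "train.params.clip_grad_norm")` returns the configured value, or `None` if any level of the path is missing.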

### Example Usage

**Training a model:**
```bash
python src/main.py -s settings/train.yaml -g 0 --compile
```

**Running inference:**
```bash
python src/main.py -s settings/infer.yaml -g 0
```

## Data Formats

The implementation supports:

### Atomic Structures
- **ExtXYZ files**: Standard extended XYZ format with properties in info section
- **Empty cells**: Generate empty unit cells with specified dimensions

### Material Properties
- **JSON files**: Property data in JSON format
- **YAML files**: Property data in YAML format
- **Multiple files**: Merge properties from multiple sources
- **Augmented properties**: Generate properties with specified distributions

#### Expected File Structure
```
datasets/
├── material_name/
│   ├── structures.extxyz
│   └── properties.json
```

#### Property File Format
```json
[
  {
    "density": 2.65,
    "elastic_modulus": 70.0,
    "property_name": value
  },
  ...
]
```
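
The sketch below reads a property file of this shape and extracts one property across all records. It assumes that record `i` describes structure `i` in the ExtXYZ file, which this README does not state. The helper names `load_properties` and `property_column` are hypothetical.

```python
import json


def load_properties(path, n_structures=None):
    """Read a JSON list of property objects and optionally check its length."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("Expected a JSON list of property objects")
    # Assumption: record i describes structure i in the ExtXYZ file.
    if n_structures is not None and len(records) != n_structures:
        raise ValueError(
            f"{len(records)} property records for {n_structures} structures"
        )
    return records


def property_column(records, name):
    """Collect one property across all records; raises KeyError if a record lacks it."""
    return [float(r[name]) for r in records]
```

Checking the record count against the number of structures catches misaligned files early, before any training step runs.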

## Advanced Features

### Training Shortcuts
Enable shortcut training to improve sample quality when sampling with few steps:

```yaml
shortcut:
  enabled: true
  shortcut_per: 0.1  # Apply to 10% of batches
  reverse_step_params:
    num_cand: 16
```
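
One plausible reading of `shortcut_per` is a per-batch probability. The sketch below draws that decision independently for each batch, which is an assumption: the repository may instead use a fixed cadence or split each batch. The function name `use_shortcut` is hypothetical.

```python
import random


def use_shortcut(rng, shortcut_per):
    """Return True for roughly a `shortcut_per` fraction of calls."""
    return rng.random() < shortcut_per


# Simulate the decision over many batches with a fixed seed.
rng = random.Random(0)
flags = [use_shortcut(rng, 0.1) for _ in range(10_000)]
fraction = sum(flags) / len(flags)  # close to 0.1 for a large number of batches
```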

### Guidance Functions
Use guidance for conditional generation:

```yaml
guidance:
  name: "charge_balance"
  params:
    strength: 1.0
```
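
As a rough illustration of what a charge-balance constraint measures, the sketch below computes the net formal charge of a composition and a quadratic penalty scaled by `strength`. The oxidation states (Si +4, O -2), the quadratic form, and the function names are all assumptions made for illustration. The actual guidance function in `models/modules/` may work quite differently, for example on gradients of continuous element scores.

```python
FORMAL_CHARGES = {"Si": 4, "O": -2}  # assumed oxidation states, for illustration only


def net_charge(symbols, charges=FORMAL_CHARGES):
    """Sum the formal charges of all atoms in a structure."""
    return sum(charges[s] for s in symbols)


def charge_balance_penalty(symbols, strength=1.0, charges=FORMAL_CHARGES):
    """Quadratic penalty that vanishes for a charge-neutral composition."""
    q = net_charge(symbols, charges)
    return strength * q * q
```

A stoichiometric SiO2 unit (`["Si", "O", "O"]`) is neutral and incurs no penalty, while a missing oxygen leaves a net charge of +2.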

### Trajectory Saving
Save full diffusion trajectories for analysis:

```yaml
infer:
  params:
    save:
      trajectories:
        enabled: true
        batches: [0, 1, 2]  # or "all"
```
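
The `batches` field accepts either a list of batch indices or the string `"all"`. A small helper that interprets it might look like the sketch below. The function name is hypothetical, and treating a missing `batches` key as `"all"` is an assumption.

```python
def should_save_trajectory(batch_index, traj_cfg):
    """Decide whether to store the trajectory of a given inference batch."""
    if not traj_cfg.get("enabled", False):
        return False
    # Assumption: a missing "batches" key means every batch is saved.
    batches = traj_cfg.get("batches", "all")
    if batches == "all":
        return True
    return batch_index in batches
```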

## Output Structure

Results are saved under the directory named by `SAVE_ROOT_DIR` (`cache` in the example above), with the following structure:

```
cache/
├── models/          # Model checkpoints
│   └── experiment_name/
│       ├── 00100.pt
│       └── final.pt
├── log/            # TensorBoard logs
│   └── experiment_name/
└── infer/          # Inference results
    └── experiment_name/
        ├── inferred.extxyz
        ├── given_properties.json
        └── trajectories/
```
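
The layout above can be navigated with `pathlib`. The sketch below builds the three per-experiment directories and picks the highest-numbered checkpoint. Both helper names are hypothetical, and the sketch assumes that numbered checkpoints such as `00100.pt` use purely numeric file stems; whether the number counts epochs or steps is not stated here.

```python
import re
from pathlib import Path


def experiment_dirs(save_root, experiment_name):
    """Return the model, log, and inference directories for one experiment."""
    root = Path(save_root)
    return {
        "models": root / "models" / experiment_name,
        "log": root / "log" / experiment_name,
        "infer": root / "infer" / experiment_name,
    }


def latest_checkpoint(models_dir):
    """Return the highest-numbered checkpoint file, or None if there is none."""
    numbered = [
        p for p in Path(models_dir).glob("*.pt")
        if re.fullmatch(r"\d+", p.stem)  # skips non-numeric names such as final.pt
    ]
    return max(numbered, key=lambda p: int(p.stem), default=None)
```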

## Analysis Scripts

The `scripts/` directory contains analysis tools:

- **SiO2 Analysis**: Silica structure analysis including RDF, bond angles, ring statistics
- **Si Analysis**: Silicon structure analysis tools
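
As an illustration of the kind of quantity these tools report, the sketch below computes a radial distribution function (RDF) for a single species. It is not the code in `scripts/SiO2_analysis/`. It assumes an orthorhombic periodic box, the minimum-image convention, and normalization by the ideal-gas density `N / V`. The function name and signature are hypothetical.

```python
import math


def radial_distribution(positions, box, r_max, n_bins):
    """Return bin centers and g(r) for atoms in an orthorhombic periodic box."""
    if r_max > min(box) / 2:
        raise ValueError("r_max must not exceed half the shortest box length")
    n = len(positions)
    dr = r_max / n_bins
    counts = [0] * n_bins
    for i in range(n):
        for j in range(i + 1, n):
            d2 = 0.0
            for k in range(3):
                delta = positions[i][k] - positions[j][k]
                delta -= box[k] * round(delta / box[k])  # minimum-image convention
                d2 += delta * delta
            r = math.sqrt(d2)
            if r < r_max:
                # Count the pair once for each of its two atoms.
                counts[min(int(r / dr), n_bins - 1)] += 2
    density = n / (box[0] * box[1] * box[2])
    g = []
    for b, c in enumerate(counts):
        # Volume of the spherical shell between b*dr and (b+1)*dr.
        shell = 4.0 / 3.0 * math.pi * ((b + 1) ** 3 - b ** 3) * dr ** 3
        g.append(c / (n * density * shell))
    centers = [(b + 0.5) * dr for b in range(n_bins)]
    return centers, g
```

The double loop scales quadratically with the number of atoms, so this form suits small cells only; larger structures would normally use a neighbor list.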

