# TideGS Minimal - Minimal Working Example

## Overview

This is a minimal working example of TideGS for 3D Gaussian splatting with out-of-core training. Because the full implementation has large dataset requirements and extensive dependencies, we provide this simplified version as supplementary material to lower the barrier to entry for reviewers.

### Why a Minimal Version?

The full codebase requires:
- **Large-scale datasets** (>1TB of scene data)
- **Complex dependencies** (CUDA-specific optimizations, custom kernels, dataset-specific utilities)
- **Hardware-specific tuning** (requires specific GPU types and configurations)

To facilitate reproducibility and allow direct execution during review, we have:
- **Stripped hardware-specific dependencies** while preserving core algorithmic concepts
- **Used synthetic data** instead of requiring actual datasets
- **Simplified data I/O** without sacrificing the three-tier storage mechanism
- **Maintained the essential TideGS pipeline** for demonstration

**After paper acceptance, the complete codebase will be fully open-sourced.**

---

## Quick Start

### Prerequisites
```bash
# Requires Python 3.8+
pip install torch torchvision numpy tqdm
```

### Running the Minimal Example

```bash
python train.py \
    --num_points 500000 \
    --batch_size 8 \
    --iterations 1000 \
    --device cuda \
    --storage_dir ./my_ssd_storage
```

---

## Architecture


### Key Concepts

#### 1. **Morton Code Sorting**
- Spatially sorts Gaussians along a Z-order (Morton) curve
- Adjacent blocks contain spatially adjacent Gaussians
- Improves cache locality and culling efficiency

```python
# From: storage/gaussian_block.py
morton_codes = compute_morton_code(xyz, global_min, global_max)
sorted_indices = torch.argsort(morton_codes)
```
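
The body of `compute_morton_code` is not shown in this README. As a rough illustration of the idea, the plain-Python sketch below quantizes each coordinate to 10 bits inside the scene bounding box and interleaves the bits. The 10-bit quantization and the names `BITS`, `interleave3`, and `morton_code` are assumptions made for this example; the actual function operates on tensors and may differ.

```python
BITS = 10  # assumed quantization: 10 bits per axis -> 30-bit codes


def interleave3(x: int, y: int, z: int) -> int:
    """Interleave the low BITS bits of x, y, z as ...z1 y1 x1 z0 y0 x0."""
    code = 0
    for i in range(BITS):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def morton_code(point, global_min, global_max) -> int:
    """Quantize a 3D point into the scene bounding box, then interleave."""
    cells = (1 << BITS) - 1
    q = []
    for p, lo, hi in zip(point, global_min, global_max):
        t = (p - lo) / (hi - lo) if hi > lo else 0.0
        q.append(min(cells, max(0, int(t * cells))))
    return interleave3(*q)


points = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.12, 0.1, 0.1)]
lo, hi = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
order = sorted(range(len(points)), key=lambda i: morton_code(points[i], lo, hi))
print(order)  # [1, 2, 0]: the two nearby points end up adjacent
```

Sorting by this key is what lets fixed-size slices of the sorted array act as spatially compact blocks.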

#### 2. **Block-Level Frustum Culling**
- Groups Gaussians into fixed-size blocks (default: 4096 points)
- Culls entire blocks by testing each block's AABB against the 6 frustum planes
- Typical cull rate: 80-95% (reduces GPU memory pressure)

```python
# From: storage/gaussian_block.py
class FrustumCuller:
    def cull(self, view_matrix, proj_matrix):
        planes = self._extract_frustum_planes(view_matrix, proj_matrix)
        visible = []
        for block_id in range(self.num_blocks):
            if self._test_aabb_planes(block_id, planes):
                visible.append(block_id)
        return visible
```
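
`_test_aabb_planes` is not shown above. A common way to implement such a test is the "positive vertex" check sketched below in plain Python. It assumes each plane is a tuple `(a, b, c, d)` whose normal points into the frustum; that convention, the function name `aabb_intersects_frustum`, and the toy `unit_planes` frustum are assumptions for illustration, not the repository's code.

```python
def aabb_intersects_frustum(box_min, box_max, planes):
    """Return False only if the box lies fully outside at least one plane.

    Each plane is (a, b, c, d) with the normal pointing into the frustum,
    so a point p is inside when a*px + b*py + c*pz + d >= 0.
    """
    for a, b, c, d in planes:
        # Pick the box corner that lies farthest along the plane normal.
        px = box_max[0] if a >= 0 else box_min[0]
        py = box_max[1] if b >= 0 else box_min[1]
        pz = box_max[2] if c >= 0 else box_min[2]
        if a * px + b * py + c * pz + d < 0:
            return False  # even the most favorable corner is outside
    return True


# Toy "frustum": the cube [-1, 1]^3 written as six inward-facing planes.
unit_planes = [
    (1, 0, 0, 1), (-1, 0, 0, 1),
    (0, 1, 0, 1), (0, -1, 0, 1),
    (0, 0, 1, 1), (0, 0, -1, 1),
]

inside = aabb_intersects_frustum((-0.5, -0.5, -0.5), (0.5, 0.5, 0.5), unit_planes)
straddling = aabb_intersects_frustum((0.5, 0.0, 0.0), (1.5, 0.5, 0.5), unit_planes)
outside = aabb_intersects_frustum((2.0, 0.0, 0.0), (3.0, 0.5, 0.5), unit_planes)
```

The test is conservative: it never rejects a visible block, but it can keep a block that sits outside the frustum near a corner, which is acceptable for culling.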

#### 3. **Hotspot Retention**
- Reuses blocks across consecutive batches
- Avoids redundant GPU↔RAM transfers
- ~50% reduction in memory bandwidth usage

```python
# From: strategies/gpu_working_set.py
hotspot_blocks = visible_set & prev_set  # Cross-batch intersection
if enable_retention and block_id in hotspot_blocks:
    gpu_tensors[...] = self.previous_block_data[block_id]  # Reuse
```

#### 4. **Three-Tier Pipeline**
- **Iteration N**: GPU renders current blocks, RAM prefetches iteration N+1
- **Iteration N+1**: GPU can immediately use prefetched blocks with minimal stall
- Overlaps I/O with computation

```python
# From: storage/storage_adapter.py
def prefetch_for_next_iteration(self, iteration, batch_size, training_schedule):
    # Look ahead to the blocks needed by the next iteration
    # (the computation of `all_blocks` is elided in this excerpt)
    self.cache.prefetch(list(all_blocks))
```
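
The overlap described above can be sketched with a single background worker: while iteration N is being processed, the blocks for iteration N+1 are already loading. This is a minimal stdlib illustration, not the repository's implementation; `load_blocks` and `run_pipeline` are hypothetical stand-ins for the SSD → RAM read and the training loop.

```python
from concurrent.futures import ThreadPoolExecutor


def load_blocks(block_ids):
    """Stand-in for an SSD -> RAM read; returns dummy payloads."""
    return {b: f"block-{b}" for b in block_ids}


def run_pipeline(schedule):
    """schedule[i] is the list of block ids needed at iteration i."""
    if not schedule:
        return []
    used = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_blocks, schedule[0])
        for i in range(len(schedule)):
            blocks = pending.result()  # wait for this iteration's blocks
            if i + 1 < len(schedule):
                # Start loading iteration i+1 before "rendering" iteration i.
                pending = pool.submit(load_blocks, schedule[i + 1])
            used.append(sorted(blocks))  # placeholder for GPU rendering
    return used


result = run_pipeline([[0, 1], [1, 2], [3]])
```

With real I/O, the stall at `pending.result()` shrinks to whatever part of the load did not fit inside the previous iteration's compute time.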

---

## File Structure

```
TideGS_code/
├── storage/
│   ├── __init__.py                  # Module exports
│   ├── log_storage_manager.py        # SSD tier (Log-Structured Storage)
│   │                                  # - read_blocks()
│   │                                  # - write_patch()
│   │                                  # - Append-only log structure
│   ├── tiered_cache_manager.py       # RAM tier (Cache Management)
│   │                                  # - prefetch()
│   │                                  # - LRU eviction
│   │                                  # - Dirty block tracking
│   ├── gaussian_block.py             # Data structures
│   │                                  # - GaussianBlock
│   │                                  # - compute_morton_code()
│   │                                  # - FrustumCuller
│   └── storage_adapter.py            # Core orchestrator
│                                      # - SSDStorageAdapter
│                                      # - get_visible_blocks()
│                                      # - prefetch_for_next_iteration()
├── strategies/
│   ├── __init__.py
│   ├── gpu_working_set.py            # GPU tier (Working Set Management)
│   │                                  # - GPUWorkingSet
│   │                                  # - load_visible_blocks_with_retention()
│   │                                  # - Hotspot tracking
│   └── engine.py                     # Training loop core
│                                      # - ssd_offload_train_one_batch()
│                                      # - calculate_filters()
├── train.py                          # Main training script
└── README.md                           # This file
```

---

## Data Layout

Each Gaussian point occupies **59 float32 values** (236 bytes):

| Component | Dims | Bytes | Purpose |
|-----------|------|-------|---------|
| xyz | 3 | 12 | Position |
| scaling | 3 | 12 | Log scale |
| rotation | 4 | 16 | Quaternion |
| opacity | 1 | 4 | Alpha |
| features_dc | 3 | 12 | DC SH coefficients |
| features_rest | 45 | 180 | High-order SH coefficients |
| **Total** | **59** | **236** | |

### Block Structure
- Default block size: **4096 Gaussians**
- Per-block memory: **~1MB** (4096 × 236 bytes = 966,656 bytes)
- 1M Gaussians → 245 blocks (1,000,000 / 4096, rounded up)
- The resulting working set fits in GPU memory (even on GPUs with <8GB VRAM)
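
The figures above follow directly from the layout table. The short calculation below reproduces them; the helper name `block_stats` is introduced here only for illustration.

```python
import math

# xyz, scaling, rotation, opacity, features_dc, features_rest
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 3 + 45
BYTES_PER_GAUSSIAN = FLOATS_PER_GAUSSIAN * 4  # float32
BLOCK_SIZE = 4096


def block_stats(num_gaussians, block_size=BLOCK_SIZE):
    """Return (number of blocks, bytes per block, total bytes)."""
    num_blocks = math.ceil(num_gaussians / block_size)
    block_bytes = block_size * BYTES_PER_GAUSSIAN
    total_bytes = num_gaussians * BYTES_PER_GAUSSIAN
    return num_blocks, block_bytes, total_bytes


print(FLOATS_PER_GAUSSIAN, BYTES_PER_GAUSSIAN)  # 59 236
print(block_stats(1_000_000))  # (245, 966656, 236000000)
```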

---

## Key Implementation Details

### 1. SSD Storage Layer (`log_storage_manager.py`)

**Purpose**: Cold storage on SSD with log-structured writes

```python
from pathlib import Path

class LogStorageManager:
    def __init__(self, storage_dir, base_file_path):
        self.base_file = base_file_path          # Initial snapshot
        self.patches_dir = Path(storage_dir) / "patches"
        self.block_index = {}                    # Block → file location

    def write_patch(self, block_id, data):
        # Append-only write of modified block
        # Only store changed blocks (not full rewrite)
        ...  # body elided in this excerpt

    def read_blocks(self, block_ids):
        # Read from base_file + apply patches
        # Efficient sequential reads
        ...  # body elided in this excerpt
```

**Key features:**
- Append-only log (no random writes)
- Patch-based updates (only modified blocks)
- Sequential read optimization
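
To make the append-only idea concrete, here is a small self-contained sketch. It is not the `LogStorageManager` implementation: the class `PatchLog`, its single-file layout, and the `(offset, length)` index are assumptions chosen to show how the newest patch for a block can win without rewriting earlier data.

```python
import os
import tempfile


class PatchLog:
    """Append-only log: the newest patch written for a block wins on read."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # block_id -> (offset, length) of the latest patch
        open(path, "ab").close()  # create the log file if it is missing

    def write_patch(self, block_id, data: bytes):
        offset = os.path.getsize(self.path)  # new data always goes at the end
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[block_id] = (offset, len(data))

    def read_block(self, block_id) -> bytes:
        offset, length = self.index[block_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    log = PatchLog(os.path.join(tmp, "patches.log"))
    log.write_patch(7, b"old")
    log.write_patch(3, b"other")
    log.write_patch(7, b"newer")  # supersedes the first patch for block 7
    latest = log.read_block(7)
    log_size = os.path.getsize(log.path)

print(latest, log_size)  # b'newer' 13
```

Stale patches stay in the file until some form of compaction runs; this sketch leaves that out.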

### 2. RAM Cache Layer (`tiered_cache_manager.py`)

**Purpose**: Intermediate cache with LRU eviction and prefetch

```python
from collections import deque

class TieredCacheManager:
    def __init__(self, max_ram_gb=16.0):
        self.cache = {}                          # block_id → tensor
        self.dirty_blocks = set()                # Modified blocks
        self.access_order = deque()              # LRU tracking

    def prefetch(self, needed_block_ids):
        # Look ahead: load blocks for next iterations
        # Overlaps I/O with computation
        ...  # body elided in this excerpt

    def evict_lru(self):
        # Remove least recently used blocks
        # Write dirty blocks back to SSD
        ...  # body elided in this excerpt
```

**Key features:**
- LRU eviction (minimize memory usage)
- Dirty tracking (only sync changed blocks)
- Async prefetch (hide I/O latency)
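
The interaction between LRU eviction and dirty tracking can be sketched as follows. Note that this example uses `collections.OrderedDict` rather than the separate `dict` plus `deque` shown above, and that `LRUBlockCache`, `capacity` (counted in blocks rather than gigabytes), and the `writeback` callback are hypothetical names for illustration only.

```python
from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity, writeback):
        self.capacity = capacity      # maximum number of cached blocks
        self.writeback = writeback    # called as writeback(block_id, data)
        self.blocks = OrderedDict()   # least recently used entry first
        self.dirty = set()

    def get(self, block_id):
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data, dirty=False):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if dirty:
            self.dirty.add(block_id)
        while len(self.blocks) > self.capacity:
            old_id, old_data = self.blocks.popitem(last=False)
            if old_id in self.dirty:  # only modified blocks are written back
                self.writeback(old_id, old_data)
                self.dirty.discard(old_id)


flushed = []
cache = LRUBlockCache(capacity=2, writeback=lambda b, d: flushed.append(b))
cache.put(0, "a", dirty=True)
cache.put(1, "b")
cache.get(0)       # block 0 becomes the most recently used
cache.put(2, "c")  # evicts block 1 (clean, no write-back)
cache.put(3, "d")  # evicts block 0 (dirty, written back)
print(flushed, list(cache.blocks))  # [0] [2, 3]
```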

### 3. GPU Working Set (`strategies/gpu_working_set.py`)

**Purpose**: GPU-side management with hotspot retention

```python
class GPUWorkingSet:
    def load_visible_blocks_with_retention(self, visible_block_ids, 
                                          active_blocks_ram,
                                          enable_retention=True):
        visible_set = set(visible_block_ids)
        prev_set = set(self.previous_blocks)
        
        # Hotspot: blocks visible in both current and previous iteration
        hotspot_blocks = visible_set & prev_set if enable_retention else set()
        cold_blocks = visible_set - hotspot_blocks
        
        # Reuse hotspot blocks (already on GPU)
        # Load cold blocks from RAM
        # (construction of gpu_tensors and retention_stats is elided here)

        self.previous_blocks = list(visible_set)
        return gpu_tensors, retention_stats
```

**Key features:**
- Cross-batch block reuse (~50% reduction in memory bandwidth usage)
- Explicit hotspot tracking
- Retention statistics for monitoring
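
The hotspot/cold split itself is plain set arithmetic. The sketch below isolates it from any GPU code; the function name `split_hotspot` and the fields of the returned statistics dictionary are assumptions, since the actual contents of `retention_stats` are not shown here.

```python
def split_hotspot(visible_block_ids, previous_block_ids, enable_retention=True):
    """Partition visible blocks into reused (hotspot) and newly loaded (cold)."""
    visible = set(visible_block_ids)
    previous = set(previous_block_ids)
    hotspot = visible & previous if enable_retention else set()
    cold = visible - hotspot
    stats = {
        "reused": len(hotspot),
        "loaded": len(cold),
        "retention_rate": len(hotspot) / len(visible) if visible else 0.0,
    }
    return hotspot, cold, stats


hotspot, cold, stats = split_hotspot([1, 2, 3, 4], [3, 4, 5])
print(sorted(hotspot), sorted(cold), stats["retention_rate"])  # [3, 4] [1, 2] 0.5
```

Only the `cold` set needs a RAM → GPU transfer; with retention disabled, every visible block is treated as cold.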

### 4. Training Engine (`strategies/engine.py`)

**Purpose**: Main per-batch training loop using the TideGS storage pipeline

```python
def ssd_offload_train_one_batch(gaussians, batched_cameras, 
                               storage_adapter, ...):
    # 1. Visibility culling
    visible_block_ids = set()
    for camera in batched_cameras:
        blocks = storage_adapter.get_visible_blocks(camera.global_idx)
        visible_block_ids.update(blocks)
    
    # 2. Load blocks with hotspot retention
    #    (the active_blocks_ram argument is omitted in this simplified excerpt)
    gpu_tensors, _ = gaussians.gpu_working_set_manager\
        .load_visible_blocks_with_retention(visible_block_ids)
    
    # 3. Update Gaussian parameters from GPU tensors
    gaussians._xyz = nn.Parameter(gpu_tensors['xyz'])
    # ... (scaling, rotation, opacity, features)
    
    # 4. Render and compute loss
    for camera in batched_cameras:
        rendered_image = render_fn(gaussians, camera)
        loss = loss_fn(rendered_image, camera.original_image)
        loss.backward()
    
    # 5. Optimizer step
    gaussians.optimizer.step()
    
    # 6. Update RAM cache with modified blocks
    storage_adapter.update_ram_cache(gaussians, visible_block_ids)
```

---


## Training Arguments

```bash
python train.py [OPTIONS]
```

| Argument | Default | Description |
|----------|---------|-------------|
| `--num_points` | 100000 | Number of Gaussian points |
| `--batch_size` | 4 | Batch size (cameras per iteration) |
| `--iterations` | 100 | Training iterations |
| `--device` | cuda | Device (cuda or cpu) |
| `--storage_dir` | ./ssd_storage | SSD storage directory |
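
For reference, an `argparse` setup consistent with this table could look like the sketch below. It mirrors only the documented flags and defaults; the parser in `train.py` is not reproduced in this README, so the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="TideGS minimal example")
    parser.add_argument("--num_points", type=int, default=100000,
                        help="Number of Gaussian points")
    parser.add_argument("--batch_size", type=int, default=4,
                        help="Cameras per iteration")
    parser.add_argument("--iterations", type=int, default=100,
                        help="Training iterations")
    parser.add_argument("--device", type=str, default="cuda",
                        choices=["cuda", "cpu"], help="Compute device")
    parser.add_argument("--storage_dir", type=str, default="./ssd_storage",
                        help="SSD storage directory")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--num_points", "500000", "--batch_size", "8"])
print(args.num_points, args.batch_size, args.iterations, args.device)
```

Flags that are not passed fall back to the defaults in the table, so `python train.py` with no arguments runs the smallest configuration.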

---

## Reproducibility & Citation

This minimal implementation demonstrates the core concepts of the TideGS method:
- **Three-tier storage** (SSD → RAM → GPU)
- **Frustum culling** with Morton-ordered blocks
- **Hotspot retention** for cross-batch reuse
- **Prefetch pipeline** for I/O-compute overlap

For the full implementation with:
- Complete datasets
- Optimized CUDA kernels
- Comprehensive benchmarks
- All experimental configurations

**Please refer to the full codebase (to be released upon paper acceptance).**

---

## Contact & Support

For questions about this minimal example or the full TideGS implementation, please refer to the paper and the full codebase release.

---

## License

This minimal working example is provided for review purposes. The full implementation will be released under [appropriate license] upon paper acceptance.
