## Complete Workflow

This section walks through generating data, training, and evaluating ACE models step by step.

### 1. Generate Offline Data

#### Tabular Regression Data Generation

The project includes a tabular regression data generator that creates synthetic regression tasks with variable feature dimensions. The generator is intended for training ACE models on tabular data, with Hugging Face integration.

##### Tabular Dataset Generation (5GB Dataset)

Generate a large-scale tabular regression dataset for training:

```bash
# Generate 5GB tabular dataset with variable 2-3D features
uv run python generate_offline_tabular.py \
    --target_size_mb 5000 \
    --batch_size 64 \
    --num_features "2,3" \
    --num_context "128" \
    --num_buffer 32 \
    --num_target 256 \
    --dtype float32 \
    --output_dir data/tabular/5gb
```

**Dataset Specifications:**
- **Size**: ~5GB (12,787 batches × 64 samples)
- **Feature dimensions**: Randomly alternates between 2D and 3D
- **Context points**: Fixed at 128 per task
- **Buffer points**: Fixed at 32 (for ACE drafting)
- **Target points**: Fixed at 256
- **Data type**: float32 for efficiency
- **Storage**: 128 chunk files (up to 100 batches per chunk; 12,787 batches leave 87 for the final chunk)
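
The figures in this list can be sanity-checked with a little arithmetic. The sketch below recomputes the chunk count and the raw float32 payload from the numbers above. It assumes that chunks are filled in order, that 2D and 3D batches are equally likely, and that every point stores `dim_x` inputs plus one output; serialization overhead is ignored, so the result is only an estimate.

```python
import math

# Values taken from the dataset specifications above.
NUM_BATCHES = 12_787
BATCH_SIZE = 64
BATCHES_PER_CHUNK = 100
NUM_CONTEXT, NUM_BUFFER, NUM_TARGET = 128, 32, 256
BYTES_PER_FLOAT32 = 4


def num_chunks(num_batches: int, batches_per_chunk: int) -> int:
    """Number of chunk files, counting a final partially filled chunk."""
    return math.ceil(num_batches / batches_per_chunk)


def batch_bytes(dim_x: int) -> int:
    """Raw float32 payload of one batch (dim_x inputs + 1 output per point)."""
    points = NUM_CONTEXT + NUM_BUFFER + NUM_TARGET
    return BATCH_SIZE * points * (dim_x + 1) * BYTES_PER_FLOAT32


chunks = num_chunks(NUM_BATCHES, BATCHES_PER_CHUNK)
last_chunk_batches = NUM_BATCHES - (chunks - 1) * BATCHES_PER_CHUNK

# Assumption: 2D and 3D batches are equally likely, so average both sizes.
avg_batch_bytes = (batch_bytes(2) + batch_bytes(3)) / 2
total_gb = NUM_BATCHES * avg_batch_bytes / 1e9

print(chunks, last_chunk_batches, round(total_gb, 2))  # 128 87 4.77
```

The raw payload of about 4.77 GB is in line with the ~5GB figure once file format overhead is taken into account.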

**Generated Structure:**
```
data/tabular/5gb/
├── metadata.json         # Dataset configuration
├── chunk_0.pt           # First 100 batches
├── chunk_1.pt           # Next 100 batches
└── ...                  # Up to chunk_127.pt
```
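
When iterating over this directory, note that a plain lexicographic sort would place `chunk_10.pt` before `chunk_2.pt`. The helper below is a minimal sketch of listing chunks in numeric order; `list_chunks` is a hypothetical name and not part of the project.

```python
import re
from pathlib import Path

# Matches the chunk naming scheme shown in the directory listing above.
_CHUNK_RE = re.compile(r"chunk_(\d+)\.pt")


def list_chunks(data_dir):
    """Return chunk paths sorted by numeric index instead of lexicographically."""
    indexed = []
    for path in Path(data_dir).iterdir():
        match = _CHUNK_RE.fullmatch(path.name)
        if match:  # skips metadata.json and any unrelated files
            indexed.append((int(match.group(1)), path))
    return [path for _, path in sorted(indexed)]


# Example (requires the generated dataset to exist):
#   for chunk_path in list_chunks("data/tabular/5gb"):
#       ...
```

Since the files use the `.pt` extension, each returned path can presumably be passed to `torch.load`, although the exact object stored in a chunk is not specified here.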

**Data Format:**
Each chunk contains tasks with:
- `xc`: Context inputs [batch_size, num_context, dim_x]
- `yc`: Context outputs [batch_size, num_context, 1]
- `xb`: Buffer inputs [batch_size, num_buffer, dim_x]
- `yb`: Buffer outputs [batch_size, num_buffer, 1]
- `xt`: Target inputs [batch_size, num_target, dim_x]
- `yt`: Target outputs [batch_size, num_target, 1]

Where `dim_x` varies between 2 and 3 across tasks in each batch.
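
A quick shape check can catch mismatched batches before training. The sketch below assumes a batch is a dict keyed by the field names above, with values exposing a `.shape` attribute (as PyTorch tensors do); `check_batch` and the `fake` stand-in are hypothetical helpers used only for illustration.

```python
from types import SimpleNamespace

# (inputs, outputs) pairs as listed in the data format above.
PAIRS = (("xc", "yc"), ("xb", "yb"), ("xt", "yt"))


def check_batch(batch, allowed_dims=(2, 3)):
    """Validate the documented shapes of one batch and return its dim_x."""
    batch_size, _, dim_x = batch["xc"].shape
    if dim_x not in allowed_dims:
        raise ValueError(f"unexpected dim_x={dim_x}")
    for x_key, y_key in PAIRS:
        x_shape = tuple(batch[x_key].shape)
        y_shape = tuple(batch[y_key].shape)
        # Inputs must be [batch_size, num_points, dim_x].
        if len(x_shape) != 3 or x_shape[0] != batch_size or x_shape[2] != dim_x:
            raise ValueError(f"{x_key} has shape {x_shape}")
        # Outputs must be [batch_size, num_points, 1] with matching num_points.
        expected = (batch_size, x_shape[1], 1)
        if y_shape != expected:
            raise ValueError(f"{y_key} has shape {y_shape}, expected {expected}")
    return dim_x


def fake(*shape):
    """Stand-in for a tensor: only carries a shape."""
    return SimpleNamespace(shape=shape)


example = {
    "xc": fake(64, 128, 3), "yc": fake(64, 128, 1),
    "xb": fake(64, 32, 3), "yb": fake(64, 32, 1),
    "xt": fake(64, 256, 3), "yt": fake(64, 256, 1),
}
print(check_batch(example))  # 3
```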

#### GP Data Generation for Reproducible Training

The project includes a GP data generation pipeline designed for numerical stability and reproducibility. The generated datasets support experiments with variable context sizes and multiple buffer configurations.

##### Dataset Specifications

**Training Dataset:**
- **Size**: 10,000 batches × 128 samples = 1,280,000 GP functions
- **Context points**: Randomly selected from [4, 8, 16, 32, 48, 64, 128, 192] per batch
- **Buffer configurations**: 3 versions created from the same data:
  - 16 buffer points (original)
  - 8 buffer points (first 8 from original)
  - 4 buffer points (first 4 from original)
- **Target points**: 256
- **Total points per function**: context + buffer + target (276-464 points for the 16-buffer version)

**Test/Validation Dataset:**
- **Size**: 100 batches × 128 samples = 12,800 GP functions
- **Context points**: Fixed at 8 (for consistent evaluation)
- **Buffer points**: 0 (no buffer for test)
- **Target points**: 256
- **Total points per function**: 264 points
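
The point counts and dataset sizes listed above follow directly from the specifications. The short sketch below recomputes them, including the ranges for the 8-buffer and 4-buffer versions, which are derived here rather than quoted from the project.

```python
# Values taken from the training and test specifications above.
CONTEXT_CHOICES = [4, 8, 16, 32, 48, 64, 128, 192]
NUM_TARGET = 256
BATCH_SIZE = 128


def total_points(num_context, num_buffer, num_target=NUM_TARGET):
    """Points per GP function: context + buffer + target."""
    return num_context + num_buffer + num_target


def train_range(num_buffer):
    """Smallest and largest point count over all context size choices."""
    totals = [total_points(c, num_buffer) for c in CONTEXT_CHOICES]
    return min(totals), max(totals)


for num_buffer in (16, 8, 4):
    print(num_buffer, train_range(num_buffer))
# 16 (276, 464)
# 8 (268, 456)
# 4 (264, 452)

print(total_points(num_context=8, num_buffer=0))  # 264 (test set)
print(10_000 * BATCH_SIZE, 100 * BATCH_SIZE)      # 1280000 12800
```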

##### GP Hyperparameters

All datasets use the following GP configuration for reproducibility:

```yaml
# Kernel mixture
kernels: ["rbf", "matern32", "matern52"]
kernel_weights: [0.4, 0.3, 0.3]

# GP parameters
input_range: [[-2.0], [2.0]]  # 1D input
lengthscale_range: [0.1, 1.0]
variance_range: [0.5, 1.5]
noise: 0.0  # Noiseless GPs

# Numerical stability
dtype: float64  # Double precision
jitter: 1e-4    # Higher than default for Cholesky stability
seed: 42        # Fixed seed for reproducibility
```
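
To illustrate how these settings fit together, here is a minimal pure-Python sketch of drawing one noiseless GP function. It is not the project's generator. It assumes that one kernel is picked per function according to `kernel_weights`, that lengthscale and variance are drawn uniformly from their ranges, and that jitter is added to the covariance diagonal before the Cholesky factorization. All function names are hypothetical.

```python
import math
import random

KERNEL_NAMES = ["rbf", "matern32", "matern52"]
KERNEL_WEIGHTS = [0.4, 0.3, 0.3]
JITTER = 1e-4


def kernel(name, r, lengthscale, variance):
    """Stationary kernel value for two inputs at distance r >= 0."""
    s = r / lengthscale
    if name == "rbf":
        return variance * math.exp(-0.5 * s * s)
    if name == "matern32":
        a = math.sqrt(3.0) * s
        return variance * (1.0 + a) * math.exp(-a)
    if name == "matern52":
        a = math.sqrt(5.0) * s
        return variance * (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError(f"unknown kernel: {name}")


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("not positive definite; increase jitter")
                lower[i][j] = math.sqrt(acc)
            else:
                lower[i][j] = acc / lower[j][j]
    return lower


def sample_gp_function(num_points, rng):
    """Draw one noiseless GP function on random 1D inputs in [-2, 2]."""
    name = rng.choices(KERNEL_NAMES, weights=KERNEL_WEIGHTS, k=1)[0]
    lengthscale = rng.uniform(0.1, 1.0)  # assumption: uniform over lengthscale_range
    variance = rng.uniform(0.5, 1.5)     # assumption: uniform over variance_range
    xs = [rng.uniform(-2.0, 2.0) for _ in range(num_points)]
    cov = [[kernel(name, abs(a - b), lengthscale, variance) for b in xs] for a in xs]
    for i in range(num_points):
        cov[i][i] += JITTER  # diagonal jitter keeps the factorization stable
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(num_points)]
    # y = L z has covariance L L^T, i.e. the jittered kernel matrix.
    ys = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(num_points)]
    return name, xs, ys


rng = random.Random(42)  # fixed seed, mirroring the config above
name, xs, ys = sample_gp_function(32, rng)
print(name, len(xs), len(ys))
```

Python floats are already double precision, which matches the `float64` setting. A real pipeline would use a vectorized linear algebra library for the kernel matrix and factorization.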

##### Generation Commands

**Local Generation:**
```bash
# Generate all datasets (train + test)
./scripts/generate_gp_datasets.sh
```

**Manual Generation (for custom configurations):**
```bash
# Training data with 16 buffer points
uv run python -m src.data.generate_offline_data \
    --config-name offline_data_gp_highprecision \
    output_dir=data/gp_128batch_16buf_256tar \
    generation.num_batches=10000 \
    generation.batch_size=128 \
    generation.num_buffer=16 \
    generation.num_target=256

# Test data with fixed 8 context
uv run python -m src.data.generate_offline_data \
    --config-name offline_data_gp_highprecision \
    output_dir=data/gp_128batch_test \
    generation.num_batches=100 \
    generation.batch_size=128 \
    generation.num_context=8 \
    generation.num_buffer=0 \
    generation.num_target=256

# Create 8-buffer and 4-buffer versions
uv run python -m src.data.gp.split_buffer_datasets
```
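
The 8-buffer and 4-buffer versions are described above as the first 8 and first 4 buffer points of the original data. The sketch below shows that truncation on nested lists so it runs without extra dependencies; it is not the code in `split_buffer_datasets`, and `truncate_buffer` is a hypothetical name. With tensors, the equivalent slice would be `xb[:, :num_buffer]`.

```python
def truncate_buffer(batch, num_buffer):
    """Return a copy of the batch keeping only the first num_buffer buffer points."""
    original = len(batch["xb"][0])
    if not 0 < num_buffer <= original:
        raise ValueError(f"num_buffer must be in 1..{original}, got {num_buffer}")
    out = dict(batch)  # shallow copy; context and target entries are shared
    out["xb"] = [task[:num_buffer] for task in batch["xb"]]
    out["yb"] = [task[:num_buffer] for task in batch["yb"]]
    return out


# Toy batch: 2 tasks, 16 buffer points each, 1D inputs and outputs.
toy = {
    "xb": [[[float(i)] for i in range(16)] for _ in range(2)],
    "yb": [[[float(-i)] for i in range(16)] for _ in range(2)],
}
for k in (8, 4):
    small = truncate_buffer(toy, k)
    print(k, len(small["xb"][0]), small["xb"][0][-1])
# 8 8 [7.0]
# 4 4 [3.0]
```

Because all three versions share the same context and target points, differences between runs can be attributed to the buffer size alone.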
