# LTM: Latent Table Modeling (VAE) for Tabular Data

A modular toolkit for learning latent representations of tables. It provides:
- Column-wise vectorizers for numerical, categorical, text and datetime columns
- A configurable Variational Autoencoder (VAE) stack for tables (uni/multi-modal variants)
- Data preprocessing and loading from Parquet+JSON configs or LMDB
- A training/evaluation CLI and simple programmatic APIs

This README is written for conference reviewers to quickly understand structure and usage.

## Repository Structure

```
ltm/
├── dataset/                    # Data preprocessing & loaders
│   ├── dataTransformer.py      # Build per-column transforms from JSON config
│   ├── dataset_local.py        # Parquet + JSON dataset batching utilities
│   ├── dataset_lmdb.py         # LMDB-backed dataset
│   └── transformers/           # Column transformers used for normalization/encoding
├── vectorizer/                 # Column → embedding and table-level vectorization
│   ├── TableVectorizer.py      # Orchestrates per-column vectorizers and metadata encoding
│   └── columnVectorizer/       # Column-level vectorizers
│       ├── numerical.py        # PLE/quantile-based numerical embeddings
│       ├── categorical.py      # LM-backed categorical embeddings
│       ├── text.py             # Text embeddings
│       └── datetime.py         # Date/time embeddings
├── latent/                     # Table latent model and VAE
│   ├── TableLatentModel.py     # High-level API: prepare_data/train/encode/decode
│   ├── utils.py                # Visualization utilities
│   └── vae/perceive/           # VAE models and trainer
│       └── base.py / trainer.py / perceive*.py
├── pipeline_trainVAE.py        # CLI: training, checkpointing, reconstruction
├── testVAE.sh                  # Example CLI invocation
└── README.md                   # This file
```

## Installation

Python 3.9+ recommended.

```bash
python -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # choose CPU/CUDA per system
pip install pandas numpy scikit-learn transformers pyyaml tqdm lmdb msgpack matplotlib pyarrow
```

## Data Formats

### Parquet + JSON (local)
- Place Parquet files in: `<DATA_ROOT>/parquet/*.parquet`
- Place matching JSON configs in: `<DATA_ROOT>/config/*.json`
- Each pair shares the same basename, e.g. `mytable.parquet` and `mytable.json`.
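
The basename pairing can be checked before training. The following is a minimal sketch, not part of the toolkit: `match_tables` is a hypothetical helper that assumes only the `parquet/` and `config/` layout described above.

```python
from pathlib import Path


def match_tables(data_root):
    """Pair Parquet files with JSON configs by shared basename.

    Returns (paired, missing_config, missing_parquet), each a sorted
    list of basenames. Hypothetical helper, not part of the ltm package.
    """
    root = Path(data_root)
    parquet = {p.stem for p in (root / "parquet").glob("*.parquet")}
    configs = {p.stem for p in (root / "config").glob("*.json")}
    return (
        sorted(parquet & configs),   # tables ready for loading
        sorted(parquet - configs),   # Parquet files without a config
        sorted(configs - parquet),   # configs without a Parquet file
    )
```

Calling `match_tables("/path/to/DATA_ROOT")` and inspecting the second and third lists is a quick way to spot tables that would otherwise be skipped or fail to load.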

Each JSON describes columns and types. Minimal schema:

```json
{
  "base_name": "mytable",
  "description": "short table description",
  "variables": [
    {"variable_name": "age", "variable_type": "numerical"},
    {"variable_name": "country", "variable_type": "categorical"},
    {"variable_name": "title", "variable_type": "text"},
    {"variable_name": "created_at", "variable_type": "datetime"}
  ]
}
```
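
A config of this shape can be sanity-checked with a few lines of Python. This is a sketch under stated assumptions: `validate_config` is a hypothetical helper, the set of allowed types is taken from the four column types listed in this README, and treating `base_name` and `variables` as required is our guess rather than a documented rule.

```python
import json

# Assumption: the four column types named in this README are the only valid ones.
ALLOWED_TYPES = {"numerical", "categorical", "text", "datetime"}


def validate_config(config):
    """Return a list of problems found in a table config dict (empty if none)."""
    problems = []
    for key in ("base_name", "variables"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    seen = set()
    for i, var in enumerate(config.get("variables", [])):
        name = var.get("variable_name")
        vtype = var.get("variable_type")
        if not name:
            problems.append(f"variables[{i}]: missing variable_name")
        elif name in seen:
            problems.append(f"variables[{i}]: duplicate variable_name {name!r}")
        else:
            seen.add(name)
        if vtype not in ALLOWED_TYPES:
            problems.append(f"variables[{i}]: unknown variable_type {vtype!r}")
    return problems


example = json.loads(
    '{"base_name": "mytable", "variables": '
    '[{"variable_name": "age", "variable_type": "numerical"}]}'
)
print(validate_config(example))  # prints []
```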

During preprocessing, each numerical column is fitted with either a Piecewise Linear Encoding (PLE) or a quantile transform (see `--numerical_transformation`), and each categorical column gets a `categories` list added to its entry.
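
To illustrate what a piecewise linear encoding does, here is a small standalone sketch. It is not the code in `numerical.py`: the function names are hypothetical, the bin edges are taken from quantiles of the training values, and every component is clipped to [0, 1]. Implementations differ in how they treat values outside the fitted range.

```python
from statistics import quantiles


def quantile_edges(values, n_bins):
    """Bin edges: min, the n_bins - 1 interior quantiles, and max of the values."""
    inner = quantiles(values, n=n_bins, method="inclusive")
    edges = [min(values), *inner, max(values)]
    # Drop repeated edges so that every bin has positive width.
    return sorted(set(edges))


def ple_encode(x, edges):
    """Clipped piecewise linear encoding of scalar x over the bins in edges.

    Bins entirely below x encode as 1.0, bins entirely above x as 0.0,
    and the bin containing x encodes the fraction of the bin covered.
    """
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        frac = (x - lo) / (hi - lo)
        encoded.append(min(1.0, max(0.0, frac)))
    return encoded


print(ple_encode(15, [0, 10, 20, 30]))  # prints [1.0, 0.5, 0.0]
```

The encoding keeps ordering information (a larger value never produces a smaller component) while giving the model one input dimension per bin.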

### LMDB
Pass `--use_lmdb` along with `--data_folder <LMDB_DIR>`. The directory must contain an LMDB database and a CSV log of keys (see `dataset_lmdb.py`).

## Quick Start (CLI)

Train from Parquet + JSON:

```bash
python ltm/pipeline_trainVAE.py \
  --data_folder /path/to/DATA_ROOT \
  --num_epochs 10 \
  --batch_size 128 \
  --d_lm 1024 --d_latent_len 64 --d_latent_width 128 \
  --encoder_depth 6 --decoder_depth 6 \
  --numerical_transformation ple \
  --checkpoint_folder /path/to/checkpoints \
  --reconstruct_folder /path/to/outputs \
  --interval_type step --scheduler_interval step \
  --max_steps 6500 --save_interval 100 --early_stop_patience 5
```

Train from LMDB:

```bash
python ltm/pipeline_trainVAE.py \
  --use_lmdb \
  --data_folder /path/to/LMDB_DIR \
  --num_epochs 10 \
  --batch_size 128 \
  --checkpoint_folder /path/to/checkpoints \
  --reconstruct_folder /path/to/outputs
```

Notes:
- Choose `--autoencoder_type` from `unimodal | multimodal | disentangled | ae`.
- For multimodal, `--combination_method` supports `poe | moe | mopoe | samopoe`.
- Device selection: `--device cuda|cpu` (defaults to CUDA if available). Set `--device cpu` to fully disable GPU.
- Distributed training is supported via environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`).
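
The three environment variables are the ones `torchrun` sets for each worker process. How `pipeline_trainVAE.py` consumes them is not shown in this README; the sketch below only illustrates a common way to read them with single-process fallbacks, and `read_dist_env` is a hypothetical helper.

```python
import os


def read_dist_env(env=None):
    """Read RANK, LOCAL_RANK and WORLD_SIZE with single-process defaults."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "rank": rank,                    # global index of this process
        "local_rank": local_rank,        # index of this process on its node
        "world_size": world_size,        # total number of processes
        "distributed": world_size > 1,   # True only for multi-process runs
        "is_main": rank == 0,            # convenient guard for logging/saving
    }
```

When none of the variables is set, the result describes an ordinary single-process run.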

### Reconstruction and Latent Export
Set `--test_construct` to:
- `1`: reconstruction
- `2`: reconstruction with latent interpolation
- `3`: random samples from prior
- `4`: reconstruction without target (assumes last variable is target)
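
For intuition about mode `2`, latent interpolation walks between the latent codes of two rows and decodes each intermediate point. The README does not state which interpolation scheme the CLI uses, so the sketch below assumes plain linear interpolation on flattened latent vectors; `lerp_latents` is a hypothetical helper.

```python
def lerp_latents(z_start, z_end, num_steps):
    """Linearly interpolate between two flat latent vectors, endpoints included."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same length")
    path = []
    for k in range(num_steps):
        alpha = k / (num_steps - 1)  # 0.0 at z_start, 1.0 at z_end
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)])
    return path


print(lerp_latents([0.0, 2.0], [1.0, 4.0], 3))
# prints [[0.0, 2.0], [0.5, 3.0], [1.0, 4.0]]
```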

Outputs (written under `<reconstruct_folder>/<experiment_name>/`, where `<reconstruct_folder>` is the value of `--reconstruct_folder`):
- `*_reconstructed.csv`, `*_original.csv`, optionally `*_target.csv`
- `*_latent.npy` with saved latent tensors
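
A short sketch for collecting these files after a run. It relies only on the file suffixes listed above; `group_outputs` and the short kind names are our own, and the part of each filename before the suffix is treated as an opaque prefix.

```python
from pathlib import Path

# Suffixes come from the output list above; the short names are ours.
OUTPUT_KINDS = {
    "_reconstructed.csv": "reconstructed",
    "_original.csv": "original",
    "_target.csv": "target",
    "_latent.npy": "latent",
}


def group_outputs(folder):
    """Map each filename prefix to the output files found for it in folder."""
    groups = {}
    for path in sorted(Path(folder).iterdir()):
        for suffix, kind in OUTPUT_KINDS.items():
            if path.name.endswith(suffix):
                prefix = path.name[: -len(suffix)]
                groups.setdefault(prefix, {})[kind] = path
                break  # a file matches at most one suffix
    return groups
```

The returned paths can then be passed to `pandas.read_csv` for the CSV files and `numpy.load` for the latent arrays.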

## Programmatic Usage

```python
import json

import pandas as pd
import torch
from ltm.latent.TableLatentModel import TableLatentModel

# Load trained model
model = TableLatentModel(autoencoder_type='multimodal', device='cpu')
ckpt = torch.load('/path/to/checkpoints/<exp>/best_model.pth', map_location='cpu', weights_only=True)
model.load_checkpoint(ckpt)

# Prepare a dataframe and its config
df = pd.read_parquet('/path/to/sample.parquet')
with open('/path/to/sample.json') as f:  # close the file handle after reading
    config = json.load(f)

# Encode to latent and decode back
Z = model.table_to_latent(df, config, batch_size=64)      # (N, L, D)
df_rec = model.latent_to_table(Z, config, batch_size=64)  # reconstructed table
```

## Tips & Troubleshooting
- Use `--save_output --log_file output.log` to capture logs.
- `--resume_from_checkpoint` continues training from the latest checkpoint in `--checkpoint_folder`.
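- Check that each JSON config declares the correct `variable_type` for every column. The type determines which transform and vectorizer a column receives, so a mislabeled column (for example, a numerical column declared as `categorical`) can destabilize training.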

## License

This project inherits its license from the parent repository’s `LICENSE` file.
