# Shape-of-Thought: Progressive Object Assembly via Visual Chain-of-Thought

## Overview

Shape-of-Thought (SoT) is a visual Chain-of-Thought framework for **progressive shape assembly**: an object is built up step by step as a sequence of coherent 2D projections, with no external engines needed at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, so the model captures shape-assembly logic without producing explicit geometric representations.

![Image](assets/SoT.png)

## Key Contributions

1. **Shape-of-Thought (SoT):** A visual Chain-of-Thought framework that decomposes shape assembly into sequential 2D visual sub-goals, enabling a single autoregressive model to capture structure-aware compositional regularities induced by part-based CAD supervision.

2. **SoT-26K:** A large-scale dataset of 26K grounded assembly traces with an automated generation pipeline, derived from part-based CAD hierarchies.

3. **T2S-CompBench:** A hybrid benchmark combining VLM semantic judging and geometric mask stability for evaluating progressive shape assembly.

## Key Results

- **88.4%** on component numeracy (CN)
- **84.8%** on structural topology (VT)
- **~20% improvement** over text-only baselines

## Dataset

The **SoT-26K** dataset contains 25,929 assembly traces across 24 diverse object categories, each interleaving text and images. The dataset is organized by category and split:

```
SoT-26K/
├── Bottle/
│   ├── train/
│   │   ├── train-00000-of-00001.parquet
│   │   └── ...
│   ├── val/
│   └── test/
├── Chair/
│   ├── train/
│   ├── val/
│   └── test/
├── Table/
│   ├── train/
│   ├── val/
│   └── test/
└── ... (other categories: Cabinet, Lamp, Faucet, etc.)
```
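
To check a local download against the layout above, the shards can be enumerated by category and split. This is a minimal sketch that assumes only the `<Category>/<split>/*.parquet` directory pattern shown in the tree; `list_shards` is a hypothetical helper, not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def list_shards(root):
    """Map (category, split) -> sorted list of parquet shard paths under root."""
    shards = defaultdict(list)
    # Assumed layout: <root>/<Category>/<split>/<name>.parquet
    for path in sorted(Path(root).glob("*/*/*.parquet")):
        category, split = path.parts[-3], path.parts[-2]
        shards[(category, split)].append(path)
    return dict(shards)
```

Calling `list_shards("SoT-26K")` returns a dictionary whose keys look like `("Bottle", "train")`. An individual shard can then be opened with `pandas.read_parquet`; the column names are not documented here, so inspect them before relying on any particular field.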

## Repository Structure

```
├── modeling/              # Model implementations
│   └── bagel/            # Unified multimodal autoregressive model architecture
├── data/                  # Dataset implementations
│   ├── interleave_datasets/  # Interleaved text-image datasets
│   │   └── sot_dataset.py   # SoT dataset loader
│   └── parquet_utils.py   # Parquet data utilities
├── scripts/               # Training and evaluation scripts
│   └── train.sh
├── SoT-26K/              # Dataset files (download separately)
├── assets/               # Visualization assets
├── train/                # Training utilities
└── sot_inference.py      # Main inference script
```

## Model Architecture

SoT is built on a unified multimodal autoregressive Transformer (see `modeling/bagel/`). The model uses:

- **Early-fusion multimodal Transformer** for interleaved text-image processing
- **Rectified-flow velocity predictor** for visual generation in VAE latent space
- **Hard routing** with dedicated generation and understanding experts
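
As a rough illustration of hard routing, the toy sketch below sends each token to exactly one expert based on a fixed per-token flag, rather than on a learned gate. It assumes that tokens belonging to visual generation go to the generation expert and all other tokens go to the understanding expert; `hard_route` and the toy experts are hypothetical and do not mirror the code in `modeling/bagel/`.

```python
def hard_route(tokens, is_generation, understanding_expert, generation_expert):
    """Send each token to exactly one expert, chosen by its flag, keeping order."""
    if len(tokens) != len(is_generation):
        raise ValueError("tokens and is_generation must have the same length")
    return [
        generation_expert(tok) if gen else understanding_expert(tok)
        for tok, gen in zip(tokens, is_generation)
    ]


# Toy experts standing in for two separate sets of Transformer weights.
def toy_understanding(tok):
    return ("und", tok)


def toy_generation(tok):
    return ("gen", tok)


if __name__ == "__main__":
    tokens = ["Build", "a", "<latent_0>", "<latent_1>"]
    flags = [False, False, True, True]
    print(hard_route(tokens, flags, toy_understanding, toy_generation))
```

Because the assignment is deterministic, no router parameters are trained and every token has a known expert ahead of time.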

### Setup

```bash
conda create -n sot python=3.10 -y
conda activate sot
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
```

### Download checkpoint
The checkpoint will be released upon acceptance. Until then, you can use the pretrained unified multimodal autoregressive model as the base model and fine-tune it on the SoT-26K sample datasets.


### Inference

![Image](assets/results.png)

The inference script (`sot_inference.py`) natively interleaves textual and visual reasoning for progressive shape assembly. To customize it for your use case:

##### 1. Model Checkpoint Path

Update the checkpoint path to point to your model:

```python
checkpoint_dir = "/path/to/your/HF_HOME/models/"
```

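Alternatively, build the path from a local `HF_HOME` directory:

```python
HF_HOME = os.environ["HF_HOME"]  # needs `import os`; assumes HF_HOME is set in your shell
checkpoint_dir = f"{HF_HOME}/models/"
```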

##### 2. Setting up prompts

Edit the prompt in `sot_inference.py` (around lines 196-207):

**SoT example prompts for shape assembly:**

```python
sot_prompts = [
    "Build a rectangular container with a wide open top, thick vertical sides, and a recessed base, featuring smooth edges and a uniformly solid appearance.",
    "Construct a compact round planter with a sturdy cylindrical base filled with soil, topped by tall, slender leaves extending upward in a clustered arrangement.",
    "Create a rounded ceramic pot with a smooth, bulbous body and a thick, rolled rim circling the wide opening.",
    "Construct a sleek, S-shaped chair with a continuous curved base forming a seamless transition into the seat and back, featuring a single smooth surface design without visible joints or separations.",
    "Construct a three-headed chandelier with curved lamp arms extending from a central unit, each supporting a wide, tapered lampshade, connected by a vertical chain for suspension.",
    "Construct a rectangular ping pong table featuring a smooth, flat tabletop surface with a centered net divider, supported by four tapered legs connected by bar stretchers for stability.",
]

prompt = sot_prompts[0]  # Select a prompt from the list
```

##### 3. Inference Parameters

You can adjust the generation parameters in the `inference_hyper` dictionary:

```python
inference_hyper = dict(
    do_sample=True,
    text_temperature=0.3,
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,
    cfg_interval=[0.0, 1.0],
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",
)
```
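
To give some intuition for `timestep_shift`, `num_timesteps`, `cfg_text_scale`, and `cfg_interval`, here is a minimal sketch of two formulas commonly used with rectified-flow samplers: a shifted timestep schedule and classifier-free guidance restricted to a time interval. It assumes the script follows these common formulations, which this README does not confirm. Both function names are hypothetical, and the sketch ignores `cfg_img_scale` and the `cfg_renorm_*` options.

```python
def shifted_timesteps(num_timesteps, timestep_shift):
    """Uniform grid from 1 to 0, warped by t' = s*t / (1 + (s - 1)*t)."""
    if num_timesteps < 2:
        raise ValueError("num_timesteps must be at least 2")
    grid = [1.0 - i / (num_timesteps - 1) for i in range(num_timesteps)]
    return [timestep_shift * t / (1.0 + (timestep_shift - 1.0) * t) for t in grid]


def guided_velocity(v_cond, v_uncond, scale, t, cfg_interval):
    """Classifier-free guidance, applied only while t lies inside cfg_interval."""
    low, high = cfg_interval
    if not (low <= t <= high):
        return list(v_cond)  # outside the interval: plain conditional prediction
    # v = v_uncond + scale * (v_cond - v_uncond)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

Under this formulation, a shift above 1 keeps the endpoints at 1 and 0 but spends more of the steps at high noise levels, and a guidance scale of 1 reduces to the conditional prediction.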

##### 4. Running Inference

```bash
python sot_inference.py
```

The script will:
- Generate interleaved textual reasoning and visual intermediate states
- Save step-by-step images in `reasoning_output_<timestamp>/images/`
- Save reasoning metadata to `reasoning_output_<timestamp>/reasoning_data.json`
- Generate a final complete image at the end
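
The outputs listed above can be collected for inspection with a few lines of Python. This sketch assumes only the `reasoning_output_<timestamp>/` layout described in the list; `load_run` and `latest_run` are hypothetical helpers, and the structure of `reasoning_data.json` is not documented here, so it is loaded as-is.

```python
import json
from pathlib import Path


def load_run(run_dir):
    """Return (metadata, image_paths) for one reasoning_output_<timestamp> directory."""
    run_dir = Path(run_dir)
    with open(run_dir / "reasoning_data.json", encoding="utf-8") as f:
        metadata = json.load(f)  # schema not documented here; inspect before use
    images = sorted(p for p in (run_dir / "images").iterdir() if p.is_file())
    return metadata, images


def latest_run(base_dir="."):
    """Pick the most recently modified reasoning_output_* directory, or None."""
    runs = [p for p in Path(base_dir).glob("reasoning_output_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

A typical use is `metadata, images = load_run(latest_run())` after a run finishes; check that `latest_run()` did not return `None` first.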

For details, see the inference script itself: [sot_inference.py](sot_inference.py).

#### Example Use Cases

```python
# Shape assembly task
prompt = "Build a rectangular container with a wide open top, thick vertical sides, and a recessed base"
```

### Training
For training on SoT-26K, run:

```bash
bash scripts/train.sh
```


The loader for the interleaved Shape-of-Thought reasoning data is implemented in [sot_dataset.py](data/interleave_datasets/sot_dataset.py).

