## Setup

Set up the environment with [pixi](https://pixi.sh/latest/installation/):

```bash
$ pixi install
```

## Getting Started

- The `SyntheticDataset` class requires only a `seed` and a `Config` object, which holds the `database`-, `scm`-, and `dag`-level parameters used for sampling. See the example below.

```py
from rt.synthetic.dataset import SyntheticDataset
from rt.synthetic.config import Config

# create config with default values/choices for params
config = Config()
seed = 0

# create relbench compatible dataset
dataset = SyntheticDataset(seed=seed, config=config)

# create database which can be cached via relbench APIs
db = dataset.make_db()
```

### Config

The `Config` object is a collection of constants and `Choices`, from which values are sampled uniformly at random. For instance, `scm_layout_choices` indicates that the layout is selected at random from the four listed options. `Choices` also supports sampling from a `"range"` of values (see `scm_col_node_perc_choices` in the snippet below).

```py
@dataclass(frozen=True)
class SCMParams:
    ...
    scm_layout_choices: Choices = Choices(
        kind="set",
        value=["ErdosRenyi", "BarabasiAlbert", "RandomTree", "ReverseRandomTree"],
    )
    scm_col_node_perc_choices: Choices = Choices(kind="range", value=[0.6, 0.9])
```
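To make the two sampling modes concrete, here is a minimal sketch of how a `Choices`-like object could draw values. `SketchChoices` and its `sample` method are hypothetical stand-ins written for illustration; the actual `Choices` class in `rt.synthetic.config` may expose a different interface, and treating a `"range"` as a continuous interval is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class SketchChoices:
    """Hypothetical stand-in for `Choices`; not the class shipped in `rt`."""

    kind: str          # "set" or "range"
    value: List[Any]   # the options for "set", or [low, high] for "range"

    def sample(self, rng: random.Random) -> Any:
        if self.kind == "set":
            # Pick one of the listed options uniformly at random.
            return rng.choice(self.value)
        if self.kind == "range":
            # Assumption: a range is a continuous interval [low, high].
            low, high = self.value
            return rng.uniform(low, high)
        raise ValueError(f"unknown kind: {self.kind!r}")


rng = random.Random(0)  # seeded so that draws are reproducible
layouts = ["ErdosRenyi", "BarabasiAlbert", "RandomTree", "ReverseRandomTree"]
layout = SketchChoices(kind="set", value=layouts).sample(rng)
col_node_perc = SketchChoices(kind="range", value=[0.6, 0.9]).sample(rng)
print(layout, col_node_perc)
```

Passing a seeded generator into `sample` keeps every draw tied to a single seed, which is one plausible way to make a whole sampled configuration reproducible.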

## Experiments


```bash
# Generate synthetic databases starting at a seed offset; any range works.
$ pixi run python -m rt.synthetic.gen --seed_offset=14000 --num_dbs=2000

# Baseline pretraining on real-world data.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/baseline_pretrain.py

# Synthetic pretraining. Note: the full grid is computationally expensive.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/synthetic_pretrain.py

# Continued pretraining, starting from synthetic-pretraining checkpoints.
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/cntd_pretrain.py
```
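If data generation is split across several jobs, each job needs its own disjoint seed range. The sketch below prints one generation command per job. It assumes, without confirmation from the source, that `--seed_offset=S --num_dbs=N` covers seeds `S` through `S + N - 1`; `shard_seed_ranges` is a hypothetical helper and not part of `rt`.

```python
from typing import List, Tuple


def shard_seed_ranges(start: int, total: int, num_jobs: int) -> List[Tuple[int, int]]:
    """Split `total` seeds beginning at `start` into contiguous, disjoint
    (offset, count) pairs, one per job. Hypothetical helper for illustration."""
    base, extra = divmod(total, num_jobs)
    ranges = []
    offset = start
    for job in range(num_jobs):
        # The first `extra` jobs take one additional seed each.
        count = base + (1 if job < extra else 0)
        if count:
            ranges.append((offset, count))
        offset += count
    return ranges


# Assumption: each job generates seeds offset .. offset + count - 1.
for offset, count in shard_seed_ranges(14000, 2000, 4):
    print(f"pixi run python -m rt.synthetic.gen --seed_offset={offset} --num_dbs={count}")
```

Contiguous shards keep the combined run equivalent to a single invocation over the full range, provided the assumption about how the two flags interact holds.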