# DFA-NSP

This repository contains data generation, language-model training, NSP labeling, and L\*-NSP learning scripts used in the project.

## Setup

1) Install PyTorch following the official instructions for your OS/CUDA setup (https://pytorch.org/get-started/locally/).

2) Install the Python requirements. The Graphviz Python binding is included in `requirements.txt`, but the system Graphviz binaries must also be available in `PATH` to render DFA diagrams:
```bash
pip install -r requirements.txt
```
If Graphviz is not already installed system-wide, install it via your package manager (e.g., `apt-get install graphviz` or `brew install graphviz`) before running the scripts.
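
To confirm that the Graphviz binaries are visible before running anything, a quick check like the following can help. This is a minimal sketch rather than part of the repository: the helper name `graphviz_available` is hypothetical, and it assumes the `dot` executable is the one used for rendering.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if graphviz_available():
        print("Graphviz 'dot' found at:", shutil.which("dot"))
    else:
        print("Graphviz 'dot' not found on PATH; DFA diagrams may fail to render.")
```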

## Example Workflow (Tomita-2)

All commands are meant to be run from the repository root (`DFA-NSP/`).

1) **Generate data**
```bash
python -m datagen.generate \
  --language tomita2 \
  --num-train 5000 \
  --num-test 1000 \
  --data-name sample_run \
  --max-context 256 \
  --max-str-len 80
```
This writes the train/eval JSONL files and the vocabulary to `data/tomita2/sample_run/`.
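
For reference, Tomita-2 is usually defined as the regular language `(10)*` over the alphabet `{0, 1}`. The sketch below shows a membership check for that language, assuming `--language tomita2` follows this standard definition; the state names and the `accepts` helper are illustrative and not taken from the repository.

```python
# Minimal DFA for (10)*, the usual definition of Tomita-2 (an assumption here).
# Any (state, symbol) pair missing from the table leads to an implicit dead state.
TRANSITIONS = {
    ("q0", "1"): "q1",  # saw '1', now a '0' must follow
    ("q1", "0"): "q0",  # completed one "10" block
}
START = "q0"
ACCEPTING = {"q0"}


def accepts(string: str) -> bool:
    """Return True if `string` consists of zero or more repetitions of "10"."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # fell into the dead state
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for s in ["", "10", "1010", "1", "01", "100"]:
        print(repr(s), accepts(s))
```

Under this definition the empty string, `10`, and `1010` are accepted, while `1`, `01`, and `100` are rejected.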

2) **Train a Transformer LM**
```bash
python train_lm.py \
  --run-name tomita2_model \
  --data-dir data/tomita2/sample_run \
  --max-steps 10000 \
  --device 0
```
The trained model is saved under `models/tomita2_model/`.

3) **Generate NSP-labeled data from the trained LM**
```bash
python nsp_data_gen.py \
  --run_name tomita2_nsp \
  --model_dir models/tomita2_model \
  --sample_strategy \
  --strategy min-p \
  --param 5e-2 \
  --num_train 1000 \
  --num_eval 1000 \
  --device 0
```
This writes NSP-labeled data to `dfa_data/nsp_data/tomita2_nsp/`.
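
The `--strategy min-p --param 5e-2` options above suggest min-p sampling. In its common formulation, min-p discards every token whose probability falls below `p` times the probability of the most likely token, then renormalizes the rest. The sketch below illustrates that rule on a toy distribution; it assumes the common formulation, and whether `nsp_data_gen.py` implements it exactly this way is not confirmed here. The function name `min_p_filter` and the example probabilities are made up for illustration.

```python
def min_p_filter(probs: dict[str, float], min_p: float = 5e-2) -> dict[str, float]:
    """Keep tokens with probability >= min_p * max probability, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Toy next-token distribution (hypothetical values, not model output).
example = {"0": 0.70, "1": 0.27, "<eos>": 0.03}
filtered = min_p_filter(example)

if __name__ == "__main__":
    # The threshold is 0.05 * 0.70 = 0.035, so "<eos>" (0.03) is dropped.
    print(filtered)
```

Because the threshold scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.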

4) **Run L\*-NSP on the labeled data**
```bash
python run_lstar_nsp.py \
  --run_name tomita2_lstar_nsp \
  --data_dir dfa_data/nsp_data/tomita2_nsp \
  --num_train 100 \
  --device 0
```
Outputs (learned DFA JSON/PNG and summary) are saved under `dfa_out/lstar_nsp/tomita2_lstar_nsp/`.
