

# Dataset Overview

This document gives an overview of our dataset for symbolic regression tasks. It has three parts: real-world IWLS (circuits/SAT/...), real-world BBM (biology), and scalable synthetic logic networks. Each part contains a collection of truth tables saved in folders named `in{IN}_out{OUT}`, where `IN` and `OUT` are the numbers of inputs and outputs. The synthetic data also includes the corresponding logic networks, which may be reused in logic synthesis algorithms.

## Folder Structure

```
dataset/
├── real_world_BBM/
│   ├── scripts/
│   │   ├── read_sbml.py
│   │   ├── update.py
│   │   └── vis_sbml.py
│   ├── tseq_truth_tables/
│   │   ├── noise_00.zip
│   │   ├── noise_01.zip
│   │   └── noise_05.zip
│   └── readme.md
├── real_world_IWLS/
│   ├── IWLS.zip
│   └── iwls_data_converter.py
├── synthetic_networks/
│   ├── and_not/
│   │   ├── in5_out5.zip
│   │   ├── in10_out10.zip
│   │   └── ...
│   ├── and_not_or/
│   │   ├── in5_out5.zip
│   │   ├── in10_out10.zip
│   │   └── ...
│   ├── composition_analysis/
│   ├── generation_scripts/
│   └── readme.md
└── readme.md
```


---

## Dataset Composition

#### Synthetic Logic Network Dataset (`synthetic_networks/`)

We present a large-scale dataset of **synthetically generated logic networks**, categorized into two types: **AN** (networks constructed using the logical connectives `{AND, NOT}`) and **ANO** (networks constructed using `{AND, NOT, OR}`). The dataset spans a wide range of network sizes, from small circuits with 15 nodes to large-scale networks containing over 2600 nodes.

**File Structure**

The file structures and scales of AN and ANO are identical. The only difference is the **additional column** in the ANO files, described at the end of the AIGER format part under File Formats. Each archive (AN/ANO) contains subfolders grouped by input/output size, organized as follows:

```
in{IN}_out{OUT}/
├── and_{X}/
│   ├── aag/
│   │   ├──  *.aag
│   │   └── ...
│   ├── truth/
│   │   ├──  *.truth
│   │   └── ...
│   ├── pic/   # optional, not provided in submission due to limited space. visualization scripts provided
│   │   ├──  *.html
│   │   └── ...
```

where `X` denotes the number of internal gates, the `aag` folder holds the generated network structures in our modified AIGER format, the `truth` folder holds the corresponding truth tables, and the `pic` folder holds HTML visualizations of the network structures.

**Dataset Scale (for each of AN and ANO)**

The dataset scales we generated for testing are as follows:

| in   | out  | num_gates         | networks generated (per gate count) | truth table length exponent q (length 2^q) |
| ---- | ---- | ----------------- | ----------------------------------- | ------------------------------------------ |
| 5    | 5    | 10 / 20 / 40      | 100 / 100 / 100                     | 5 (full)                                   |
| 10   | 10   | 40 / 80 / 160     | 100 / 100 / 100                     | 10 (full)                                  |
| 20   | 20   | 160 / 320 / 640   | 100 / 100 / 100                     | 15                                         |
| 40   | 40   | 320 / 640 / 1280  | 100 / 100 / 50                      | 15                                         |
| 80   | 80   | 640 / 1280 / 2560 | 50 / 50 / 50                        | 15                                         |

The `.truth` files are full or partial truth tables with up to 2^15 samples per network, which balances completeness and scalability; the SR methods we tested cannot process longer truth tables such as 2^20. The scales listed here use equal numbers of inputs and outputs for a clear presentation, but our algorithm can generate logic networks and truth tables of any scale as needed.

**Generation Scripts**

The generation scripts are provided in `dataset/synthetic_network.zip/scripts`, and the details for configuring and running them are given in `README_synthetic_scripts.md`.

---

#### Real World Biological Boolean Models (`real_world_BBM/`)

This dataset is derived from the BBM dataset, which provides a set of real-world biological Boolean networks. Due to limited space, we only provide an example dataset, **`noise_00`**, corresponding to 0% noise. The full dataset covers noise levels {0%, 1%, 5%} and can be generated with the provided scripts.

**File Structure**

```
real_world_BBM/
├── tseq_truth_tables/
│   ├── noise_00.zip
│   │   ├── in{IN}_out{OUT}/
│   │   │   ├── xx.truth
│   │   │   └── ...
├── scripts
│   ├── update.py
│   ├── read_sbml.py
│   └── vis_sbml.py
```

**Simulation Modes**

You can choose between two evaluation modes, `safe` and `eval` (fast):

- `safe`: Uses Python's AST to safely evaluate Boolean expressions (secure but slightly slower)
- `eval`: Uses Python `eval` (faster but less secure; avoid with untrusted input)
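
As an illustration of the `safe` idea, the following is a minimal sketch of AST-based evaluation. It is not the code used in `update.py`. The function name `safe_eval_bool` is hypothetical, and the sketch assumes the rule has already been written with Python's `and`, `or`, and `not` keywords.

```python
import ast


def safe_eval_bool(expr: str, values: dict) -> bool:
    """Evaluate a Boolean expression built only from and/or/not and names.

    Hypothetical helper: anything else (calls, attributes, arithmetic)
    is rejected instead of being executed.
    """
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            results = [walk(v) for v in node.values]
            if isinstance(node.op, ast.And):
                return all(results)
            return any(results)  # the only other BoolOp operator is ast.Or
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.Name):
            return bool(values[node.id])
        if isinstance(node, ast.Constant) and isinstance(node.value, (bool, int)):
            return bool(node.value)
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))
```

Because only a whitelist of node types is interpreted, an input such as `__import__('os')` raises `ValueError` rather than running, which is the property that plain `eval` lacks.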

---

**Simulation details:**

- For **networks with <12 variables**: exhaustively enumerate all input combinations; outputs are randomly flipped according to the noise level.
- For **networks with ≥12 variables**: simulate 10,000 time steps; detect steady states (no change for 6 steps) and reinitialize; forced reinitialization every 200 steps to avoid cycles.
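
The time-series procedure for larger networks can be sketched roughly as follows. This is an illustration under stated assumptions, not the provided script: the function name `simulate_time_series`, the representation of rules as Python callables, and the use of synchronous updates are all assumptions, and noise injection is left out.

```python
import random


def simulate_time_series(rules, steps=10_000, steady_window=6,
                         reset_every=200, seed=0):
    """Collect (state, next_state) pairs from a Boolean network.

    rules: dict mapping each variable name to a function state -> bool
           (hypothetical representation of the update rules).
    """
    rng = random.Random(seed)
    names = list(rules)

    def random_state():
        return {n: rng.randint(0, 1) for n in names}

    state = random_state()
    unchanged = 0    # consecutive steps in which the state did not change
    since_reset = 0  # steps since the last reinitialization
    samples = []
    for _ in range(steps):
        # Synchronous update: every variable reads the same current state.
        nxt = {n: int(bool(rules[n](state))) for n in names}
        samples.append(([state[n] for n in names], [nxt[n] for n in names]))
        unchanged = unchanged + 1 if nxt == state else 0
        since_reset += 1
        if unchanged >= steady_window or since_reset >= reset_every:
            # Steady state detected or forced reset: restart from a random state.
            state = random_state()
            unchanged = 0
            since_reset = 0
        else:
            state = nxt
    return samples
```

The forced reset matters for networks that fall into a cycle rather than a fixed point, since the steady-state check alone would never trigger there.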

**Dataset Scale:** 

- 245 models, three noise levels: 0%, 1%, 5% (only 0% noise sample shown here due to limited space).
- Note: `IN` in the `in{IN}` folder name is the number of real inputs to the network (free variables). However, the update rules in the models express each variable in terms of other internal variables as well as inputs. In the provided truth tables we therefore treat all variables as both inputs and outputs, which tests the SR methods' ability to identify irrelevant variables and results in `{OUT}*2` rows in each truth table.

**Scripts**

These are data conversion scripts for reproducing the truth tables from the raw `model.bnet` files in BBM. To use them, first download the original `models` folder from the BBM dataset and place it in the same folder as the scripts. Then run `update.py`; the results are saved automatically. The scripts perform logic simulation on BBM and generate outputs in two different formats:

- `update.py`: Saves each model and its corresponding truth table in a single folder **without** in-out naming. The total variable count, number of inputs, input variable names, and truth table are all recorded in a single `.txt` file, providing detailed information.
- `update_inout_folder.py`: Saves outputs directly into folders named by input-output counts (e.g., `in5_out5`), making it easier for downstream methods to process. The provided **`noise_00`** files were generated using this script and serve as a more intuitive representation.

This dataset serves as a benchmark for studying robustness of biological Boolean dynamics under stochastic noise.

---

#### Real World IWLS Dataset (`real_world_IWLS/`)

The IWLS dataset is a collection of benchmark suites from the International Workshop on Logic and Synthesis (IWLS), focusing on logic learning and synthesis tasks involving Boolean functions and circuits. We collected the official IWLS Logic Synthesis Contest benchmarks (2020–2025 editions) and converted them into unified `.truth` files. Some of the original problems are distributed in `.truth` format, while others use other formats such as `.pla`. The problems for these six editions are as follows:

| Year     | Description                                                  |
| -------- | ------------------------------------------------------------ |
| **2020** | Learn single-output Boolean functions from input-output pairs. |
| **2021** | Extended to multi-output functions f: {0,1}ⁿ → {0,1}ᵐ.       |
| **2022** | Logic synthesis benchmarks with more complex constraints and formats. |
| **2023** | Continued focus on complex synthesis tasks.                  |
| **2024** | Latest repository of benchmark files used in the 2024 contest. |
| **2025** | Two benchmark sets: one from 2022 and one with practical, unmodified circuit cases. |

**File Structure**

We provide truth tables for a wide range of real-world logic problems by converting the IWLS benchmarks to a unified truth table format. The file structure of the provided data is as follows:

```
real_world_IWLS/
├── IWLS.zip
│   ├── in{IN}_out{OUT}/
│   │   ├── *.truth
│   │   └── ...
├── iwls_data_converter.py
```

where `in{IN}_out{OUT}` denotes the number of inputs (`IN`) and outputs (`OUT`) of the networks inside, and each `.truth` file corresponds to a single logic network.

**Scripts**

We provide a file named `iwls_data_converter.py` to convert original IWLS data to the unified truth files.

## Summary

| Dataset   | Description | Scale |
|-----------|-------------|-------------|
| synthetic | Scalable, large-scale AN/ANO logic networks | 2600 |
| BBM       | Biodivine Boolean Models (BBM) truth tables | 245 |
| IWLS      | IWLS Logic Synthesis Contest (2020-2025), all converted to a unified `.truth` format | < 500 |

---

## File Formats

**`.truth` File Format Description**

Each `.truth` file contains a partial or full truth table of a logic network, structured as follows:

- The first **I rows** correspond to the **input combinations** (`I = number of input variables`).
- The next **N rows** correspond to the **output combinations** (`N = number of output variables`).
- Each row is a binary string of length `L`, where `L = min(2^I, partial_preserved)`:
    - `2^I` is the full number of input combinations.
    - `partial_preserved` is a pre-defined limit for large logic networks (e.g., 2^10, 2^15, 2^20).

This format ensures compact storage for large networks while preserving sufficient functional information for analysis.
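
The layout above can be read with a few lines of Python. This is a minimal sketch, assuming one binary string per line with no header or separators; the function name `parse_truth` is hypothetical, and the number of inputs is passed in by the caller (for example, taken from the `in{IN}_out{OUT}` folder name).

```python
def parse_truth(text: str, num_inputs: int):
    """Turn the rows of a .truth file into (inputs, outputs) samples.

    Each row holds one variable across all samples, so column j of the
    input rows is the j-th input assignment and column j of the output
    rows is the matching output vector.
    """
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows must have the same length L")
    if any(set(r) - {"0", "1"} for r in rows):
        raise ValueError("rows must be binary strings")
    inputs, outputs = rows[:num_inputs], rows[num_inputs:]
    length = len(rows[0]) if rows else 0
    return [
        ([int(r[j]) for r in inputs], [int(r[j]) for r in outputs])
        for j in range(length)
    ]
```

For a file on disk, the text can be obtained with `pathlib.Path(path).read_text()` before calling the function.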



**`.sbml` File Format**

The `.sbml` file (Systems Biology Markup Language) provides **metadata** for the biological network, including:

- The list of **input variables** (externally controlled nodes).
- Additional annotations and information for biological context.

In our workflow, the `.sbml` file is mainly used to extract the **input node names**, which are required for truth table generation.



**`.bnet` File Format**

The `.bnet` file defines the **Boolean update rules** for each node in a biological network. Each line specifies the logic equation of a node based on its regulators. 

The `.bnet` file describes the structure and dynamics of the network and is the core input for BBM simulation.
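
To make the line structure concrete, here is a minimal parsing sketch. It assumes the common `.bnet` layout of a `targets, factors` header followed by `node, expression` lines, with `#` starting a comment line; the function name `parse_bnet` is hypothetical and the expressions are kept as raw strings.

```python
def parse_bnet(text: str) -> dict:
    """Map each node name to its update-rule string."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        target, _, factor = line.partition(",")
        target, factor = target.strip(), factor.strip()
        if target.lower() == "targets":
            continue  # header row "targets, factors"
        rules[target] = factor
    return rules
```

The resulting dictionary gives one rule per node, which is the information the simulation needs to compute the next state.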



**AIGER Format (`.aag`/`.aig`)**

The AIGER format is a textual representation of And-Inverter Graphs (AIGs), widely used for hardware verification and synthesis. The core part of the file follows this structure:

- The **header**: `aag M I L O A`, where
  - `M` is the maximum variable index
  - `I` is the number of inputs
  - `L` is the number of latches (often 0)
  - `O` is the number of outputs
  - `A` is the number of AND gates
- **Input lines** (I lines): each line contains one literal ID for an input
- **Output lines** (O lines): each line contains one literal ID for an output
- **AND gate lines** (A lines): each line defines an AND gate using three literals:
   `lhs rhs0 rhs1`
   where `lhs` is the output literal of the gate, and `rhs0` and `rhs1` are its two inputs. Inversion is encoded by making the literal odd (i.e., `x ⊕ 1`).
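
The structure above can be evaluated directly. The following is a minimal sketch for latch-free `.aag` text with standard three-column gate lines; the function name `eval_aag` is hypothetical, and the sketch assumes each gate appears after the gates it depends on.

```python
def eval_aag(text: str, input_bits):
    """Evaluate a combinational ASCII AIGER circuit on one input vector."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    tag, _max_var, n_in, n_latch, n_out, n_and = lines[0][:6]
    if tag != "aag" or int(n_latch) != 0:
        raise ValueError("expected a latch-free 'aag' file")
    n_in, n_out, n_and = int(n_in), int(n_out), int(n_and)
    in_lits = [int(row[0]) for row in lines[1:1 + n_in]]
    out_lits = [int(row[0]) for row in lines[1 + n_in:1 + n_in + n_out]]
    start = 1 + n_in + n_out
    gates = [tuple(map(int, row[:3])) for row in lines[start:start + n_and]]

    value = {0: 0}  # variable index -> bit; variable 0 is constant false
    for lit, bit in zip(in_lits, input_bits):
        value[lit >> 1] = int(bit)

    def lit_value(lit):
        # literal = 2 * variable + inversion bit
        return value[lit >> 1] ^ (lit & 1)

    # Assumption: gates are listed in topological order.
    for lhs, rhs0, rhs1 in gates:
        value[lhs >> 1] = lit_value(rhs0) & lit_value(rhs1)
    return [lit_value(lit) for lit in out_lits]
```

For example, the text `aag 3 2 0 1 1`, `2`, `4`, `6`, `6 2 4` (one item per line) describes a single AND of two inputs, and changing the output literal from `6` to `7` turns it into a NAND.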

---

**Our Modification:**

In our modified version (used in ANO), we keep the same format and inversion mechanism for NOT. To extend support for both AND and OR gates, we add a **fourth column** to each gate line:

- `lhs rhs0 rhs1 type`
   where `type = 0` indicates an AND gate, and `type = 1` indicates an OR gate.

This simple extension allows us to distinguish between AND and OR gates while remaining largely compatible with the original AIGER structure.
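
A reader for the extended gate lines could look like the sketch below. The function name `eval_gate_line` is hypothetical, and treating a missing fourth column as an AND gate (so that plain AN lines still work) is an assumption of this sketch.

```python
def eval_gate_line(line: str, value: dict) -> None:
    """Evaluate one `lhs rhs0 rhs1 [type]` line and store the result.

    value maps variable index -> bit and must already contain the
    variables referenced by rhs0 and rhs1.
    """
    fields = [int(tok) for tok in line.split()]
    lhs, rhs0, rhs1 = fields[:3]
    # Assumption: no fourth column means a plain AND gate.
    gate_type = fields[3] if len(fields) > 3 else 0

    def lit_value(lit):
        # An odd literal is the inverted form of variable lit >> 1.
        return value[lit >> 1] ^ (lit & 1)

    a, b = lit_value(rhs0), lit_value(rhs1)
    if gate_type == 0:
        value[lhs >> 1] = a & b
    elif gate_type == 1:
        value[lhs >> 1] = a | b
    else:
        raise ValueError(f"unknown gate type: {gate_type}")
```

Since NOT is still carried by the odd/even literal encoding, the only new branch compared with a standard AIGER reader is the choice between `&` and `|`.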
