# Identification for Tree-Shaped Structural Causal Models

## Compiling and Running our Program
You can either compile and run our program locally, provided that `g++`, `make` and OpenSSL are installed, or use the Docker container described below. In both cases, the program is compiled with `make` or `make fast`, which produces the executable `fast.out`. `make debug` compiles with debug options such as sanitizers and produces the executable `debug.out`. OpenSSL is used to generate random primes and must therefore be installed before compiling (for example, `apt install libssl-dev` on Ubuntu). Because the code uses `__int128`, it can currently only be compiled with `g++`.

After compilation, you can execute our program like this: `./fast.out < PATH_TO_INPUT_FILE` or by writing the input directly into the standard input following the format specified in the next section. To benchmark our program on a set of files contained in one directory, you can use the shell script `test_cpp.sh` as explained below.

Of course, treeID and SEMID can also be executed locally, but since they are more complex to install, we suggest using Docker to execute them.

## Usage of our Program
Usage: `./fast.out [OPTION]...`  
Reads the input from stdin and writes to stdout.

Input format:  
$n$  
$p_1~\dots~p_{n-1}$  
$m$  
$u_0~v_0$  
$\dots$  
$u_{m-1}~v_{m-1}$  

where $n$ is the number of nodes, $m$ is the number of bidirected edges,
the nodes are numbered $0, \dots, n-1$,
$p_1, \dots, p_{n-1}$ are the directed parents of the nodes $1, \dots, n-1$,
and $\{u_0, v_0\}, \dots, \{u_{m-1}, v_{m-1}\}$ are the bidirected edges.
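
As a small worked example of this format, the following Python sketch parses an instance from a string and checks the constraints stated above. The function name `parse_instance` and the example graph are hypothetical and not part of the supplementary material. The sketch assumes that the numbers are whitespace-separated and does not rely on line breaks.

```python
def parse_instance(text):
    """Parse one instance: n, the parents p_1..p_{n-1}, m, then m bidirected edges.

    Assumption: all numbers are whitespace-separated integers.
    """
    tokens = [int(t) for t in text.split()]
    if not tokens:
        raise ValueError("empty input")
    n = tokens[0]
    if len(tokens) < n + 1:
        raise ValueError("input ends before the number of bidirected edges")
    parents = tokens[1:n]          # p_1, ..., p_{n-1}
    m = tokens[n]
    rest = tokens[n + 1:]
    if len(rest) != 2 * m:
        raise ValueError("expected exactly m pairs of endpoints")
    edges = [(rest[2 * k], rest[2 * k + 1]) for k in range(m)]

    # Nodes are numbered 0..n-1, so every parent and endpoint must be in range.
    if any(not 0 <= p < n for p in parents):
        raise ValueError("parent out of range")
    if any(not (0 <= u < n and 0 <= v < n) or u == v for u, v in edges):
        raise ValueError("invalid bidirected edge")
    return n, parents, edges


# Hypothetical instance: 4 nodes, directed edges 0->1, 1->2, 1->3,
# and the bidirected edges 0<->2 and 2<->3.
EXAMPLE = """4
0 1 1
2
0 2
2 3
"""

print(parse_instance(EXAMPLE))  # (4, [0, 1, 1], [(0, 2), (2, 3)])
```
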

Options:  
   ```
   --seed SEED        Seeds the random with SEED. Default is 42.
   --prime PRIME      Use PRIME as the prime. PRIME should be large enough, but
                      smaller than 2^60 to avoid overflows. Selects a random prime
                      of 59 bits by default.
   --verbose          Output extra information such as the rank of every missing edge
   --testing          Only output $n-1$ integers $i_1 … i_{n-1}$, $i_j∈{0,1,2}$
                      0 means unidentifiable, 1 means 1-identifiable, 2 means 2-identifiable
   --help             Display this help and exit
   ```
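
To make the `--testing` output easier to read, the codes can be mapped back to the labels listed above. The following Python sketch is only an illustration: the function name `decode_testing_output` is hypothetical, and it assumes that the integers are whitespace-separated and that the $j$-th integer belongs to node $j$.

```python
LABELS = {0: "unidentifiable", 1: "1-identifiable", 2: "2-identifiable"}


def decode_testing_output(output, n):
    """Map the n-1 codes printed with --testing to readable labels.

    Assumptions: codes are whitespace-separated, and the j-th code
    refers to node j (nodes 1..n-1; the root 0 has no code).
    """
    codes = [int(t) for t in output.split()]
    if len(codes) != n - 1:
        raise ValueError(f"expected {n - 1} codes, got {len(codes)}")
    if any(c not in LABELS for c in codes):
        raise ValueError("codes must be 0, 1 or 2")
    return {node: LABELS[code] for node, code in enumerate(codes, start=1)}


# Hypothetical output for a graph with 4 nodes.
print(decode_testing_output("2 1 0", 4))
```
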

## Usage of Docker and Execution of Tests
We provide all files together with a Docker environment in which all programs can be executed. To use it, first build the image from the provided Dockerfile by running `docker build . -t identification-for-tree-shaped-scms:1.0` in the folder that contains the Dockerfile and all other files, and then start the container with `docker run -it identification-for-tree-shaped-scms:1.0`. Because some R libraries have to be installed, building the docker image may take a few minutes.

Within the docker container, we suggest first running `make all` to build our program and the two programs that convert input files from our format to a format compatible with treeID or SEMID, respectively. treeID can be executed by running `bash run_js.sh INPUTFILE`, where `INPUTFILE` is the path to a file in the input format described earlier. The shell script `run_js.sh` automatically converts the given input file to JavaScript, runs treeID on it and measures the execution time. It might be necessary to modify or remove the command that pins the program to a specific core (`taskset --cpu-list 3`) and to change the time limit (`timeout 900`) and the memory limit (`--max_old_space_size=4096`). There is a similar script `run_r.sh` to run the half-trek criterion from SEMID.

To benchmark the programs while ignoring their outputs, there are three bash scripts `test_cpp.sh`, `test_js.sh` and `test_r.sh`. They can be executed by running `bash test_cpp.sh FOLDER_WITH_TEST_FILES`. Note that executing these scripts may take a long time. On our machine, our program took up to 45 seconds on a single test case, SEMID took up to 250 seconds on a single test case and treeID didn't terminate within the time limit or crashed on many tests.
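
For readers who prefer to time a program without the shell scripts, the following Python sketch shows one possible way to do it. It is not a transcription of `test_cpp.sh`: the function name `time_program` is hypothetical, and the default timeout of 900 seconds is merely borrowed from the `timeout 900` mentioned above.

```python
import subprocess
import time
from pathlib import Path


def time_program(cmd, folder, timeout_s=900):
    """Run cmd once per file in folder, feeding the file on stdin.

    Returns a dict mapping file names to wall-clock seconds, or to None
    if the run exceeded timeout_s. The program's output is discarded.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        start = time.perf_counter()
        try:
            with path.open("rb") as handle:
                subprocess.run(
                    cmd,
                    stdin=handle,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    timeout=timeout_s,
                    check=False,
                )
            results[path.name] = time.perf_counter() - start
        except subprocess.TimeoutExpired:
            results[path.name] = None
    return results
```

A call such as `time_program(["./fast.out", "--testing"], "eight_nodes")` would then time our program on every file in that folder. Note that this sketch does not skip the output files stored alongside the inputs in `eight_nodes`, and that wall-clock times measured this way include the process start-up.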

There are also two bash scripts `test_eight_nodes_cpp.sh` and `test_eight_nodes_js.sh` to run our program and treeID on the 879 graphs with 8 nodes each. Note that while our program is quite fast, executing treeID on all those test cases takes a very long time (longer than 11 hours for us).

## Tested programs
### Our program
Our program is a C++ implementation of the algorithm described in the paper, consisting of the files `main.cpp`, `algebra.cpp`, `algebra.h`, `identification.cpp`, `identification.h`, `random.cpp` and `random.h`. The main logic is contained in `identification.cpp`.

Our program solved all test cases we tried correctly.

### treeID
treeID is the JavaScript implementation of  
Benito van der Zander, Marcel Wienöbst, Markus Bläser, and Maciej Liskiewicz. Identification in tree-shaped linear structural causal models. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 6770–6792. PMLR, 2022. URL https://proceedings.mlr.press/v151/van-der-zander22a.html.  

We downloaded the program in its current version 3.1 on April 8, 2025 from https://github.com/jtextor/dagitty. Analogously to what is done in their `Makefile` (`dagitty/jslib/Makefile`), we combined all relevant files into a single file, `treeID.js`, which is also contained in our supplementary material. In addition, the file `underscore-min.js` has to be in the same folder to run the program. Since we didn't manage to get their parser running, we wrote a C++ program `test_to_js.cpp` that converts input files into JavaScript code defining the corresponding graph. The output of `test_to_js.cpp` is then concatenated with `treeID.js`, and the resulting program can be executed using `node`. It prints its output in a format analogous to the output of our program when run with the flag `--testing`. All of this is done automatically by the scripts `run_js.sh` and `test_js.sh`, so we suggest using these scripts to run treeID.

### HTC (half-trek criterion) from SEMID
An R implementation of the half-trek criterion from  
Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. The Annals of Statistics, 40(3):1682–1713, 2012.  
is contained in the R package SEMID. We used the current version 0.4.1.

SEMID (https://rdocumentation.org/packages/SEMID/versions/0.4.1) can be installed in R by running `install.packages('SEMID')` and is automatically installed when building the docker image from the provided `Dockerfile`. Similarly to what we did for treeID, we wrote a C++ program `test_to_r.cpp` that converts input files into R code defining the corresponding graph. The R code output by this program loads the library SEMID, defines the graph and runs the half-trek criterion on it. The output is whatever htcID prints to stdout. We suggest using the scripts `run_r.sh` and `test_r.sh` to run the half-trek criterion, as they perform all steps described above automatically.

## Considered Test Cases
### eight_nodes
This folder contains 879 input files, `input1.txt` to `input879.txt`, each containing a graph with 8 nodes, together with 879 corresponding output files `output1.txt` to `output879.txt` in the format that is output by our program when executed with the flag `--testing`. In all of these graphs, the directed edges form a path $0\rightarrow1\rightarrow\dots\rightarrow(n-1)$. None of these graphs has a missing edge to the root.

These test cases are provided by:  
Benito van der Zander, Marcel Wienöbst, Markus Bläser, and Maciej Liskiewicz. Identification in tree-shaped linear structural causal models. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 6770–6792. PMLR, 2022. URL https://proceedings.mlr.press/v151/van-der-zander22a.html.

The half-trek criterion provably cannot solve any of these test cases, see Appendix E in the paper. The execution times of our program and treeID on these test cases can be found in Figure 4 in the paper.

### large_cycle_in
This folder contains input files with graphs that have between 5 and 200 nodes. The directed edges are random in the sense that for each $i\in\{1,\dots,n-1\}$, its parent $p_i$ was chosen uniformly at random from $\{0,\dots,i-1\}$. Trees obtained like this are expected to be of logarithmic depth. The only missing edges form a cycle $1\leftrightarrow2\leftrightarrow\dots\leftrightarrow(n-1)\leftrightarrow1$. In all of the test cases, all nodes are 2-identifiable by this cycle.
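
The construction described above can be sketched in Python as follows. This is not the generator we used to create the files. All function names are hypothetical, and the sketch assumes that every pair of nodes outside the cycle, including parent-child pairs and pairs involving the root, is joined by a bidirected edge.

```python
import random
from itertools import combinations


def random_parents(n, rng):
    """Choose p_i uniformly at random from {0, ..., i-1} for i = 1..n-1."""
    return [rng.randrange(i) for i in range(1, n)]


def cycle_pairs(n):
    """Pairs of the cycle 1 <-> 2 <-> ... <-> (n-1) <-> 1, each as (small, large)."""
    nodes = list(range(1, n))
    return {
        tuple(sorted((nodes[k], nodes[(k + 1) % len(nodes)])))
        for k in range(len(nodes))
    }


def large_cycle_instance(n, seed=0):
    """Random tree plus all bidirected edges except those on the cycle."""
    if n < 4:
        raise ValueError("need at least three non-root nodes to form a cycle")
    rng = random.Random(seed)
    parents = random_parents(n, rng)
    missing = cycle_pairs(n)
    bidirected = [e for e in combinations(range(n), 2) if e not in missing]
    return parents, bidirected


def to_text(n, parents, bidirected):
    """Serialize an instance in the input format of our program."""
    lines = [str(n), " ".join(map(str, parents)), str(len(bidirected))]
    lines += [f"{u} {v}" for u, v in bidirected]
    return "\n".join(lines) + "\n"


parents, bidirected = large_cycle_instance(5, seed=0)
print(to_text(5, parents, bidirected), end="")
```

For $n$ nodes this yields $\binom{n}{2} - (n-1)$ bidirected edges, for example 6 edges for $n = 5$.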

The half-trek criterion provably cannot solve any of these test cases, see Appendix E in the paper. We still tested it on these test cases to compare the execution times. The execution times of all three programs can be found in Figure 5 in the paper.

### rand_all_zero_20_in
This folder contains input files with graphs that have between 5 and 200 nodes. The directed edges were generated using the same procedure as described for `large_cycle_in`. There are no missing edges to the root, but apart from that, we took a random subset of $20\%$ of all possible bidirected edges. Due to the large number of missing edges, in all of these test cases, there are identifying cycles of length 3, often even $1\leftrightarrow2\leftrightarrow3\leftrightarrow1$. Furthermore, in all of these test cases, all nodes are 1-identifiable because the equation of some missing edge is only satisfied for one of the options.
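
The sampling of the bidirected edges can be illustrated with the following Python sketch. It is not our original generator, and the function name `sample_bidirected_edges` is hypothetical. It rests on two assumptions that the description above leaves open: the percentage is applied to the pairs of non-root nodes only, and a subset of exactly that size is drawn uniformly at random.

```python
import random
from itertools import combinations


def sample_bidirected_edges(n, fraction, seed=0):
    """Keep every edge to the root and a random fraction of the other pairs.

    Assumptions: `fraction` refers to pairs of non-root nodes, and the
    subset has exactly round(fraction * number_of_pairs) elements.
    """
    rng = random.Random(seed)
    root_edges = [(0, v) for v in range(1, n)]          # no missing edges to the root
    other_pairs = list(combinations(range(1, n), 2))    # all pairs of non-root nodes
    k = round(fraction * len(other_pairs))
    chosen = rng.sample(other_pairs, k)
    return root_edges + sorted(chosen)


edges = sample_bidirected_edges(11, 0.2, seed=1)
print(len(edges))  # 10 root edges + 9 of the 45 remaining pairs = 19
```

Under the same assumptions, passing `0.5` or `0.9` as `fraction` corresponds to the folders `rand_all_zero_50_in` and `rand_all_zero_90_in`.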

The half-trek criterion provably cannot solve any of these test cases, see Appendix E in the paper. The execution times of treeID and our program can be found in Figure 8 in the paper.

### rand_all_zero_50_in
This folder contains input files with graphs that have between 5 and 200 nodes. The directed edges were generated using the same procedure as described for `large_cycle_in`. There are no missing edges to the root, but apart from that, we took a random subset of $50\%$ of all possible bidirected edges.

The half-trek criterion provably cannot solve any of these test cases, see Appendix E in the paper. The execution times of treeID and our program can be found in Figure 9 in the paper.

### rand_all_zero_90_in
This folder contains input files with graphs that have between 5 and 200 nodes. The directed edges were generated using the same procedure as described for `large_cycle_in`. There are no missing edges to the root, but apart from that, we took a random subset of $90\%$ of all possible bidirected edges.

The half-trek criterion provably cannot solve any of these test cases, see Appendix E in the paper. The execution times of treeID and our program can be found in Figure 10 in the paper.

### rand_normal_in
This folder contains input files with graphs that have between 5 and 200 nodes. The directed edges were generated using the same procedure as described for `large_cycle_in`. We took a random subset of $80\%$ of all possible bidirected edges, this time also allowing missing edges to the root. In all of these test cases, most of the nodes are (1-)identifiable, and all identifiable nodes are identifiable due to missing edges to the root.

The half-trek criterion solved all of these test cases. The execution times of all three programs can be found in Figure 7 in the paper.

### rand_normal_line_in
This folder contains input files with graphs that have between 5 and 200 nodes. The directed edges form a path $0\rightarrow1\rightarrow\dots\rightarrow(n-1)$. We took a random subset of $80\%$ of all possible bidirected edges, this time also allowing missing edges to the root. In all of these test cases, most of the nodes are (1-)identifiable, and all identifiable nodes are identifiable due to missing edges to the root.

The half-trek criterion solved all of these test cases. The execution times of all three programs can be found in Figure 6 in the paper.

## List of Files
- `Dockerfile`: File that describes an environment in which all programs can be executed
- `Makefile`: File that describes how to compile the C++ programs
- `README.md`: This file, containing detailed explanations of the supplementary material
- `algebra.cpp`: Part of our implementation
- `algebra.h`: Part of our implementation
- `identification.cpp`: Main part of our implementation
- `identification.h`: Part of our implementation
- `main.cpp`: Part of our implementation
- `random.cpp`: Part of our implementation
- `random.h`: Part of our implementation
- `run_js.sh`: Bash script to run treeID on a single test case, see earlier explanation
- `run_r.sh`: Bash script to run SEMID on a single test case, see earlier explanation
- `test_cpp.sh`: Bash script to run our program on all test cases in a given folder, see earlier explanation
- `test_eight_nodes_cpp.sh`: Bash script to run our program on the 879 graphs with 8 nodes each, see earlier explanation
- `test_eight_nodes_js.sh`: Bash script to run treeID on the 879 graphs with 8 nodes each, see earlier explanation
- `test_js.sh`: Bash script to run treeID on all test cases in a given folder, see earlier explanation
- `test_r.sh`: Bash script to run SEMID on all test cases in a given folder, see earlier explanation
- `test_to_js.cpp`: Program to convert input files from our format to a format compatible with treeID, can be compiled using `make test_to_js`
- `test_to_r.cpp`: Program to convert input files from our format to a format compatible with SEMID, can be compiled using `make test_to_r`
- `treeID.js`: JavaScript source code of treeID, put together in one file
- `underscore-min.js`: File that treeID needs to run
