This code is shared anonymously with the reviewers of NeurIPS'24 submission "Generative Forests".

This code must not be shared outside the review process, or by reviewers to other parties and must not be kept beyond the review process 
of NeurIPS'24.

This code is provided without any warranty. Use it at your own risk.

====================================================================================================================================================================

** General compilation & run:

Before anything, compile *twice*:

* run: ./compile_all.sh (run it twice to get rid of the "classes not found" errors that can appear on the first run -- this may leave some warnings, but nothing that prevents the code from running)

for extensive details about options, run:

* run: "java Wrapper --help" to get a summary of the command line and explanations of parameters

====================================================================================================================================================================

** Datasets format:

Important: datasets just need to be in .csv format; there is no need to explicitly describe feature types (as in e.g. .arff files) since our software 
recognises variable types automatically (note: types can also be enforced; see the --help)
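
To make the automatic recognition concrete, here is a minimal sketch (ours, not the submission's actual code) of how a feature's type could be inferred from its raw CSV values; the class, enum, and method names are illustrative assumptions.

```java
import java.util.List;

// Illustrative sketch of automatic type inference from raw CSV values:
// a column is NOMINAL if any value fails to parse as a number, INTEGER if
// all parsed values are whole numbers, CONTINUOUS otherwise.
public class TypeInferenceSketch {
    enum FeatureType { CONTINUOUS, INTEGER, NOMINAL }

    static FeatureType inferType(List<String> values) {
        boolean allInteger = true, allNumeric = true;
        for (String v : values) {
            if (v.isEmpty()) continue;           // missing entries carry no type information
            try {
                double d = Double.parseDouble(v);
                if (d != Math.rint(d)) allInteger = false;
            } catch (NumberFormatException e) {
                allNumeric = false; allInteger = false;
            }
        }
        if (!allNumeric) return FeatureType.NOMINAL;
        return allInteger ? FeatureType.INTEGER : FeatureType.CONTINUOUS;
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("0.455", "0.35", "")));  // CONTINUOUS
        System.out.println(inferType(List.of("M", "F", "I")));        // NOMINAL
    }
}
```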

====================================================================================================================================================================

** Experiment: Missing Data Imputation (MDA)

* Dataset abalone is provided in directory Datasets, formatted for MDA: 5 directories Set_*, one per fold. In each fold, a file 
abalone_impute5.csv simulates the whole domain with 5% missing data (MCAR, see submission). This is the file used by our software to 
(i) train a generative model, and then 
(ii) use it directly to impute data from the training file (do the same for ARF). Perror / RMSE (Table 6) are computed by comparison 
with the original domain "abalone.csv".
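
As a hedged sketch of how such metrics can be computed from an imputed file and the original domain (this is our illustration, not the submission's actual evaluation code): RMSE over a numeric feature's imputed values, and Perror as the fraction of mismatched nominal values.

```java
// Illustrative metric computation: RMSE for a continuous feature,
// Perror (fraction of mismatches) for a nominal one, comparing imputed
// values against the corresponding original (ground-truth) values.
public class ImputationMetricsSketch {
    static double rmse(double[] imputed, double[] original) {
        double s = 0.0;
        for (int i = 0; i < imputed.length; i++) {
            double d = imputed[i] - original[i];
            s += d * d;
        }
        return Math.sqrt(s / imputed.length);
    }

    static double perror(String[] imputed, String[] original) {
        int errors = 0;
        for (int i = 0; i < imputed.length; i++)
            if (!imputed[i].equals(original[i])) errors++;
        return (double) errors / imputed.length;
    }

    public static void main(String[] args) {
        System.out.println(rmse(new double[]{0.5, 0.2}, new double[]{0.5, 0.4}));         // ~0.1414
        System.out.println(perror(new String[]{"M","F","I"}, new String[]{"M","M","I"})); // ~0.3333
    }
}
```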

* In each directory Set_*, 3 subdirectories output_samples/, results/, working_dir/ save all statistics of the runs on each fold

** Running an example experiment MDA:

* run script: ./icml24_script-missing-data-imputation.sh (do not forget to edit paths inside)

The program details the features found in the domain (types, missing values, entropies, variances, etc.), then learns a generator 
(indicating the % and amount of memory used), then displays the generative model in extenso (Table 4) -- for example:

(name = #0.0 | depth = 8 | #nodes = 33)
[#0:root] internal {4177} (Height <= 0.234950 ? #1 : #2)
┣━[#1] internal {4013} (ShellWeight <= 0.170406 ? #3 : #4)
┃ ┣━[#3] internal {1410} (ShuckedWeight <= 0.067832 ? #9 : #10)
┃ ┃ ┣━[#9] leaf {275}
┃ ┃ ┗━[#10] internal {1135} (ShellWeight <= 0.075083 ? #13 : #14)
┃ ┃   ┣━[#13] leaf {239}
┃ ┃   ┗━[#14] internal {896} (ShellWeight <= 0.145867 ? #19 : #20)
┃ ┃     ┣━[#19] internal {642} (ShellWeight <= 0.089100 ? #25 : #26)
┃ ┃     ┃ ┣━[#25] leaf {115}
┃ ┃     ┃ ┗━[#26] internal {527} (ShellWeight <= 0.109896 ? #27 : #28)
┃ ┃     ┃   ┣━[#27] leaf {174}
┃ ┃     ┃   ┗━[#28] leaf {353}
┃ ┃     ┗━[#20] leaf {254}
┃ ┗━[#4] internal {2603} (ShellWeight <= 0.434832 ? #5 : #6)
┃   ┣━[#5] internal {2171} (ShellWeight <= 0.272511 ? #7 : #8)
┃   ┃ ┣━[#7] internal {987} (ShellWeight <= 0.240161 ? #15 : #16)
┃   ┃ ┃ ┣━[#15] internal {670} (ShellWeight <= 0.207701 ? #21 : #22)
┃   ┃ ┃ ┃ ┣━[#21] leaf {348}
┃   ┃ ┃ ┃ ┗━[#22] leaf {322}
┃   ┃ ┃ ┗━[#16] leaf {317}
┃   ┃ ┗━[#8] internal {1184} (ShellWeight <= 0.291797 ? #11 : #12)
┃   ┃   ┣━[#11] leaf {213}
┃   ┃   ┗━[#12] internal {971} (ShellWeight <= 0.325785 ? #17 : #18)
┃   ┃     ┣━[#17] leaf {322}
┃   ┃     ┗━[#18] internal {649} (ShellWeight <= 0.397043 ? #23 : #24)
┃   ┃       ┣━[#23] internal {484} (ShellWeight <= 0.368117 ? #29 : #30)
┃   ┃       ┃ ┣━[#29] leaf {312}
┃   ┃       ┃ ┗━[#30] leaf {172}
┃   ┃       ┗━[#24] leaf {165}
┃   ┗━[#6] internal {432} (ShellWeight <= 0.587253 ? #31 : #32)
┃     ┣━[#31] leaf {312}
┃     ┗━[#32] leaf {120}
┗━[#2] leaf {164}
Leaves: #31{312} #32{120} #17{322} #20{254} #29{312} #2{164} #9{275} #25{115} #22{322} #16{317} #21{348} #27{174} #11{213} #30{172} #24{165} #28{353} #13{239}.

Example: "[#1] internal {4013} (ShellWeight <= 0.170406 ? #3 : #4)" indicates that node #1 is reached by observations from R with a total 
weight of 4013 (approximately the number of rows in the training sample reaching the node; divide by 4177 to get a percentage). Its test is formulated C-style.
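
To illustrate how such a node routes an observation (a sketch of ours; the `Node` class and field names are assumptions, not the submission's code): the C-style ternary sends the observation to the left child if the test holds, to the right child otherwise, until a leaf is reached.

```java
// Illustrative routing through a tree printed as above: an internal node
// like "(ShellWeight <= 0.170406 ? #3 : #4)" sends an observation to #3
// when the test holds and to #4 otherwise.
public class TreeRoutingSketch {
    static class Node {
        int id, featureIndex; double threshold; Node left, right;
        Node(int id) { this.id = id; }                       // leaf
        Node(int id, int f, double t, Node l, Node r) {      // internal node
            this.id = id; featureIndex = f; threshold = t; left = l; right = r;
        }
        boolean isLeaf() { return left == null; }
    }

    static int routeToLeaf(Node n, double[] obs) {
        while (!n.isLeaf())
            n = (obs[n.featureIndex] <= n.threshold) ? n.left : n.right;
        return n.id;
    }

    public static void main(String[] args) {
        // mini tree: #0 tests Height (index 0), #1 tests ShellWeight (index 1)
        Node n3 = new Node(3), n4 = new Node(4), n2 = new Node(2);
        Node n1 = new Node(1, 1, 0.170406, n3, n4);
        Node root = new Node(0, 0, 0.234950, n1, n2);
        System.out.println(routeToLeaf(root, new double[]{0.1, 0.15})); // 3
        System.out.println(routeToLeaf(root, new double[]{0.3, 0.15})); // 2
    }
}
```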

Then the program imputes the training file and saves the observations (subdirectory working_dir/, in both .csv and .txt format for further 
processing), optionally generates and saves additional data if asked (subdirectory output_samples/), and saves additional statistics 
(tree statistics, etc.) in subdirectory results/.

====================================================================================================================================================================

** Experiment: Density estimation (DE), plotting densities

* Dataset circgauss is provided in directory Datasets, formatted for DE: 5 directories Split_*, each containing the train / test files 
(circgauss_train.csv / circgauss_test.csv) of our 5-fold CV (full domain in circgauss.csv). 

** Running an example experiment DE:

* run script: ./icml24_script-density-estimation-and-plots.sh (do not forget to edit paths inside)

The program details the features found in the domain (types, missing values, entropies, variances, etc.), then learns a generator from the 
training file (indicating the % and amount of memory used), then displays the generative model in extenso (Table 4); the program then 
generates data (if asked) and performs density estimation on the test file every 100 training iterations, saving the results (expected 
density and expected log-likelihood saved in two separate files, density_estimation_likelihoods_*.txt and density_estimation_log_likelihoods_*.txt).
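
As a hedged sketch of the two saved quantities (our reading, not the submission's code): the expected density and expected log-likelihood are the averages, over the test sample, of p(x) and log p(x) under the learned model.

```java
// Illustrative computation of the two quantities saved during density
// estimation, given the model's density evaluated on each test example:
// expected density = mean of p(x), expected log-likelihood = mean of log p(x).
public class DensityEstimationSketch {
    static double expectedDensity(double[] densities) {
        double s = 0.0;
        for (double p : densities) s += p;
        return s / densities.length;
    }

    static double expectedLogLikelihood(double[] densities) {
        double s = 0.0;
        for (double p : densities) s += Math.log(p);
        return s / densities.length;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.25, 1.0};   // hypothetical per-test-example densities
        System.out.println(expectedDensity(p));        // ~0.5833
        System.out.println(expectedLogLikelihood(p));  // ~-0.6931
    }
}
```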

Note: the 100-iteration period can be changed in Algorithm.java, on the line "public static int STEPS_TO_COMPUTE_DENSITY_ESTIMATION = 100;"

Note the parts in the script "'--plot_labels={"X","Y"}'" and ""plot_type" : "data"". plot_labels indicates which variables to plot in a 
density plot (a list can be provided; names must follow the .csv's first line). plot_type indicates which kind of plot to save: "data" means 
two types of plots are saved, both the density plots from the generated data 
(e.g. circgauss_gf_generated_*__X_X_Y_Y_jointdensity_plot_generated_*.png) and a density plot of the whole domain for comparison 
(e.g. circgauss_gf_generated_*__X_X_Y_Y_jointdensity_plot_domaindensity.png). You can replace ""plot_type" : "data"" by ""plot_type" : "all"", 
which saves in addition a frontier plot of the >0 density regions projected on the chosen XY plane 
(e.g. circgauss_gf_generated_*__X_X_Y_Y_projectedfrontiers_plot.png). Note: this can take a long time depending on the domain.

====================================================================================================================================================================

** Experiment: sole data generation (GEN)

* Dataset winered is provided in directory Datasets, formatted for GEN: 5 directories Split_*, each containing the train / test files 
(winered_train.csv / winered_test.csv) of our 5-fold CV (full domain in winered.csv). The script contains, for each fold, the number of 
examples to generate to match the size of the test sample, so that the generated sample can be compared with the test sample (cf. paper).

* run script: ./icml24_script-generate.sh (do not forget to edit paths inside)

====================================================================================================================================================================

Thank you. We look forward to interactions at rebuttal time.
