<!DOCTYPE html>
<html>
<head>
    <meta charset='UTF-8'>
    <title>Supplementary material</title>
</head>
<body>
    <h1>Supplementary Material for Paper #9380</h1>

<h2>Contents</h2>
This archive contains several folders and files:
<ul>
	<li><b>bin</b> contains binary/executable pieces of code required by our scripts</li>
	<li><b>dataset</b> contains examples of datasets that can be used to test our scripts and to reproduce our results</li>
    <li><b>cnf_files</b> contains files that are generated when the script "generate_data_RF.py" is launched.
        The folder contains *.gcnf files used by the MUS extractor and *.wcnf files used by the Partial MaxSAT solver (the files for the "placement" dataset are included as an example)</li>
	<li><b>plot_rf</b> contains the plots produced by running the "to_plot.py" script (the plots for the "placement" dataset are included as an example)</li>
	<li><b>all_plot</b> contains plots (similar to those given in the paper) for each of the 15 datasets</li>
    <li><b>result_RF</b> is a folder containing .json files that can be generated by launching the command <code style="color:red;">python generate_data_RF.py {DatasetName} </code> (see below). As an example, the .json file produced for the "placement" dataset is provided in the folder. These .json files use the following keys:
		<ul>
			<li><t style="color:orange">acc:</t> The average accuracy over the 10 random forests produced by the cross validation process</li>
            <li><t style="color:orange">instance:</t> The list of binarized instances that have been considered in the experiments for the corresponding dataset and the random forests that have been learned. Instances 1 to 25 are those picked from the test set of the first random forest, instances 26 to 50 are associated with the second random forest, and so on</li>
			<li><t style="color:orange">classified:</t> A list of tuples, each containing a Boolean value indicating whether the classifier predicted the right class for the corresponding instance, and a number (1 or 0) specifying this class</li>
            <li><t style="color:orange">len_bin:</t> A list providing, for each random forest, two successive numbers: the first one is the number of Boolean features used in the forest, and the second one is the number of original features in the dataset</li>
			<li><t style="color:orange">lime:</t> A list of tuples indicating for each instance the size of the LIME explanation that has been computed and the computation time needed to get it</li>
			<li><t style="color:orange">sufficient:</t> Same as LIME but for sufficient reasons (using MUS extractor) </li>
			<li><t style="color:orange">direct:</t> Same as LIME but for direct reasons </li>
			<li><t style="color:orange">majoritary:</t> Same as LIME but for majoritary reasons (using the greedy algorithm)</li>
			<li><t style="color:orange">10s:</t> A list containing the size of an explanation produced by the LHMS solver (partial MaxSAT approach) run with a time limit of 10 seconds</li>
			<li><t style="color:orange">60s:</t> Same as above but with a time limit of 60 seconds</li>
			<li><t style="color:orange">600s:</t> Same as above but with a time limit of 600 seconds </li>
            <li><t style="color:orange">reason:</t> A list of lists. Each inner list contains the explanation that has been computed. The following order is used: direct, sufficient, majoritary, and approximation of a minimal majoritary reason for a timeout of 10 seconds</li>
			<li><t style="color:orange">hashmap:</t> A list of 10 hashmaps, one per random forest. Each hashmap contains pairs of the form: (index of the original feature, threshold) : [index of the corresponding Boolean feature in the forest, number of occurrences of this Boolean feature in the forest]. The literal associated with the Boolean feature is positive in an explanation if and only if the corresponding original feature is strictly greater than the corresponding threshold</li>
		</ul></li>
	<li><b>script</b> contains the scripts that have been written</li>
	<li><b>majoritary-C++</b> contains C++ code that must be compiled before our Python scripts can be run (not detailed here)</li>
	<li><b>environment.yml</b> is a .yml file that helps you reproduce our Python environment, which is required to run our scripts</li>
        <li><b>data_RF.ods</b> is a spreadsheet reporting, for each dataset, statistics about the computations carried out over the 250 instances that have been considered.
        Here is a glossary to help you read it:
		<ul>
			<li><t style="color:purple">acc:</t> The average accuracy over the 10 random forests that have been generated</li>
			<li><t style="color:purple">nb_instances/nb_attributes :</t> The number of instances/attributes in the dataset</li>
			<li><t style="color:purple">nb_tree:</t> The number of trees in each random forest generated for this dataset</li>
			<li><t style="color:purple">avg_nb_bin:</t> The average number of Boolean features used in the 10 random forests that have been generated</li>
			<li><t style="color:purple">std_nb_bin:</t> The standard deviation corresponding to <t style="color:purple">avg_nb_bin :</t> </li>
			<li><t style="color:purple">med_*:</t> The median value (over the 250 instances) of the size of an explanation computed using the * approach</li>
			<li><t style="color:purple">max_*:</t> The maximum value (over the 250 instances) of the size of an explanation computed using the * approach</li>
			<li><t style="color:purple">nb_opt_approx_**s:</t> The number of "truly" minimal majoritary reasons (over the 250 instances) discovered by the approximation algorithm in at most ** seconds (a minimal majoritary reason has been found whenever no size reduction results from one step to the next one)</li>
		</ul></li>
</ul>
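<p>The key descriptions above can be illustrated with a minimal Python sketch. It is not taken from our scripts: the function names are hypothetical, the sample data are made up, and the encoding of a literal as a signed integer (with Boolean feature indices starting at 1) is an assumption. The exact way the hashmap keys are serialized in the .json files may also differ from the tuples used here.</p>
<pre>
```python
from statistics import median

INSTANCES_PER_FOREST = 25  # instances 1 to 25 belong to the first forest, and so on


def forest_index(instance_number):
    """Return the 0-based forest index of a 1-based instance number."""
    return (instance_number - 1) // INSTANCES_PER_FOREST


def size_stats(entries):
    """Return (median size, maximum size) from a list of (size, time) pairs."""
    sizes = [size for size, _time in entries]
    return median(sizes), max(sizes)


def literal(hashmap, feature, threshold, value):
    """Return the signed literal (assumed encoding) of one original feature value.

    The literal is positive if and only if the value is strictly greater
    than the threshold, as stated for the "hashmap" key.
    """
    bool_index, _occurrences = hashmap[(feature, threshold)]
    return bool_index if value > threshold else -bool_index


# Hypothetical data shaped like the description above, not real results.
results = {"lime": [[5, 0.8], [7, 1.1], [4, 0.7]]}
hashmap = {(0, 2.5): [1, 12], (3, 0.5): [2, 4]}

print(forest_index(26))
print(size_stats(results["lime"]))
print(literal(hashmap, 0, 2.5, 3.0), literal(hashmap, 3, 0.5, 0.5))
```
</pre>
<p>The same median and maximum computations correspond to the <t style="color:purple">med_*</t> and <t style="color:purple">max_*</t> columns of the spreadsheet.</p>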


<h2>Software</h2>

<h4>How to set up our Python environment before running our scripts</h4>
<ul>
    <li>Use a Linux OS and a version of Python 3.x</li>
	<li>Install <a href="https://www.anaconda.com/products/individual">anaconda</a> </li>
	<li>Open a terminal in this folder with anaconda activated (if conda is activated, "(base)" is displayed in your terminal)</li>
	<li>Execute the command <code style="color:red;">conda env create --file environment.yml</code> to recreate our Python environment on your system</li>
	<li>Execute the command <code style="color:red;">./build.sh</code> in <code style="color:green">majoritary-C++</code>  <b>WARNING:</b> If this does not work, try changing "python3" to "python" in <code style="color:green">majoritary-C++/CMakeLists.txt</code> at line 41, then execute <code style="color:red;">./clean.sh</code> and <code style="color:red;">./build.sh</code> again </li>
	<li>Move the newly created folder <code style="color:green">build</code> from <code style="color:green">majoritary-C++</code> to the <code style="color:green">script</code> folder</li>
	
</ul>
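<p>Before launching the scripts, the outcome of the steps above can be checked with a short Python sketch. It is only an illustration and is not part of the archive: the function name <code>preflight</code> is hypothetical, and the only assumption is that the <code style="color:green">build</code> folder ends up in <code style="color:green">script</code>, as described in the last step.</p>
<pre>
```python
import sys
from pathlib import Path


def preflight(repo_root="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not sys.platform.startswith("linux"):
        problems.append("a Linux OS is expected")
    if sys.version_info.major != 3:
        problems.append("Python 3.x is expected")
    # The compiled C++ code is expected in script/build after the last setup step.
    build_dir = Path(repo_root) / "script" / "build"
    if not build_dir.is_dir():
        problems.append(f"missing folder: {build_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```
</pre>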

<h4>How to use our scripts</h4>
<ul>
	<li>Prepare a dataset and store it in the <b>dataset</b> folder (or use one of the datasets available). Your dataset must be a .csv file in which each line is an instance, the last column gives the label (class) of the instance, and all values are numerical (values of categorical features must be turned into numbers). Only two classes are allowed</li>
	<li>Set the number of trees per random forest for this dataset by updating <code style="color:green">script/info_data_RF.json</code>  </li>
	<li>Execute the command <code style="color:red;">python generate_data_RF.py {DatasetName} </code> to generate a .json file in <b>result_RF</b> containing the results</li>
    <li>Execute the command <code style="color:red;">python to_plot.py</code> to generate .pdf files containing plots in <b>plot_rf</b> for each .json file in <b>result_RF</b></li>
</ul>
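<p>The dataset constraints listed in the first step can be checked before running the scripts. The following Python sketch is an illustration only, not part of our scripts: the function name <code>check_dataset</code> and its <code>has_header</code> parameter are hypothetical, and it assumes a comma-separated file whose first line is a header (set <code>has_header=False</code> otherwise).</p>
<pre>
```python
import csv


def check_dataset(path, has_header=True):
    """Check the documented constraints; return (n_rows, n_features, classes)."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if has_header:  # assumption: the first line holds column names
        rows = rows[1:]
    if not rows:
        raise ValueError("the dataset has no instance")
    width = len(rows[0])
    labels = set()
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(f"row {number}: expected {width} columns, got {len(row)}")
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            raise ValueError(f"row {number}: non-numerical value") from None
        labels.add(values[-1])  # the last column is the class label
    if len(labels) > 2:
        raise ValueError(f"more than two classes found: {sorted(labels)}")
    return len(rows), width - 1, sorted(labels)
```
</pre>
<p>For instance, <code>check_dataset("dataset/placement.csv")</code> would raise a <code>ValueError</code> on the first offending row, assuming the file is stored under that (hypothetical) path.</p>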

<h4>Description of our scripts</h4> 
<h5>generate_data_RF.py</h5>
<p>This script generates results following a 10-fold cross-validation process, as explained in the experimental section of the
    paper.
    To run it, enter the command <code style="color:red;">python generate_data_RF.py {DatasetName}</code>
    (the dataset is selected as an argument of the script). Results are saved in a .json file stored in the
    <b>result_RF</b> folder. You can select what is computed by commenting out the corresponding instructions (lines 180 to 254).
    By default, the command computes, for each instance, a sufficient reason, a majoritary reason, an approximation of a minimal majoritary reason
    (obtained after 10s), and a LIME explanation. The code is currently sequential, so as to avoid running out of memory or leaving orphaned jobs behind if the script is stopped abruptly. If desired, you can parallelize it by increasing the number of workers (the value of the variable "nb_w" at line 323).
    <b>WARNING:</b> Make sure that <b>cnf_files</b> does not already contain a folder for the dataset you are processing. If it does, you must delete the former files first (this safeguard prevents the script from overwriting everything at each run).
</p>
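<p>The way 10 folds with 25 test instances each lead to the 250 instances mentioned earlier can be sketched as follows. This is a simplified illustration using only the Python standard library, not the code of <code>generate_data_RF.py</code>: the function name, the <code>seed</code> parameter, and the way folds are built are all assumptions.</p>
<pre>
```python
import random


def ten_fold_test_samples(n_instances, n_folds=10, per_fold=25, seed=0):
    """Yield (train_indices, sampled_test_indices) for each fold."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    # Fold k takes every n_folds-th shuffled index, starting at position k.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        # A forest would be learned on all folds except the k-th one.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # At most per_fold instances are picked from the test fold.
        sampled = rng.sample(test, min(per_fold, len(test)))
        yield train, sampled
```
</pre>
<p>With a dataset of at least 250 instances, iterating over <code>ten_fold_test_samples(n)</code> gives 10 pairs whose sampled test sets contain 25 instances each, hence 250 instances overall.</p>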

<h5>to_plot.py</h5>
<p>This script creates four plots for each .json file in <b>result_RF</b> and stores them in <b>plot_rf</b>. You can draw plots for specific datasets by changing lines 8 and 9
</p>

<h5>my_tree.py</h5>
<p>This script contains pieces of code to encode decision trees and to analyze them
</p>

<h5>my_forest.py</h5>
<p>This script contains pieces of code to encode forests of decision trees (in particular, random forests) and to analyze them
</p>

<h5>Other scripts</h5>
<ul>
	<li><b>encodage_CNF.py</b> contains pieces of code to encode propositional formulae into CNF formulae using the Tseitin transformation</li>
	<li><b>timeout.py</b> contains pieces of code to trigger a time-out exception</li>
</ul>
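<p>For readers unfamiliar with the Tseitin transformation mentioned above, here is a minimal Python sketch of the idea. It is not the code of <code>encodage_CNF.py</code>: the formula representation (integers for literals, tuples for connectives), the function name <code>tseitin</code>, and the signed-integer clause format are assumptions made for illustration.</p>
<pre>
```python
from itertools import count


def tseitin(formula, n_vars):
    """Return CNF clauses (lists of signed ints) equisatisfiable with formula.

    formula is an int (a literal over variables 1..n_vars) or a tuple
    ("not", f), ("and", f, g) or ("or", f, g).
    """
    fresh = count(n_vars + 1)  # auxiliary variables start after the input variables
    clauses = []

    def encode(node):
        """Return a literal equivalent to node, adding defining clauses as needed."""
        if isinstance(node, int):
            return node
        if node[0] == "not":
            return -encode(node[1])
        a, b = encode(node[1]), encode(node[2])
        x = next(fresh)
        if node[0] == "and":   # x iff (a and b)
            clauses.extend([[-x, a], [-x, b], [x, -a, -b]])
        elif node[0] == "or":  # x iff (a or b)
            clauses.extend([[-x, a, b], [x, -a], [x, -b]])
        else:
            raise ValueError(f"unknown connective: {node[0]}")
        return x

    clauses.append([encode(formula)])  # the root of the formula must be true
    return clauses


print(tseitin(("and", 1, 2), 2))
```
</pre>
<p>Each binary connective introduces one auxiliary variable and three clauses, so the size of the resulting CNF formula is linear in the size of the input formula.</p>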
</body>

</html>
