\section{TED Similarity Score}
\label{appendix:ted}
\subsection{Creating Trees}

The \statement/ data structure can be represented in several ways, for example as a hypergraph, a tree, a table, or a set of records, and converting between these representations is straightforward.

In our setup, when a \statement/ is represented as a tree, every node has four attributes: \textit{name}, \textit{type}, \textit{value}, and \textit{parent}. The tree starts at the root node, which has the name `/root', the type `root', no value, and no parent. The statement nodes branch from the root. Each statement node has a name such as `/root/s0' or `/root/s2' (here, `s' indicates a statement node and the number acts as an index), the type `statement', no value, and the root node as its parent. Attached to each statement node are one or more predicate nodes with names such as `/root/s1/p0' or `/root/s0/p3', the type `predicate', no value, and a statement node as their parent. Finally, in our current implementation, each predicate node has five child nodes. These leaf nodes have one of the types subject, subject-value, property, property-value, or unit, and their value attribute holds the actual value. Leaf nodes have names such as `/root/s2/p1/subject' or `/root/s0/p3/property-value'. In this representation, the name of a node completely determines its location in the tree.

As an example, the tree structure for the statements in \cref{fig:text_table_to_statements} is:

\begin{mdframed}[backgroundcolor=gray!10]
\small
\begin{verbatim}
Node('/root', type='root', value=None)
|-- Node('/root/s0', type='statement', value=None)
|   |-- Node('/root/s0/p0', type='predicate', value=None)
|   |   |-- Node('/root/s0/p0/Subject', type='Subject', value='Organization')
|   |   |-- Node('/root/s0/p0/Subject Value', type='Subject Value', value='XYZ')
|   |   |-- Node('/root/s0/p0/Property', type='Property', value='scope 1 emissions')
|   |   |-- Node('/root/s0/p0/Property Value', type='Property Value', value='3.3')
|   |   |-- Node('/root/s0/p0/Unit', type='Unit', value='million metric tons of CO2e')
|   |-- Node('/root/s0/p1', type='predicate', value=None)
|       |-- Node('/root/s0/p1/Subject', type='Subject', value='Organization')
|       |-- Node('/root/s0/p1/Subject Value', type='Subject Value', value='XYZ')
|       |-- Node('/root/s0/p1/Property', type='Property', value='time')
|       |-- Node('/root/s0/p1/Property Value', type='Property Value', value='2020')
|       |-- Node('/root/s0/p1/Unit', type='Unit', value='year')
|-- Node('/root/s1', type='statement', value=None)
    |-- Node('/root/s1/p0', type='predicate', value=None)
    |   |-- Node('/root/s1/p0/Subject', type='Subject', value='Organization')
    |   |-- Node('/root/s1/p0/Subject Value', type='Subject Value', value='XYZ')
    |   |-- Node('/root/s1/p0/Property', type='Property', value='scope 1 emissions')
    |   |-- Node('/root/s1/p0/Property Value', type='Property Value', value='2.5')
    |   |-- Node('/root/s1/p0/Unit', type='Unit', value='million metric tons of CO2e')
    |-- Node('/root/s1/p1', type='predicate', value=None)
        |-- Node('/root/s1/p1/Subject', type='Subject', value='Organization')
        |-- Node('/root/s1/p1/Subject Value', type='Subject Value', value='XYZ')
        |-- Node('/root/s1/p1/Property', type='Property', value='time')
        |-- Node('/root/s1/p1/Property Value', type='Property Value', value='2021')
        |-- Node('/root/s1/p1/Unit', type='Unit', value='year')
\end{verbatim}
\end{mdframed}
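
The size of such a tree follows directly from the construction above, which gives a quick sanity check on a generated tree. As a sketch, let $S$ be the number of statements and $P$ the total number of predicates (symbols introduced here for illustration only), and assume that every predicate has exactly five leaf nodes, as in the current implementation. The number of nodes $N$ is then
\[
N = 1 + S + P + 5P = 1 + S + 6P .
\]
For the tree shown above, $S = 2$ and $P = 4$, so $N = 1 + 2 + 24 = 27$, which matches the 27 nodes listed.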


\subsection{Computing Tree Similarity Score}
\label{appendix:treesimilarityscore}

To compare two statement trees, we set up strict costs for each edit operation. Predictions are maximally penalized for any structural deviation from the ground truth: each deletion and each insertion has a cost of 1. A rename operates on the value attribute of a node and is only allowed between two nodes of the same type. If both values are strings, the rename cost is the normalized Levenshtein edit distance between the two strings.

If both values are numeric, they are compared directly: the cost is 0 if the two values are equal and 1 otherwise. If the value attribute is empty in both the ground-truth node and the predicted node, the cost of the operation is also 0. We denote the tree edit distance (TED) by $t$ and define the normalized TED (nTED or $\overline{t}$) as the ratio of the distance to the number of edits between the two trees. Using the normalized TED, a normalized tree similarity score can be computed as $t_{s} = 1 - \overline{t}$.
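
The cost scheme described above can be written compactly as follows. This is a sketch rather than a definitive specification: the symbols $c_{\mathrm{ins}}$, $c_{\mathrm{del}}$, $c_{\mathrm{ren}}$, $\mathrm{lev}$, and $n$ are notation introduced here, and normalizing the Levenshtein distance by the length of the longer string is an assumption (it is consistent with the worked examples in this appendix, in which the compared strings have equal length). For two nodes $u$ and $v$ of the same type with values $a$ and $b$:
\[
c_{\mathrm{ins}} = c_{\mathrm{del}} = 1, \qquad
c_{\mathrm{ren}}(u, v) =
\left\{
\begin{array}{ll}
0 & \mbox{if $a$ and $b$ are both empty,} \\
0 & \mbox{if $a$ and $b$ are numeric and $a = b$,} \\
1 & \mbox{if $a$ and $b$ are numeric and $a \neq b$,} \\
\mathrm{lev}(a, b) / \max(|a|, |b|) & \mbox{if $a$ and $b$ are strings.}
\end{array}
\right.
\]
With $t$ the sum of the costs of all edit operations and $n$ the number of those operations, the normalized distance and the similarity score are
\[
\overline{t} = \frac{t}{n}, \qquad t_{s} = 1 - \frac{t}{n} .
\]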

Consider comparing the trees for the two statements $s0$ and $s1$ from the example above. The two trees have the same structure and differ only in two numeric values, so two edits are required to convert one tree into the other: one for the property-value of `time' and one for the property-value of `scope 1 emissions'. If the numeric values are interpreted as floats, our strict setup assigns the maximal cost of 1 to each edit, giving an edit distance of 2 from 2 renamings, 0 deletions, and 0 insertions. The normalized tree edit distance (the ratio of the distance to the total number of edits) is then $2 / 2 = 1$, and the TED similarity score is $1 - 1 = 0$.

However, our model outputs numeric values as strings, which are compared via the normalized Levenshtein distance. The rename of the year value then costs $1/4 = 0.25$, and the rename of the emissions value costs $2/3 \approx 0.667$. In this case, the total tree edit distance is approximately 0.9167 and the normalized tree edit distance is approximately 0.4583, which gives a TED similarity score of about 0.54. We interpret this as ``the two trees (when the numeric values are interpreted as strings) are 54\% similar to each other''. Given that the two trees have the same structure and differ only in two numeric values, this shows that our TED similarity score is very strict.
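
The arithmetic behind these numbers can be checked by hand. In the sketch below, $n$ denotes the number of edits (notation introduced here), and each Levenshtein distance is assumed to be normalized by the string length, which is 4 characters for the year and 3 characters for the emissions value:
\[
\begin{array}{rcl}
t & = & \displaystyle \frac{1}{4} + \frac{2}{3} = \frac{11}{12} \approx 0.9167, \\[2ex]
\overline{t} & = & \displaystyle \frac{t}{n} = \frac{11/12}{2} = \frac{11}{24} \approx 0.4583, \\[2ex]
t_{s} & = & \displaystyle 1 - \overline{t} = \frac{13}{24} \approx 0.5417 .
\end{array}
\]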

As a second illustration, suppose that $s0$ from the example above is the ground-truth statement:

\begin{mdframed}[backgroundcolor=blue!10]
\small
\begin{verbatim}
Node('/root', type='root', value=None)
|-- Node('/root/s0', type='statement', value=None)
|   |-- Node('/root/s0/p0', type='predicate', value=None)
|   |   |-- Node('/root/s0/p0/subject', type='subject', value='Organization')
|   |   |-- Node('/root/s0/p0/subject_value', type='subject_value', value='XYZ')
|   |   |-- Node('/root/s0/p0/property', type='property', value='scope 1 emissions')
|   |   |-- Node('/root/s0/p0/property_value', type='property_value', value='3.3')
|   |   |-- Node('/root/s0/p0/unit', type='unit', value='million metric tons of CO2e')
|   |-- Node('/root/s0/p1', type='predicate', value=None)
|       |-- Node('/root/s0/p1/subject', type='subject', value='Organization')
|       |-- Node('/root/s0/p1/subject_value', type='subject_value', value='XYZ')
|       |-- Node('/root/s0/p1/property', type='property', value='time')
|       |-- Node('/root/s0/p1/property_value', type='property_value', value='2020')
|       |-- Node('/root/s0/p1/unit', type='unit', value='year')
\end{verbatim}
\end{mdframed}

Now suppose a model makes the following prediction:

\begin{mdframed}[backgroundcolor=orange!10]
\small
\begin{verbatim}
Node('/root', type='root', value=None)
|-- Node('/root/s1', type='statement', value=None)
    |-- Node('/root/s1/p0', type='predicate', value=None)
        |-- Node('/root/s1/p0/subject', type='subject', value='Organization')
        |-- Node('/root/s1/p0/subject_value', type='subject_value', value='XYZ')
        |-- Node('/root/s1/p0/property', type='property', value='scope 2 emissions')
        |-- Node('/root/s1/p0/property_value', type='property_value', value='3.3')
        |-- Node('/root/s1/p0/unit', type='unit', value='million metric tons of CO2e')
\end{verbatim}
\end{mdframed}

The predicted tree is missing the entire predicate with the `time' property. This happens when a model stops generating new tokens. In contrast to the previous example, the ground truth and the model prediction now show a major structural deviation. In addition, the model made a mistake in the value of the `property' node: instead of `scope 1 emissions' as in the ground truth, it predicted `scope 2 emissions'.

To convert one tree into the other, we need a total of 7 edits: 6 deletions (or insertions), covering 5 leaf nodes and 1 predicate node, and 1 rename. Each deletion or insertion has a cost of 1, and the rename costs $1/17 \approx 0.0588$. The total tree edit distance is therefore approximately 6.0588 and the normalized tree edit distance is approximately 0.8655, which gives a tree similarity score of approximately 0.1345. We interpret this as the two trees being only 13\% similar to each other.
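
These values can be verified step by step. In the sketch below, $n$ denotes the number of edits (notation introduced here), and the rename cost is assumed to be the single differing character divided by the 17 characters of `scope 1 emissions':
\[
\begin{array}{rcl}
t & = & \displaystyle 6 \cdot 1 + \frac{1}{17} = \frac{103}{17} \approx 6.0588, \\[2ex]
\overline{t} & = & \displaystyle \frac{t}{n} = \frac{103/17}{7} = \frac{103}{119} \approx 0.8655, \\[2ex]
t_{s} & = & \displaystyle 1 - \overline{t} = \frac{16}{119} \approx 0.1345 .
\end{array}
\]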