\section{Crease Pattern Evaluation System}
\label{app:eval_2}
This section describes the complete evaluation process for a crease pattern (CP). The final score is a weighted average of the scores from four main dimensions, each of which is assigned an equal weight:
\begin{itemize}
    \item Topological Similarity: $w_{topological} = 0.25$
    \item Geometric Similarity: $w_{geometric} = 0.25$
    \item Foldability Constraint Satisfaction: $w_{foldability} = 0.25$
    \item Final Folded State: $w_{fold\_state} = 0.25$
\end{itemize}
The total score $S_{total}$ is calculated as:
$$ S_{total} = \sum_{dim} w_{dim} \cdot s_{dim} $$
Because the weights sum to one ($\sum w_{dim} = 1$), no further normalization is needed, and the sum expands to:
$$ S_{total} = 0.25 \cdot s_{topological} + 0.25 \cdot s_{geometric} + 0.25 \cdot s_{foldability} + 0.25 \cdot s_{fold\_state} $$
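where $s_{dim}$ is the score for dimension $dim$. In the subsections below, these four scores are written $S_{topological}$, $S_{geometric}$, $S_{foldability}$, and $S_{final\_state}$ (the last corresponds to $s_{fold\_state}$ here).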
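
As a worked example with hypothetical dimension scores (chosen only for illustration, not taken from any evaluation run), suppose $s_{topological} = 0.8$, $s_{geometric} = 0.6$, $s_{foldability} = 1.0$, and $s_{fold\_state} = 0.4$. Then
$$ S_{total} = 0.25 \cdot 0.8 + 0.25 \cdot 0.6 + 0.25 \cdot 1.0 + 0.25 \cdot 0.4 = 0.70. $$
With equal weights, the total is simply the arithmetic mean of the four scores.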

\subsection{CP Structure Validation (\texttt{validate\_cp\_structure})}
This initial step ensures the generated CP data (\texttt{cp\_data}) is well-formed and meets basic criteria for a valid crease pattern.
\begin{itemize}
    \item \textbf{Presence of Basic Elements}: Checks if \texttt{"vertices\_coords"}, \texttt{"edges\_vertices"}, and \texttt{"faces\_vertices"} keys exist in the input.
    \item \textbf{Vertex Coordinates}: Each vertex in \texttt{vertices\_coords} must be a list of two numerical coordinates (e.g., \texttt{[x, y]}).
    \item \textbf{Edge Definitions}: Each edge in \texttt{edges\_vertices} must be a list of two integer vertex indices (e.g., \texttt{[v1, v2]}). These indices must be valid and within the bounds of the vertex list.
    \item \textbf{Crease Assignments (Optional)}: If \texttt{"edges\_assignment"} is present, each assignment must be one of the valid types: \texttt{"B"} (Boundary), \texttt{"M"} (Mountain), \texttt{"V"} (Valley), \texttt{"F"} (Flat), or \texttt{"U"} (Unassigned).
    \item \textbf{Face Definitions}: Each face in \texttt{faces\_vertices} must be a list of at least three integer vertex indices. These indices must be valid.
    \item \textbf{Euler Characteristic}: For a connected planar graph, the counts must satisfy Euler's formula $V - E + F = 2$, where $V$ is the number of vertices, $E$ is the number of edges, and $F$ is the number of faces, with the unbounded outer face counted as a face.
    \item \textbf{Flat-Folder Validation (Optional)}: If the Flat-Folder \texttt{compute} module is available, its \texttt{validate\_cp\_structure(cp\_data)} API is called to check if the CP can be compiled into a valid origami model. If not, the CP is considered invalid.
\end{itemize}
If any of these checks fail, the function returns \texttt{\{"valid": False, "reason": "error message"\}}. Otherwise, it returns \texttt{\{"valid": True\}}.
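
The element-level checks above can be sketched as follows. This is a minimal illustration, not the code from \texttt{eval.py}: the name \texttt{validate\_cp\_structure\_sketch}, the helper \texttt{\_is\_number}, and the error messages are hypothetical, and the sketch assumes that a negative index counts as out of bounds. The Euler characteristic check and the optional Flat-Folder check are deliberately left out.

```python
VALID_ASSIGNMENTS = {"B", "M", "V", "F", "U"}


def _is_number(x):
    # bool is a subclass of int, so it is excluded explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def validate_cp_structure_sketch(cp_data):
    """Return {"valid": True} or {"valid": False, "reason": "..."}."""
    def fail(reason):
        return {"valid": False, "reason": reason}

    # Presence of basic elements.
    for key in ("vertices_coords", "edges_vertices", "faces_vertices"):
        if key not in cp_data:
            return fail(f"missing key: {key}")

    vertices = cp_data["vertices_coords"]
    n = len(vertices)

    def is_index(i):
        # Assumption: only indices in [0, n) are treated as valid.
        return isinstance(i, int) and not isinstance(i, bool) and 0 <= i < n

    for v in vertices:
        if not isinstance(v, (list, tuple)) or len(v) != 2 or not all(map(_is_number, v)):
            return fail("each vertex must be a pair of numbers")
    for e in cp_data["edges_vertices"]:
        if not isinstance(e, (list, tuple)) or len(e) != 2 or not all(map(is_index, e)):
            return fail("each edge must be a pair of valid vertex indices")
    # Assignments are optional; a missing key skips this check.
    for a in cp_data.get("edges_assignment", []):
        if a not in VALID_ASSIGNMENTS:
            return fail(f"invalid crease assignment: {a!r}")
    for f in cp_data["faces_vertices"]:
        if not isinstance(f, (list, tuple)) or len(f) < 3 or not all(map(is_index, f)):
            return fail("each face needs at least three valid vertex indices")
    return {"valid": True}
```

For instance, a unit square split by one mountain diagonal passes these checks, while the same CP with an edge pointing at a nonexistent vertex is rejected with a reason string.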

\subsection{Topological Similarity (\texttt{calculate\_topological\_similarity})}
This dimension assesses the similarity of the graph-theoretical structure of the generated CP (\texttt{gen\_cp}) and the reference CP (\texttt{ref\_cp}). It combines scores from four sub-metrics, after extracting basic topological information using \texttt{extract\_topology(cp\_data)}, which retrieves vertices, edges, edge assignments, and faces.

The overall topological similarity score $S_{topological}$ is a weighted average defined within the \texttt{calculate\_topological\_similarity} method:
$$ S_{topological} = 0.2 \cdot s_{vertex} + 0.3 \cdot s_{edge} + 0.3 \cdot s_{face} + 0.2 \cdot s_{crease} $$

\subsection{Vertex Count Similarity (\texttt{compare\_vertex\_count})}
Compares the number of vertices ($V_{gen}$, $V_{ref}$).
\begin{itemize}
    \item If $V_{gen} = V_{ref}$, the score is $s_{vertex} = 1.0$.
    \item Otherwise, the score is calculated using an exponential decay function:
    $$ s_{vertex} = e^{-0.5 \cdot \frac{|V_{gen} - V_{ref}|}{\min(V_{gen}, V_{ref})}} $$
    (Note: The code computes the numerator as $\max(V_{gen}, V_{ref}) - \min(V_{gen}, V_{ref})$, which is equal to $|V_{gen} - V_{ref}|$.)
\end{itemize}
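
As an illustration with hypothetical counts, take $V_{gen} = 8$ and $V_{ref} = 10$. The exponent is $-0.5 \cdot |8 - 10| / \min(8, 10) = -0.125$, so the vertex count score is
$$ e^{-0.125} \approx 0.882. $$
Since the denominator is $\min(V_{gen}, V_{ref})$, the formula as written assumes that both CPs have at least one vertex.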

\subsection{Edge Connectivity Similarity (\texttt{compare\_edge\_connectivity})}
Compares the edge structures based on degree distribution and connected components.
\begin{itemize}
    \item \textbf{Adjacency List Construction} (\texttt{build\_adjacency\_list}): Adjacency lists are built for both CPs from their edge-vertex relationships.
    \item \textbf{Degree Distribution Similarity}:
    \begin{itemize}
        \item \texttt{calculate\_degree\_distribution}: Computes the distribution of vertex degrees (number of edges connected to each vertex).
        \item \texttt{calculate\_wasserstein\_distance}: A simplified Wasserstein distance ($d_W$) is calculated between the degree distributions of the generated and reference CPs. The score for degree similarity is $s_{degree} = 1 - d_W$.
    \end{itemize}
    \item \textbf{Connected Components Similarity}:
    \begin{itemize}
        \item \texttt{count\_connected\_components}: The number of connected components ($C_{gen}$, $C_{ref}$) is determined for each CP graph using Depth First Search (DFS).
        \item If $C_{gen} = C_{ref}$, $s_{conn} = 1.0$.
        \item Otherwise, $s_{conn} = e^{-|C_{gen} - C_{ref}|}$.
    \end{itemize}
    \item The final edge connectivity score $s_{edge}$ is a weighted average: $s_{edge} = 0.7 \cdot s_{degree} + 0.3 \cdot s_{conn}$.
\end{itemize}
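
The graph steps above can be sketched as follows. This is an illustrative sketch rather than the code in \texttt{eval.py}: the function signatures are assumptions, the degree distribution is assumed to be normalized to fractions of vertices, and the simplified Wasserstein distance $d_W$ is taken as an input because its exact definition is not given here.

```python
import math
from collections import Counter


def build_adjacency_list(num_vertices, edges):
    """Undirected adjacency lists, one list of neighbors per vertex index."""
    adj = {v: [] for v in range(num_vertices)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def degree_distribution(adj):
    """Map each degree to the fraction of vertices with that degree (assumed form)."""
    total = len(adj)
    if total == 0:
        return {}
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return {degree: count / total for degree, count in counts.items()}


def count_connected_components(adj):
    """Count components with an iterative depth-first search."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components


def edge_connectivity_score(d_w, c_gen, c_ref):
    """Combine s_degree = 1 - d_W with the component score, weighted 0.7 / 0.3."""
    s_degree = 1.0 - d_w
    s_conn = 1.0 if c_gen == c_ref else math.exp(-abs(c_gen - c_ref))
    return 0.7 * s_degree + 0.3 * s_conn
```

An isolated vertex counts as its own component in this sketch, so a generated CP with stray vertices is penalized through $s_{conn}$ even when its degree distribution is close to the reference.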

\subsection{Face Relations Similarity (\texttt{compare\_face\_relations})}
Compares properties of the faces in the two CPs.
\begin{itemize}
    \item \textbf{Face Count Similarity} ($s_{f\_count}$):
    $$ s_{f\_count} = e^{-\frac{|F_{gen} - F_{ref}|}{\max(1, \min(F_{gen}, F_{ref}))}} $$
    where $F_{gen}$ and $F_{ref}$ are the number of faces.
    \item \textbf{Average Vertices per Face Similarity} ($s_{f\_avg\_v}$):
    Let $avgV_{gen}$ and $avgV_{ref}$ be the average number of vertices per face.
    $$ s_{f\_avg\_v} = e^{-\frac{|avgV_{gen} - avgV_{ref}|}{\max(1, \min(avgV_{gen}, avgV_{ref}))}} $$
    \item \textbf{Face Size Distribution Similarity} ($s_{f\_dist}$):
    The distribution of face sizes (number of vertices per face) is computed for both CPs. A simplified Wasserstein distance ($d_W$) is calculated between these distributions using \texttt{calculate\_wasserstein\_distance}. The score is $s_{f\_dist} = 1 - d_W$.
    \item The final face relations score $s_{face}$ is a weighted average: $s_{face} = 0.3 \cdot s_{f\_count} + 0.3 \cdot s_{f\_avg\_v} + 0.4 \cdot s_{f\_dist}$.
\end{itemize}
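
A minimal sketch of how the three face sub-scores combine is given below. The names \texttt{ratio\_decay} and \texttt{face\_relations\_score} are hypothetical, $d_W$ is passed in rather than computed (its simplified definition is not specified here), and both face lists are assumed to be non-empty.

```python
import math


def ratio_decay(a, b):
    """exp(-|a - b| / max(1, min(a, b))), the decay form used for counts and averages."""
    return math.exp(-abs(a - b) / max(1, min(a, b)))


def face_relations_score(faces_gen, faces_ref, d_w):
    """Weighted combination of the face count, average size, and distribution scores."""
    # Face count similarity.
    s_count = ratio_decay(len(faces_gen), len(faces_ref))
    # Average number of vertices per face (assumes non-empty face lists).
    avg_gen = sum(len(f) for f in faces_gen) / len(faces_gen)
    avg_ref = sum(len(f) for f in faces_ref) / len(faces_ref)
    s_avg = ratio_decay(avg_gen, avg_ref)
    # Face size distribution similarity from a precomputed distance d_W.
    s_dist = 1.0 - d_w
    return 0.3 * s_count + 0.3 * s_avg + 0.4 * s_dist
```

For example, comparing two triangles against a single quadrilateral gives a count term of $e^{-1}$ and an average-size term of $e^{-1/3}$ before the distribution term is added.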

\subsection{Crease Assignment Similarity (\texttt{compare\_crease\_assignment})}
Compares the distribution of crease types (M, V, B) if \texttt{"edges\_assignment"} is available.
\begin{itemize}
    \item If either CP lacks edge assignments, a low score of $0.2$ is returned.
    \item \textbf{Crease Type Counts} (\texttt{count\_crease\_types}): Counts the occurrences of Mountain (\texttt{"M"}), Valley (\texttt{"V"}), Boundary (\texttt{"B"}), Flat (\texttt{"F"}), and Unassigned (\texttt{"U"}) creases.
    \item \textbf{Proportion Similarity}: For Mountain, Valley, and Boundary creases, the similarity of their proportions ($prop$) in the generated ($gen$) and reference ($ref$) CPs is calculated:
        \begin{itemize}
            \item Mountain: $s_M = 1 - |\text{prop}_{M,gen} - \text{prop}_{M,ref}|$
            \item Valley: $s_V = 1 - |\text{prop}_{V,gen} - \text{prop}_{V,ref}|$
            \item Boundary: $s_B = 1 - |\text{prop}_{B,gen} - \text{prop}_{B,ref}|$
        \end{itemize}
        where each proportion is the count of that crease type divided by the total number of assigned edges in that CP.
    \item \textbf{Length Penalty} ($p_L$): A penalty is applied if the total number of assigned edges differs:
    $$ p_L = \frac{\min(L_{gen}, L_{ref})}{\max(L_{gen}, L_{ref})} $$
    where $L$ is the total number of assigned edges.
    \item The final crease assignment score $s_{crease}$ is a weighted average of the proportion scores, multiplied by the length penalty:
    $$ s_{crease} = (0.4 \cdot s_M + 0.4 \cdot s_V + 0.2 \cdot s_B) \cdot p_L $$
\end{itemize}
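
The scoring rule above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the ``total number of assigned edges'' is the length of the \texttt{edges\_assignment} list, including any \texttt{"F"} and \texttt{"U"} entries; the actual \texttt{eval.py} may count differently.

```python
def compare_crease_assignment_sketch(assign_gen, assign_ref):
    """Score two lists of crease assignments such as ["B", "M", "V", ...]."""
    # A missing or empty assignment list yields the low default score.
    if not assign_gen or not assign_ref:
        return 0.2

    def proportion(assignments, kind):
        # Assumption: the denominator is the full assignment list length.
        return assignments.count(kind) / len(assignments)

    sims = {
        kind: 1.0 - abs(proportion(assign_gen, kind) - proportion(assign_ref, kind))
        for kind in ("M", "V", "B")
    }
    # Length penalty: ratio of the shorter list to the longer one.
    length_penalty = min(len(assign_gen), len(assign_ref)) / max(len(assign_gen), len(assign_ref))
    return (0.4 * sims["M"] + 0.4 * sims["V"] + 0.2 * sims["B"]) * length_penalty
```

Because only proportions are compared, swapping a single mountain crease for a valley crease in a five-edge CP still scores $0.84$ in this sketch; the metric does not check which specific edge carries which assignment.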

\subsection{Geometric Similarity (\texttt{calculate\_geometric\_similarity})}
This dimension evaluates the similarity of the spatial characteristics of the compiled/folded models. It requires compiling the CPs into 3D models.
\begin{itemize}
    \item \textbf{CP Compilation} (\texttt{compile\_cp\_to\_model}):
        \begin{itemize}
            \item If the Flat-Folder \texttt{compute.compute\_folded\_state(cp\_data)} API is available, it is used to obtain the folded model data (typically including 3D vertex coordinates \texttt{"P"} and crease edges \texttt{"SP"}).
            \item If Flat-Folder is unavailable, a \texttt{simplified\_folding} fallback is used, which returns the original 2D vertex coordinates as \texttt{"P"} and the edges as \texttt{"SP"}. This is a significant simplification: no folding is simulated, so the geometric comparison is effectively made on the unfolded crease patterns.
        \end{itemize}
    \item If either CP fails to compile (or provide simplified data), a low score of $0.2$ is returned by \texttt{calculate\_geometric\_similarity}.
\end{itemize}
The overall geometric similarity score $S_{geometric}$ is a weighted average defined within \texttt{calculate\_geometric\_similarity}:
$$ S_{geometric} = 0.4 \cdot s_{point} + 0.3 \cdot s_{angle} + 0.3 \cdot s_{size} $$

\subsection{Point Position Similarity (\texttt{compare\_point\_positions})}
Compares the 3D point clouds of the folded models.
\begin{itemize}
    \item \textbf{Coordinate Normalization} (\texttt{normalize\_coordinates}): Vertex coordinates (from \texttt{"P"}) of both models are normalized. If points are 2D, a Z-coordinate of 0 is added. Points are then translated so that their centroid is at the origin and scaled so that the maximum distance from the origin to any point is 1 (i.e., the point set fits inside the unit sphere).
    \item \textbf{Bidirectional Hausdorff Distance} (\texttt{calculate\_bidirectional\_hausdorff}): The Hausdorff distance $d_H(A,B) = \max \left( \sup_{a \in A} \inf_{b \in B} d(a,b), \sup_{b \in B} \inf_{a \in A} d(a,b) \right)$ is calculated between the normalized point sets of the generated ($P_{gen}$) and reference ($P_{ref}$) models, where $d(a,b)$ is the Euclidean distance. This is achieved by calling \texttt{calculate\_hausdorff\_distance} twice, once in each direction.
    \item The point position similarity score $s_{point}$ is calculated using an exponential decay function:
    $$ s_{point} = e^{-k \cdot d_H} $$
    where $k=5$ is a sensitivity coefficient.
\end{itemize}
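
The normalization and distance steps can be sketched in plain Python as follows. This is a brute-force illustration under stated assumptions, not the \texttt{eval.py} implementation: the names \texttt{directed\_hausdorff} and \texttt{point\_position\_score} are hypothetical, point sets are assumed to be non-empty, and \texttt{math.dist} requires Python 3.8 or later.

```python
import math


def normalize_coordinates(points):
    """Pad to 3D, center on the centroid, and scale so the farthest point has norm 1."""
    pts = [tuple(p) + (0.0,) * (3 - len(p)) for p in points]  # 2D points get z = 0
    n = len(pts)
    centroid = tuple(sum(p[i] for p in pts) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in pts]
    radius = max(math.dist(p, (0.0, 0.0, 0.0)) for p in centered)
    if radius == 0:
        # All points coincide; nothing to scale (assumed handling).
        return centered
    return [tuple(c / radius for c in p) for p in centered]


def directed_hausdorff(a, b):
    """sup over a of the distance to the nearest point of b (O(|a| * |b|))."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def bidirectional_hausdorff(a, b):
    """Maximum of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def point_position_score(points_gen, points_ref, k=5.0):
    """exp(-k * d_H) on the normalized point sets."""
    d_h = bidirectional_hausdorff(
        normalize_coordinates(points_gen), normalize_coordinates(points_ref)
    )
    return math.exp(-k * d_h)
```

As described, the normalization removes translation and uniform scale but not rotation, so in this sketch two congruent models in different orientations can still receive a score below $1$.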

\subsection{Angle Similarity (\texttt{compare\_angles})}
Compares the distribution of dihedral angles along creases in the folded models.
\begin{itemize}
    \item \textbf{Crease Edge Extraction} (\texttt{extract\_crease\_edges}): Crease edges are extracted from the folded model data (typically from \texttt{"SP"}).
    \item \textbf{Dihedral Angle Calculation} (\texttt{calculate\_dihedral\_angles}):
        \begin{itemize}
            \item \textbf{Note}: In the provided \texttt{eval.py}, if Flat-Folder is unavailable, this function returns a list of \textit{random angles} as a placeholder, so the resulting $s_{angle}$ is not meaningful in that case. A proper implementation would calculate the actual dihedral angle between the two faces sharing each crease.
        \end{itemize}
    \item \textbf{Angle Histogram Comparison} (\texttt{compare\_angle\_histograms}):
        \begin{itemize}
            \item \texttt{create\_histogram}: Histograms of dihedral angles are created for both models. Angles are typically in $[0, 180^{\circ}]$, binned into 18 bins (10 degrees per bin).
            \item \texttt{calculate\_cosine\_similarity}: The cosine similarity between the two angle histogram vectors is calculated. This value serves as the angle similarity score $s_{angle}$.
        \end{itemize}
    \item If creases cannot be extracted or angles cannot be calculated for either model, a default score of $0.5$ is returned by \texttt{compare\_angles}.
\end{itemize}
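
The histogram comparison can be sketched as follows, taking lists of dihedral angles in degrees as given. The clamping of out-of-range angles into the first or last bin and the value returned for an all-zero histogram are assumptions of this sketch, not documented behavior of \texttt{eval.py}.

```python
import math


def create_histogram(angles_deg, bins=18, max_angle=180.0):
    """Count angles in equal-width bins over [0, max_angle] (10 degrees each by default)."""
    hist = [0] * bins
    width = max_angle / bins
    for angle in angles_deg:
        # Clamp so that 180 degrees falls in the last bin (assumed handling).
        idx = max(0, min(int(angle // width), bins - 1))
        hist[idx] += 1
    return hist


def calculate_cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Assumed convention for an empty histogram.
    return dot / (norm_u * norm_v)


def compare_angle_histograms(angles_gen, angles_ref):
    """Angle similarity as the cosine similarity of the two histograms."""
    return calculate_cosine_similarity(
        create_histogram(angles_gen), create_histogram(angles_ref)
    )
```

Cosine similarity ignores the overall magnitude of the histograms, so in this sketch two models with the same angle proportions but different crease counts score close to $1$.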

\subsection{Size and Proportions Similarity (\texttt{compare\_size\_and\_proportions})}
Compares the overall dimensions and aspect ratios of the folded models' bounding boxes.
\begin{itemize}
    \item \textbf{Bounding Box Calculation} (\texttt{calculate\_bounding\_box}): The axis-aligned bounding box (min/max coordinates along X, Y, Z) is computed for the point clouds of both models. 2D points are padded with Z=0.
    \item \textbf{Proportion Calculation}: The dimensions (length, width, height) of the bounding boxes are calculated. These dimensions are sorted in descending order and then normalized by dividing by the largest dimension (e.g., $[1, L_2/L_1, L_3/L_1]$).
    \item \textbf{Similarity Score}: The cosine similarity between the normalized proportion vectors of the two models is calculated using \texttt{calculate\_cosine\_similarity}. This value is the size and proportions similarity score $s_{size}$.
    \item If either point set is empty, a default score of $0.5$ is returned by \texttt{compare\_size\_and\_proportions}.
\end{itemize}
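
A minimal sketch of the bounding-box comparison is shown below. The helper names other than \texttt{calculate\_cosine\_similarity} are hypothetical, and returning an all-zero proportion vector for a degenerate (single-point) model is an assumption of this sketch.

```python
import math


def bounding_box_proportions(points):
    """Sorted bounding-box extents divided by the largest extent, e.g. [1, L2/L1, L3/L1]."""
    pts = [tuple(p) + (0.0,) * (3 - len(p)) for p in points]  # pad 2D points with z = 0
    dims = sorted(
        (max(p[i] for p in pts) - min(p[i] for p in pts) for i in range(3)),
        reverse=True,
    )
    largest = dims[0]
    if largest == 0:
        return [0.0, 0.0, 0.0]  # Degenerate box (assumed handling).
    return [d / largest for d in dims]


def calculate_cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def size_and_proportions_score(points_gen, points_ref):
    """Default 0.5 for an empty point set, otherwise cosine similarity of proportions."""
    if not points_gen or not points_ref:
        return 0.5
    return calculate_cosine_similarity(
        bounding_box_proportions(points_gen), bounding_box_proportions(points_ref)
    )
```

Because the extents are sorted and divided by the largest one, the score in this sketch depends only on the aspect ratios of the boxes, not on their absolute size or on which axis is longest.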

\subsection{Foldability Constraint Satisfaction (\texttt{calculate\_foldability\_similarity})}
This dimension assesses whether the generated CP adheres to known origami foldability constraints, beyond basic geometric foldability.
\begin{itemize}
    \item \textbf{Basic Foldability Check (Optional)}:
        \begin{itemize}
            \item If Flat-Folder's \texttt{compute.check\_foldability(cp\_data)} API is available, it is used to check whether both CPs are foldable.
            \item If the reference CP is foldable but the generated CP is not, \texttt{calculate\_foldability\_similarity} returns $0.2$.
        \end{itemize}
\end{itemize}
The overall foldability score $S_{foldability}$ is a weighted average defined within \texttt{calculate\_foldability\_similarity}:
$$ S_{foldability} = 0.3 \cdot s_{TT} + 0.3 \cdot s_{TTo} + 0.2 \cdot s_{Trans} + 0.2 \cdot s_{flatfold} $$
If an exception occurs during calculation, \texttt{calculate\_foldability\_similarity} returns a score of $0.3$.

\subsection{Specific Origami Constraint Comparison}
This involves extracting and comparing critical origami constraints.
\begin{itemize}
    \item \textbf{Constraint Extraction} (\texttt{extract\_constraints}):
        \begin{itemize}
            \item This method aims to extract Taco-Taco (\texttt{TT}), Taco-Tortilla (\texttt{TTo}), and Transitivity (\texttt{Trans}) constraints by calling helper methods like \texttt{extract\_taco\_taco\_constraints}.
            \item \textbf{Note}: In the provided \texttt{eval.py}, if Flat-Folder's \texttt{constraints} module is unavailable, the extraction methods are simplified and return empty lists. A full implementation would identify these constraints from the CP geometry and crease assignments.
        \end{itemize}
    \item \textbf{Constraint Set Comparison} (\texttt{compare\_taco\_taco\_constraints}, \texttt{compare\_taco\_tortilla\_constraints}, \texttt{compare\_transitivity\_constraints}):
    For each constraint type (TT, TTo, Trans):
        \begin{itemize}
            \item If both CPs have no such constraints, similarity is $1.0$.
            \item If one has constraints and the other doesn't, similarity is $0.3$.
            \item Otherwise:
                \begin{itemize}
                    \item \textbf{Constraint Overlap} ($s_{overlap}$): Calculated by \texttt{calculate\_constraint\_overlap} as the Jaccard similarity of the two constraint sets $A$ (generated) and $B$ (reference), whose elements are stringified for comparison:
                    $$ s_{overlap} = J(A,B) = \frac{|A \cap B|}{|A \cup B|} $$
                    \item \textbf{Count Similarity} ($s_{count}$):
                    $$ s_{count} = e^{-\frac{|N_{gen} - N_{ref}|}{\max(1, \min(N_{gen}, N_{ref}))}} $$
                    where $N$ is the number of constraints of that type.
                    \item The score for that constraint type (e.g., $s_{TT}$) is $0.7 \cdot s_{overlap} + 0.3 \cdot s_{count}$.
                \end{itemize}
        \end{itemize}
\end{itemize}
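
The per-type comparison rule can be sketched as follows. The function name \texttt{constraint\_similarity} and the representation of a constraint as a tuple of face indices are hypothetical; only the scoring rule itself is taken from the description above.

```python
import math


def constraint_similarity(cons_gen, cons_ref):
    """Score one constraint type (e.g. Taco-Taco) for a generated and a reference CP."""
    # Both empty: nothing to disagree on.
    if not cons_gen and not cons_ref:
        return 1.0
    # Exactly one side has constraints.
    if not cons_gen or not cons_ref:
        return 0.3

    # Jaccard similarity on stringified constraints.
    set_gen = {str(c) for c in cons_gen}
    set_ref = {str(c) for c in cons_ref}
    s_overlap = len(set_gen & set_ref) / len(set_gen | set_ref)

    # Exponential decay on the difference in constraint counts.
    n_gen, n_ref = len(cons_gen), len(cons_ref)
    s_count = math.exp(-abs(n_gen - n_ref) / max(1, min(n_gen, n_ref)))

    return 0.7 * s_overlap + 0.3 * s_count
```

Note that stringified comparison is order-sensitive in this sketch: \texttt{(0, 1, 2, 3)} and \texttt{(1, 0, 2, 3)} count as different constraints unless they are put into a canonical order first.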

\subsection{Local Flat-Foldability Conditions (\texttt{compare\_flat\_foldability})}
Checks for adherence to local flat-folding theorems around vertices.
\begin{itemize}
    \item \textbf{Kawasaki's Theorem Check} (\texttt{check\_kawasaki\_theorem}):
        \begin{itemize}
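            \item States that an interior vertex with consecutive sector angles $\alpha_1, \dots, \alpha_{2n}$ can fold flat only if the alternating sum of these angles vanishes, $\sum_{i=1}^{2n} (-1)^i \alpha_i = 0$. Because the angles around the vertex sum to $360^{\circ}$, this is equivalent to the odd-indexed angles and the even-indexed angles each summing to $180^{\circ}$.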
            \item \textbf{Note}: The mock implementation in \texttt{eval.py} always returns \texttt{True}. A full implementation would iterate internal vertices and check angles.
        \end{itemize}
    \item \textbf{Maekawa's Theorem Check} (\texttt{check\_maekawa\_theorem}):
        \begin{itemize}
            \item States that for a flat-foldable vertex, the number of mountain creases ($M$) and valley creases ($V$) must differ by two: $|M - V| = 2$.
            \item \textbf{Note}: The mock implementation in \texttt{eval.py} always returns \texttt{True}. A full implementation would check crease assignments around internal vertices.
        \end{itemize}
    \item \textbf{Scoring}:
        \begin{itemize}
            \item Kawasaki score ($s_{K}$): $0.2$ if the theorem holds for the reference CP but not for the generated CP, $1.0$ otherwise.
            \item Maekawa score ($s_{Mk}$): $0.2$ if the theorem holds for the reference CP but not for the generated CP, $1.0$ otherwise.
        \end{itemize}
    \item The final flat-foldability score is $s_{flatfold} = 0.5 \cdot s_{K} + 0.5 \cdot s_{Mk}$.
\end{itemize}
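
Since the checks in \texttt{eval.py} are mocks, the following sketch shows what a per-vertex test of the two theorems might look like. It is not the authors' implementation: the function names, the tolerance \texttt{tol}, and the inputs (the coordinates of one interior vertex, the endpoints of its incident creases, and their assignments) are all assumptions made for illustration.

```python
import math


def sector_angles(center, neighbors):
    """Angles (radians) between consecutive creases around `center`, in angular order."""
    directions = sorted(
        math.atan2(y - center[1], x - center[0]) for x, y in neighbors
    )
    n = len(directions)
    return [(directions[(i + 1) % n] - directions[i]) % (2 * math.pi) for i in range(n)]


def satisfies_kawasaki(center, neighbors, tol=1e-9):
    """True if the alternating sum of sector angles is zero (within `tol`)."""
    angles = sector_angles(center, neighbors)
    # A flat-foldable interior vertex has an even number of creases.
    if len(angles) < 2 or len(angles) % 2 != 0:
        return False
    alternating = sum(a if i % 2 == 0 else -a for i, a in enumerate(angles))
    return abs(alternating) <= tol


def satisfies_maekawa(assignments):
    """True if mountain and valley counts around the vertex differ by exactly two."""
    mountains = assignments.count("M")
    valleys = assignments.count("V")
    return abs(mountains - valleys) == 2
```

A full check would apply these tests to every interior vertex and skip boundary vertices, to which neither theorem applies.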

\subsection{Final Folded State Similarity (\texttt{compare\_final\_folded\_state})}
This dimension directly compares the 3D geometry of the final folded shapes compiled from the generated and reference CPs.
\begin{itemize}
    \item \textbf{CP Compilation}: As in the geometric similarity dimension, \texttt{compile\_cp\_to\_model} is used. If compilation fails for either CP (i.e., it returns a falsy value), \texttt{compare\_final\_folded\_state} returns a score of $0.3$.
    \item \textbf{Point Cloud Extraction}: 3D vertex coordinates (\texttt{"P"}) are extracted from the compiled models. If point clouds are missing for either, a score of $0.3$ is returned.
\end{itemize}
The overall final folded state score $S_{final\_state}$ is a weighted average defined within \texttt{compare\_final\_folded\_state}:
$$ S_{final\_state} = 0.7 \cdot s_{shape} + 0.3 \cdot s_{layer} $$
If an exception occurs during calculation, \texttt{compare\_final\_folded\_state} returns $0.3$.

\subsection{Overall Shape Similarity}
\begin{itemize}
    \item Calculated using the bidirectional Hausdorff distance $d_H$ between the (normalized) point clouds of the generated and reference folded models, identical to the method in \texttt{compare\_point\_positions}.
    \item The shape similarity score $s_{shape}$ is:
    $$ s_{shape} = e^{-5 \cdot d_H} $$
\end{itemize}

\subsection{Layering Similarity (\texttt{compare\_layers})}
Compares the stacking order of faces/layers in the folded state.
\begin{itemize}
    \item This relies on layering information being present in the compiled model, typically under a key like \texttt{"CF"} (face assignments or configuration).
    \item \textbf{Note}: The \texttt{compare\_layers} function in the provided \texttt{eval.py} is a simplified placeholder and returns a default score of $0.5$. A full implementation would require a detailed comparison of the layer graph or face ordering.
    \item The score is $s_{layer}$.
\end{itemize}
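
To illustrate how the two components of the final folded state score combine, take a hypothetical Hausdorff distance $d_H = 0.1$ between the normalized point clouds, together with the placeholder value $s_{layer} = 0.5$:
$$ s_{shape} = e^{-5 \cdot 0.1} = e^{-0.5} \approx 0.607, \qquad S_{final\_state} = 0.7 \cdot 0.607 + 0.3 \cdot 0.5 \approx 0.575. $$
As long as \texttt{compare\_layers} returns the constant $0.5$, the layering term always contributes $0.15$, so $S_{final\_state}$ stays between $0.15$ and $0.85$ whenever no fallback score is triggered.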