% We use classical data structures implementations in C++ for our micro-benchmarks~\cite{cpp_ds}. It contains 9 data structures with associated unit tests: \CodeIn{AvlTree}, \CodeIn{BinarySearchTree}, \CodeIn{HashTable}, \CodeIn{Heap}, \CodeIn{LinkedList}, \CodeIn{Queue}, \CodeIn{RedBlackTree}, \CodeIn{Stack}, \CodeIn{Vector}. We modified the code to remove templates and fixed some implementation bugs to ensure the source code correctness. The following experiments are all run on this benchmark.

\begin{table}[!htp]\centering
\caption{Characteristics of the benchmark data structures. \# LoC is the number of lines of code, \# methods is the number of implemented methods, and \# dep. is the number of dependent classes for each data structure.}\label{tab:benchmark-detail}
\small
\begin{tabular}{@{}l@{\hspace{8pt}}|rrrrrrrrr@{}}\toprule
&\multicolumn{1}{c}{avl\_} &\multicolumn{1}{c}{binary\_} &\multicolumn{1}{c}{hash\_} &\multicolumn{1}{c}{heap} &\multicolumn{1}{c}{linked\_} &\multicolumn{1}{c}{queue} &\multicolumn{1}{c}{red\_black\_} &\multicolumn{1}{c}{stack} &\multicolumn{1}{c}{vector} \\
&\multicolumn{1}{c}{tree} &\multicolumn{1}{c}{search\_tree} &\multicolumn{1}{c}{table} & &\multicolumn{1}{c}{list} & &\multicolumn{1}{c}{tree} & & \\\midrule
\# LoC &249 &229 &176 &135 &172 &117 &282 &96 &117 \\ 
\# methods &25 &22 &11 &14 &14 &12 &25 &11 &18 \\
\# dep. &1 &1 &0 &0 &0 &0 &1 &0 &0 \\
\bottomrule
\end{tabular}
\end{table}

We use a C++ implementation of classical data structures for our micro-benchmarks~\cite{cpp_ds}. It includes nine data structures with associated unit tests: \CodeIn{AvlTree}, \CodeIn{BinarySearchTree}, \CodeIn{HashTable}, \CodeIn{Heap}, \CodeIn{LinkedList}, \CodeIn{Queue}, \CodeIn{RedBlackTree}, \CodeIn{Stack}, and \CodeIn{Vector}.
Table~\ref{tab:benchmark-detail} summarizes the characteristics of these data structures.
To ensure correctness, we thoroughly examined each data structure and fixed a few implementation bugs; we treat the corrected benchmark as the ground truth. All subsequent experiments use this corrected benchmark.
\shuvendu{It would be interesting to mention how the bug was found with the generated invariants and tests, with the user in the loop to not remove an invariant that failed the test. Perhaps informally, in the Discussion section.}





In addition to these textbook examples, we conducted experiments on utility classes~\cite{z3util} from \zthree~\cite{z3}: \CodeIn{ema}, \CodeIn{dlist}, \CodeIn{heap}, \CodeIn{hashtable}, \CodeIn{permutation}, \CodeIn{scoped\_vector}, and the most complex class, \CodeIn{bdd\_manager}. We discuss \CodeIn{bdd\_manager} in detail as a case study, including an evaluation with one of the authors of \zthree.

% \shuvendu{Provide number of lines of code for the benchmarks, number of methods and perhaps number of sub-classes.}
