\begin{wrapfigure}{r}{0.5\textwidth}
    \centering
    \includegraphics[width=0.5\textwidth]{figure/validation.pdf}
    \caption{Evaluation of \tech-generated invariants}
    \label{fig:validation}
\end{wrapfigure}

\subsection{Correctness}
\label{subsec:correctness}

\tech produces \textit{filtered invariants}, which we evaluate using an automated pipeline (Figure~\ref{fig:validation}) against our benchmark (Section~\ref{sec:benchmark}). A \textit{filtered invariant} is considered correct if it reports no errors for any tests that successfully compile and run. Our manual review confirmed that all validated \textit{filtered invariants} are indeed correct, capturing essential properties of the data structures.

\parabf{Benefits of Co-Generation.}
When generating invariants in isolation, \tech produces an average of $25$ unique invariants per benchmark, of which $77\%$ pass the unit tests. With test co-generation, \tech eliminates all incorrect invariants, raising the pass rate to $100\%$.

\parabf{Refinement Effectiveness.}
After refinement, the number of \textit{filtered invariants} grows from $17$ to $22$ per example, representing a $29\%$ increase. This demonstrates \tech's ability to transform potentially buggy invariants into valid ones through feedback-guided refinement.
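
As a quick check of the reported increase, using only the two averages above:
\[
    \frac{22 - 17}{17} = \frac{5}{17} \approx 29.4\%,
\]
which rounds to the stated $29\%$.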

\parabf{Summary.}
\tech's invariant-test co-generation approach improves correctness from $77\%$ to $100\%$. The \textit{filtering tests} effectively identify valid invariants, while the refinement process successfully expands the set of correct invariants.

\input{ablation_tables}

\subsection{Completeness}
\label{subsec:completeness}

\begin{table}[h!]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{3pt}
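    \caption{\tech performance over the baseline on mutants that previously survived. For each data structure, the table lists the mutants left unkilled by the baseline (count, with the percentage of that program's mutants in parentheses), the additional mutants killed by \tech, and the resulting percentage improvement.}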
    \begin{tabular}{lrrr}
        \toprule
        \textbf{Data Structure} & \textbf{Unsolved by Base, \# (\%)} & \textbf{Add. by \tech (\#)} & \textbf{Impr. (\%)} \\
        \midrule
        binary\_search\_tree & 107(23.67) & 7 & 6.54 \\ 
        hash\_table & 258(37.83) & 38 & 14.73 \\ 
        heap & 108(32.24) & 12 & 11.11 \\ 
        linked\_list & 57(13.54) & 2 & 3.51 \\ 
        red\_black\_tree & 184(27.10) & 9 & 4.89 \\ 
        stack & 67(28.39) & 6 & 8.96 \\ 
        vector & 101(29.79) & 33 & 32.67 \\ 
        avl\_tree & 84(17.57) & 0 & 0.00 \\ 
        queue & 91(26.30) & 11 & 12.09 \\
        \midrule
        \textbf{Total} & \textbf{1057} & \textbf{118} & \textbf{11.16} \\
        \bottomrule
    \end{tabular}
    \label{table:specbot_baseline_comparison}
\end{table}

To evaluate completeness, we use mutation testing. This independent mutant-killing oracle mitigates the co-adaptation risk discussed in Section~\ref{sec:limitation}. We generate mutants using mutate\_cpp~\cite{mutatecpp}, producing between 236 and 682 mutants per program. We focus on mutants that either compile successfully but crash during execution or survive execution without errors.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{figure/test_vs_classinvgen.pdf}
    \caption{Completeness experiment results. From left to right, the three bars show Tests, \tech, and Tests+\tech; Tests+\tech kills the most mutants.}
    \label{fig:mutation}
\end{figure}

We evaluate three configurations: unit tests only, \tech only, and unit tests combined with \tech (the strongest test oracle). As shown in Table~\ref{table:specbot_baseline_comparison}, \tech's invariants kill an additional $11.16\%$ of the mutants that survive unit tests alone ($118$ of $1057$), with per-structure improvements of up to $32.67\%$ (vector).
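
For reference, the improvement column of Table~\ref{table:specbot_baseline_comparison} can be reproduced from its other two columns. Writing $U_d$ for the number of mutants that survive the baseline on data structure $d$ and $A_d$ for the additional mutants killed by \tech (symbols introduced here only for illustration), each row reports
\[
    \mathrm{Impr}_d = \frac{A_d}{U_d} \times 100\%,
    \qquad
    \mathrm{Impr}_{\mathrm{vector}} = \frac{33}{101} \times 100\% \approx 32.67\%.
\]
The total row pools the counts rather than averaging the per-row percentages:
\[
    \frac{\sum_d A_d}{\sum_d U_d} \times 100\% = \frac{118}{1057} \times 100\% \approx 11.16\%.
\]
An unweighted mean of the nine per-row percentages would instead give roughly $10.5\%$.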

Figure~\ref{fig:mutation} shows that unit tests combined with \tech kill the most mutants. Figures~\ref{fig:inv_addition_kill} and~\ref{fig:inv_addition_kill_const} show two mutants that survived the unit tests but were killed by \tech's invariants.

\begin{figure}[h]
\centering
\begin{lstlisting}[language=c++, escapechar=!]
 void HashTable::clear_table() {
   this->table.clear();
   this->_num_elements = 0;
!\CodeDelete\textbf{-}!  this->_size = 0;
!\CodeAdd\textbf{+}!  this->_size += 0;
 }
\end{lstlisting}
    \caption{Mutant that survived unit tests but was killed by \tech}
    \label{fig:inv_addition_kill}
\end{figure}

\begin{figure}[h]
\centering
\begin{lstlisting}[language=c++, escapechar=!]
   this->hash_function = hash_function;
   this->_num_elements = 0;
   this->_size = size;
!\CodeDelete\textbf{-}!  this->load_factor = 0.75;
!\CodeAdd\textbf{+}!  this->load_factor = -0.75;
   this->table =
       std::vector<std::shared_ptr<std::vector<std::pair<Key, Value>>>>(size);
 }
\end{lstlisting}
    \caption{Another mutant that survived unit tests but was killed by \tech}
    \label{fig:inv_addition_kill_const}
\end{figure}
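
To make concrete how a class invariant can kill mutants of this kind, the following is a minimal sketch rather than the invariants \tech actually generated. It assumes a simplified, hypothetical \CodeIn{MiniHashTable} that models only the fields touched by the two mutants, and an assumed invariant \CodeIn{invariantHolds} requiring that the recorded size match the bucket count and that the load factor be positive. All names in the sketch are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for the benchmark's HashTable.
// Only the fields touched by the two mutants are modeled.
class MiniHashTable {
public:
    MiniHashTable(std::size_t size, double load_factor)
        : table_(size), size_(size), num_elements_(0), load_factor_(load_factor) {}

    // Assumed class invariant (illustrative only):
    //   1. the recorded size equals the actual number of buckets;
    //   2. the load factor is strictly positive.
    bool invariantHolds() const {
        return size_ == table_.size() && load_factor_ > 0.0;
    }

    // Original behavior: clearing the table also resets the recorded size.
    void clearTable() {
        table_.clear();
        num_elements_ = 0;
        size_ = 0;
    }

    // Mutated behavior: "size_ = 0" replaced by "size_ += 0",
    // so the recorded size goes stale after the buckets are cleared.
    void clearTableMutant() {
        table_.clear();
        num_elements_ = 0;
        size_ += 0;
    }

private:
    std::vector<int> table_;    // buckets (element type simplified to int)
    std::size_t size_;          // recorded number of buckets
    std::size_t num_elements_;  // number of stored elements
    double load_factor_;        // resize threshold
};
```

Under these assumptions, checking \CodeIn{invariantHolds} after each public method flags both mutants: the stale size after \CodeIn{clearTableMutant} violates the first conjunct, and a table constructed with a load factor of $-0.75$ violates the second. A unit test that only inspects the stored elements after clearing would not observe either change.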

\subsection{Comparison of \tech vs. Daikon}
\label{subsec:daikon_compare}

We compared \tech with Daikon~\cite{ernst2007daikon}, using the \textit{filtering tests} to generate program traces for Daikon's invariant detector. On average, Daikon reports about $5.2$ invariants per benchmark example, of which $0.78$ are incorrect (Table~\ref{tab:daikon_invariants}).

\begin{table}[h!]
    \centering
    \small
    \caption{Daikon Incorrect Invariants per Benchmark}
    \begin{tabular}{lcc}
        \toprule
        \textbf{Data Structure} & \textbf{\# Total Invariants} & \textbf{\# Incorrect Invariants} \\
        \midrule
        hash\_table           & 8 & 1 \\
        binary\_search\_tree  & 3 & 0 \\
        heap                  & 10 & 1 \\
        red\_black\_tree      & 2 & 0 \\
        avl\_tree             & 4 & 0 \\
        vector                & 3 & 1 \\
        stack                 & 6 & 2 \\
        queue                 & 7 & 1 \\
        linked\_list          & 4 & 1 \\
        \midrule
        \textbf{Average}    & \textbf{5.2} & \textbf{0.78} \\
        \bottomrule
    \end{tabular}
    \label{tab:daikon_invariants}
\end{table}

Through manual review, we identified 7 incorrect Daikon invariants that pass unit test validation. These invariants pass because both the \textit{filtering tests} and unit tests coincidentally constructed similar data structures.

\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{figure/daikon_classinvgen.pdf}
    \caption{Daikon vs. \tech Kills}
    \label{fig:daikon_specbot}
\end{figure}

Most Daikon invariants simply indicate that class pointers are not null (27 of 40 correct invariants) or that element counts are non-negative (6 invariants). The most valuable invariants, like \CodeIn{this->n < this->maxSize} in \CodeIn{Stack}, have more impact on identifying mutants (Figure~\ref{fig:daikon_specbot}).

This shows a key weakness of Daikon: it cannot differentiate between universally true invariants and those that hold only in specific test contexts. LLMs are better at capturing true ``class'' invariants that are inherent to the data structure rather than incidental to the tests.



