
% \livia{do not use "sound" per Alex's sugguestion}
% \livia{possible attacks to mutation: what if mutants are eq as original}


In this section, we evaluate the quality of the invariants generated by \tech.
Specifically, we explore the following research questions:
\begin{enumerate}
    \item How many of the generated invariants are \textbf{correct} (with respect to the user-provided test cases), and do they capture essential properties of the source code (Section~\ref{subsec:correctness})?
    \item How \textbf{complete} are the invariants in their ability to distinguish the correct program from buggy counterparts (Section~\ref{subsec:completeness})? 
    \item How does \tech compare to Daikon, the most widely adopted tool for dynamic invariant synthesis and a state-of-the-art technique in invariant generation (Section~\ref{subsec:daikon_compare})?
\end{enumerate}


We conducted our experiments on a machine with 24 CPU cores and 64 GB of RAM. We implemented \tech using GPT-4o as the underlying LLM, with its default temperature of 1.
% We investigate the following research questions in our experiments to evaluate our approach:
% \begin{itemize}
%     \item \textbf{RQ1:} What is overall performance of \tech under different configurations? (Ablation, soundness)
%     \item \textbf{RQ2:} How discriminating are the generated invariants? (Mutation, completeness)
%     \item \textbf{RQ3:} How does \tech compare with different baselines such as Daikon\cy{TODO}?
% \end{itemize}


% \dave{The first most important question is how well it achieves the goal of the method.  How many class invariants does it generate?  How many of them are correct (define what that means).  Based on inspection, do the invariants capture important properties?}

% \dave{There isn't much ablation going on here, is there?  Usually, I see a handfull of features and combinations deleted, but this is basically with and without \tech. I think it would read better if you just said that and dropped "ablation"}

