% program invariants utility
Invariants are predicates that hold on the program state for all executions of the program. Many invariants hold only at specific code locations.
For sequential imperative programs, it is useful to associate invariants with entry to a method (preconditions), exit from a method (postconditions), and loop headers (loop invariants).
Further, for stateful classes, class invariants are facts that hold as both preconditions and postconditions of the public methods of the class, and as postconditions of its constructors.
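As a minimal illustration of this definition, the following sketch shows a hypothetical \CodeIn{BoundedStack} class whose invariant is asserted at the end of the constructor and at the entry and exit of each public method. The class, its members, and the chosen invariant are assumptions made only for illustration and are not taken from any codebase discussed in this paper.

\begin{lstlisting}[language=c++]
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example class: a stack with a fixed capacity.
class BoundedStack {
public:
    explicit BoundedStack(std::size_t capacity)
        : m_data(capacity), m_size(0) {
        assert(invariant());  // postcondition of the constructor
    }

    // Class invariant: the logical size never exceeds the capacity.
    bool invariant() const { return m_size <= m_data.size(); }

    // Pushes a value; returns false (and changes nothing) when full.
    bool push(int value) {
        assert(invariant());  // precondition of a public method
        bool ok = m_size < m_data.size();
        if (ok) m_data[m_size++] = value;
        assert(invariant());  // postcondition of a public method
        return ok;
    }

    // Pops the top value into out; returns false when empty.
    bool pop(int& out) {
        assert(invariant());
        bool ok = m_size > 0;
        if (ok) out = m_data[--m_size];
        assert(invariant());
        return ok;
    }

    std::size_t size() const { return m_size; }

private:
    std::vector<int> m_data;  // storage; its length is the capacity
    std::size_t m_size;       // number of live elements
};
\end{lstlisting}

If a method were to corrupt \CodeIn{m\_size}, the assertion at the exit of that method would fail immediately, rather than at some later and unrelated use of the stack.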

These program invariants make explicit the assumptions that a piece of code places on the rest of the program, which aids modular review, reasoning, and analysis.
Program invariants are useful for several aspects of software construction and maintenance during the lifetime of a program.
First, executable program invariants can be enforced at runtime, where they provide an early indicator of state corruption, help with root-cause analysis, and allow a program to halt with an error instead of producing unexpected values.
Runtime invariants serve as additional test oracles to amplify testing efforts to catch subtle bugs related to state corruption; this, in turn, helps with regression testing as the program evolves to satisfy new requirements. 
The utility of program invariants has led to design-by-contract in languages such as Eiffel~\cite{meyer1992eiffel}, as well as support in other languages such as Java (JML~\cite{jml}) and .NET (Code Contracts~\cite{fahndrich2010static}). 
Furthermore, for languages that support static formal verification (e.g., Dafny~\cite{leino2010dafny}, Verus~\cite{verus}, F*~\cite{swamy2011secure}, Frama-C~\cite{kirchner2015frama}), invariants can serve as a part of the specification, helping make formal verification modular and scalable. Unfortunately, invariants are underutilized because they require additional work and are sometimes difficult to write, so it would be useful to find a way to generate them automatically.
% \shuvendu{Review: However, it is well known that programmers do not like to undertake the effort to write such invariants, partly due to their unfamiliarity with the semantics of such specifications and partly due to the difficulty to discover them for large legacy modules.}

% why focus on object invariants
We focus specifically on automating the creation of class invariants for mainstream languages without first-class specification language support (e.g., C++) for several reasons:
\begin{itemize}
\item Class invariants are crucial for maintaining the integrity of data structures and help point to state corruption that may manifest much later within the class or in the clients.
Documenting such implicit contracts can greatly aid maintainers in understanding the class.
\item Class invariants often form important parts of preconditions and postconditions for high-integrity data structures. Encapsulating such invariants and asserting them in preconditions and postconditions helps reduce bloat in the specifications. 
\item Class invariants are challenging for users to write, as writing them requires global reasoning across all the public methods for the class. 
\end{itemize}

For example, consider the class in Figure~\ref{fig:z3_dll} for a doubly-linked list, as implemented in the Z3 SMT solver.\footnote{\url{https://github.com/Z3Prover/z3/blob/master/src/util/dlist.h}}
\begin{figure}[htp]
    \centering

\begin{subfigure}[t]{0.55\linewidth}
\begin{lstlisting}[language=c++]
    void insert_before(T* other) {
        ...
        SASSERT(invariant());
        SASSERT(other->invariant());
        ...
        T* prev = this->m_prev;
        T* other_end = other->m_prev;
        prev->m_next = other;
        other->m_prev = prev;
        other_end->m_next = static_cast<T*>(this);
        this->m_prev = other_end;
        ...
        SASSERT(invariant());
        SASSERT(other->invariant());
        ...
    }
\end{lstlisting}
    
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.42\linewidth}
\begin{lstlisting}[language=c++,firstnumber=17]
    bool invariant() const {
        auto* e = this;
        do {
            if (e->m_next->m_prev != e)
                return false;
            e = e->m_next;
        }
        while (e != this);
        return true;
    }
\end{lstlisting}
\end{subfigure}
    \caption{An invariant in the doubly linked list class in Z3.}
    \label{fig:z3_dll}
\end{figure}

We see that the invariant is asserted four times: as a precondition and as a postcondition, both for the object instance \CodeIn{this} and for \CodeIn{other}.
The invariant is also non-trivial, requiring local variables and a loop.
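One way to avoid such repetition, sketched below under our own assumptions, is a scope guard that checks the invariant when it is constructed and again when it goes out of scope. The \CodeIn{InvariantGuard} template and the simplified \CodeIn{Node} type are hypothetical and are not part of Z3; only the pointer updates and the invariant loop mirror Figure~\ref{fig:z3_dll}.

\begin{lstlisting}[language=c++]
#include <cassert>

// Hypothetical helper (not part of Z3): asserts obj.invariant()
// on construction (precondition) and destruction (postcondition).
template <typename T>
class InvariantGuard {
public:
    explicit InvariantGuard(const T& obj) : m_obj(obj) {
        assert(m_obj.invariant());
    }
    ~InvariantGuard() { assert(m_obj.invariant()); }
    InvariantGuard(const InvariantGuard&) = delete;
    InvariantGuard& operator=(const InvariantGuard&) = delete;

private:
    const T& m_obj;
};

// Hypothetical, simplified circular doubly-linked node.
// A fresh node forms a one-element circular list.
struct Node {
    Node* m_prev = this;
    Node* m_next = this;

    // Same check as in the figure: every next/prev pair agrees.
    bool invariant() const {
        auto* e = this;
        do {
            if (e->m_next->m_prev != e)
                return false;
            e = e->m_next;
        } while (e != this);
        return true;
    }

    // Splices the list containing other in front of this node.
    void insert_before(Node* other) {
        InvariantGuard<Node> guard_this(*this);
        InvariantGuard<Node> guard_other(*other);
        Node* prev = m_prev;
        Node* other_end = other->m_prev;
        prev->m_next = other;
        other->m_prev = prev;
        other_end->m_next = this;
        m_prev = other_end;
    }
};
\end{lstlisting}

With this design, each method states the invariant once per object. As with any \CodeIn{assert}-based check, the guard does nothing when assertions are compiled out.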

% challenges in generating them. Static vs dynamic. 
Synthesizing program invariants has been an active line of research, with both static and dynamic analysis-based approaches. 
Static analysis approaches based on variants of \textit{abstract interpretation}~\cite{cousot1977abstract} and \textit{interpolation}~\cite{henzinger2004abstractions} create invariants that are sound by construction. 
However, such techniques either do not readily apply to mainstream programming languages with complex language constructs, or require highly specialized methods that do not scale to large modules, since the invariants must additionally be \textit{provably inductive} to be retained.
On the other hand, Daikon~\cite{ernst2007daikon} and successors learn invariants dynamically by instantiating a set of templates and retaining the predicates that hold on concrete test cases. 
While Daikon is applicable to any mainstream language, it is well known that Daikon-generated invariants overfit the test cases and may not hold on executions outside the test suite~\cite{polikarpova2009comparative}.
Recent work has studied fine-tuning large language models (LLMs) to learn program invariants~\citep{pei2023learning}, but these methods inherit the limitations of Daikon because their training data consists of Daikon-generated invariants. More importantly, the approach has not been evaluated on stateful classes to construct class invariants.

% Why prior work is insufficient: LLMs have shown promise in spec generation for mainstream languages. Endres, SpecGen, GPCE paper. Work on verification aware languages as well. 
Recent work has \textit{prompted} LLMs such as GPT-4 to generate preconditions, postconditions, and loop invariants for mainstream languages~\cite{nl2postcond,greiner2024automated,ma2024specgen}, but these methods do not readily extend to generating class invariants. These pipelines work at single-loop or single-method scope and validate only scalar, intraprocedural predicates, so they lack the heap-aware, cross-method perspective required to state class invariants.
Further, beyond simple arrays, these methods cannot construct expressive invariants that require iterating over complex data structures (such as the one in Figure~\ref{fig:z3_dll}).

% 
In this work, we introduce \tech, a novel method for generating high-quality object invariants for C++ classes through \emph{co-generation} of invariants and test inputs using LLMs such as GPT-4o. 
We leverage LLMs' ability to generate code to construct invariants that can express properties over complex data structures.
The LLM's ability to consume not only the code of a class but also the surrounding comments and variable names helps establish relationships that are difficult for purely symbolic methods to infer.
Since an LLM can generate incorrect invariants, the method also generates test inputs to \emph{heuristically} prune incorrect candidate invariants. 
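The pruning idea can be sketched as follows. This is a deliberately simplified illustration under our own assumptions, not the implementation of \tech: the \CodeIn{Counter} class, the \CodeIn{Candidate} and \CodeIn{TestInput} types, and the \CodeIn{prune} function are hypothetical. Each candidate invariant is an executable predicate, each test input is a sequence of method calls, and a candidate is discarded as soon as it is falsified after construction or after any call.

\begin{lstlisting}[language=c++]
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class under analysis.
struct Counter {
    int value = 0;
    int limit = 10;
    void increment() { if (value < limit) ++value; }
    void reset() { value = 0; }
};

// A candidate invariant: a label plus an executable predicate.
struct Candidate {
    std::string name;
    std::function<bool(const Counter&)> holds;
};

// A test input: a sequence of calls on a fresh object.
using Step = std::function<void(Counter&)>;
using TestInput = std::vector<Step>;

// Keeps only the candidates that hold after construction and
// after every step of every test. Survivors are not proven
// correct; they are merely not falsified by these tests.
std::vector<Candidate> prune(std::vector<Candidate> candidates,
                             const std::vector<TestInput>& tests) {
    auto drop_falsified = [&candidates](const Counter& c) {
        std::vector<Candidate> kept;
        for (const auto& cand : candidates)
            if (cand.holds(c)) kept.push_back(cand);
        candidates = std::move(kept);
    };
    for (const auto& test : tests) {
        Counter c;
        drop_falsified(c);       // check after construction
        for (const auto& step : test) {
            step(c);
            drop_falsified(c);   // check after each public call
        }
    }
    return candidates;
}
\end{lstlisting}

Because pruning depends on the tests, an incorrect candidate survives whenever no test input exercises the state that falsifies it, which is why the filtering is only heuristic.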
%\shuvendu{What is the message here? LLMs are great at synthesizing the right invariant except hallucinations/subtle bugs? Are LLMs looking at more than variable names and exploiting its world knowledge about standard data structures? Is LLM looking at code or only comments/variable names? Are we only using compiler to provide feedback about syntax issues? Do the tests do anything more than pruning? How do we know LLMs are producing good invariants because they have memorized these canonical data structures taught in textbooks?}


We leverage the framework proposed by Endres et al.~\cite{nl2postcond} to evaluate the test-set correctness and completeness of the generated invariants, given a set of hidden validation tests and mutants.
We contribute a new benchmark comprising standard C++ data structures along with a harness that can help measure both the correctness and completeness of generated invariants (Section~\ref{sec:benchmark}).
We demonstrate that \tech outperforms a pure LLM-based technique for generating program invariants from code (Sections~\ref{subsec:correctness}--\ref{subsec:completeness}) as well as prior data-driven invariant inference techniques such as Daikon (Section~\ref{subsec:daikon_compare}).
We also demonstrate its applicability for real-world code by performing a case study on a set of classes in the Z3 SMT solver codebase, including the relatively complex \CodeIn{bdd\_manager} class; the developers of the codebase confirmed most of the new invariants proposed by \tech for these modules (Section~\ref{sec:z3}). 

Our contributions are summarized below:
\begin{itemize}
    \item We introduce a new technique for invariant-test co-generation by combining simple static analysis with LLMs and implement an end-to-end prototype (Section~\ref{sec:approach}).
    \item We introduce a high-quality \tech-instrumented benchmark for evaluating object invariants  (Section~\ref{sec:benchmark}).
    \item We investigate LLM-assisted class invariant synthesis (Sections~\ref{subsec:correctness}--\ref{subsec:daikon_compare}).
    \item We conduct a case study on Z3 class modules using \tech (Section~\ref{sec:z3}).
\end{itemize}

\tech is conceived as a \emph{specification-drafting aid}: it produces candidate invariants that a developer can accept, refine, or discard, thereby following the long-advocated ``human-in-the-loop'' paradigm in specification mining~\cite{newcomb2019usinghumanintheloopsynthesisauthor}.
Our aim is not to replace expert judgement, but to accelerate it.



%\cy{Problem Statement}
%\livia{bad inv rate (label)}
%\livia{Assume we have temp correct code and in the future ppl want to change it. Or generating VC for verification. Moivation is to support code maintenance. Initial code working. Take a snapshot and develop invs. Eaiser for future changes. }

% \dave{But developers often don't write class invariants, because it takes more time and thought.  AI-based automatic generation of class invariants has the potential to reduce this cost, making the use of class invariants routine, which will in turn increase the effectiveness of testing. [Maybe other advantages and applications? Formal verification? Guiding test generation (manual or automatic)?]}

% --------- old intro bits (keeping them only for the comments) -----------

% Corrupt data structures cause incorrect or unpredictable program execution. Data structure repair updates corrupt data structures, enabling the program to continue to execute acceptably. However, it requires user-provided specifications.
% Daikon~\cite{ernst2007daikon}’s output has permitted the entire process to be automated while permitting human review. The result was more robust versions of programs, at reduced cost compared to manual generation [2].
% A related use is in programs that dynamically choose modalities in response to environmental inputs. Dynamically detected invariants from training runs can indicate which modalities are correlated with which external conditions.
% At runtime, occasionally overriding the program’s (inconsistent) actions improved the performance of programs in a real-world competition by over $50\%$ [27].

% The essential idea is to use a generate-and-check algorithm to test a set of potential invariants against the traces.
% Daikon reports those invariants that are tested to a sufficient degree without falsification. The accuracy of the results depends in part on the quality and completeness of the test cases. 
% Even modest test suites produce good results in practice [25,24], and techniques exist for creating good test suites for invariant detection [21,28,22].}


% \dave{Can you briefly summarize here, or early in the paper, why you think LLMs are better than other methods?  My guess is: Many methods work by proposing invariants and then filtering out the invalid ones based on tests.  Previous methods propose many trivial invariants, which has two problems.  First, there are so many combinations of trivial invariants that the really valuable ones never get generated.  Second, if there are too many trivial invariants, some are more likely to escape testing (which results in invalid invariants or requires more extensive testing). Third, correct but uninteresting invariants can clutter code, hampering readability, debugging, and test run times. On the other hand, LLMs can generate more interesting invariants before filtering with tests. LLM results are based on training on similar data structures, where there may be formal invariants or comments by developers explaining the reasoning behind the code, so they generate invariants similar to those that developers thought were important, and generate fewer spurious invariants that may slip through testing. }

% Instead of falsifying invariants produced by pre-set patterns, we determine likely program invariants by combining the concrete execution of actual test cases with a simultaneous symbolic execution of the same tests. The symbolic execution produces abstract conditions over program variables that the concrete tests satisfy during their execution.

%Assuming we have the correct source code, how do we infer high-quality invariants from it?


