\section{Experimental Setting}
\label{sec:experimental_setting}

\textbf{LLMs Used} \quad To test the social intelligence of models over lifelong interactions, we select LLMs that can handle very long inputs: \textbf{Gemini-1.5}~\citep{geminiteam2023gemini}, \textbf{GPT-4o}~\citep{openai2024gpt4}, and \textbf{Llama-3.1}~\citep{dubey2024llama3herdmodels}. Gemini-1.5 accepts up to 1 million input tokens, while GPT-4o and Llama-3.1 both support context lengths of up to 128k tokens. These capacities are sufficient for our experiments.

\textbf{Evaluation} \quad As described in Section~\ref{sec:framework:evaluation}, we use the \believability and \goalcompletion dimensions from \sotopiaeval, along with a third dimension, \believabilityextended, which aids the evaluation of \believability scores. We track the performance of each language model on these dimensions over time, for both sets of memory modules. The scores are compared against a human baseline in which human participants interact with an LLM-based character. GPT-4~\citep{openai2024gpt4} serves as the primary evaluator model. We also ran experiments with Llama-3.1~\citep{dubey2024llama3herdmodels} as the evaluator; these results are reported in Appendix~\ref{appendix:llama}.

