\section{Introduction}
\label{sec:introduction}

Social interactions occur when two or more individuals (or agents) engage with one another, with each person's behavior being influenced by the actions of others \citep{REIS1991269, Turner1988}. These interactions are a fundamental part of human life, as people continuously teach, learn, and converse with others throughout their lifetimes~\citep{HARI2015181}. During such exchanges, individuals analyze the behavior of others, make inferences about their personalities, anticipate actions, and adjust their own behavior accordingly~\citep{German2020, pianesi_personality_2008}. This capacity to understand others' behavior, interpret their thoughts and feelings, and adapt one's own actions is known as \textit{social intelligence} \citep{marius_intelligence_2022, zhou2023sotopia, li2024socialintelligencedatainfrastructure}. People with high social intelligence are skilled at managing these interactions, particularly because they refine how they communicate by \emph{gaining more information} about the people they are interacting with. This allows them to achieve their desired outcomes in various social situations \citep{Holloway2020}.

Recent literature focuses on developing socially intelligent large language model (LLM)-based agents that can navigate social situations with human-like decision-making abilities~\citep{mathur2024advancingsocialintelligenceai, wang2024objectivelybenchmarkingsocialintelligence, park2023generative,zhou2023sotopia,wang2024sotopiapiinteractivelearningsocially}. Evaluating these agents has also been a major area of interest, with methods ranging from static text benchmarks \citep{sap2019socialiqacommonsensereasoningsocial, le-etal-2019-revisiting} and static video benchmarks \citep{Zadeh_2019_CVPR} to dynamic environments \citep{zhang2024buildingcooperativeembodiedagents, zhou2023sotopia}. However, a defining feature of human social interactions is their dynamic and lifelong nature: individuals' social goals change continuously, and they keep gathering new information about others and adjusting their behavior accordingly. Doing so requires reasoning about past interactions and adapting one's responses, an ability that would also help build a rich common ground between users and AI agents.

Whether language agents can navigate social scenarios and challenges over long time periods remains an open question. To address this gap, we introduce the \lifelongsotopia benchmark (Figure~\ref{fig:lifelongsotopia}), designed to evaluate language agents over lifelong social interactions.

\lifelongsotopia simulates the interaction between pairs of characters over multiple \emph{episodes}. In each episode, the two agents role-playing the characters are assigned private social goals and a shared social context. After each episode, the two agents are evaluated on their believability and on whether they have achieved their social goals. To simulate lifelong interactions, we sample multiple episodes sequentially between two characters while providing them with a memory of their past interactions as context. Scenarios for these episodes are generated using GPT-4 (\S \ref{sec:framework}). The characters are role-played by LLM-based agents, including GPT-4o \citep{openai2024gpt4}, Gemini-1.5 \citep{geminiteam2023gemini}, and Llama-3.1 \citep{dubey2024llama3herdmodels}, as well as by humans to establish a baseline for ideal performance. We analyze the \believabilityFull (how believable the character's conversations are) and \goalcompletionFull (how successful the agent is at achieving its social goal) scores over time as the characters progress through episodes and their context increases.

The closest work to ours is Generative Agents \citep{park2023generative}, which demonstrates how LLMs and computational interactive agents can be combined to enable believable proxies of human behavior. Their evaluation shows that these agents produce credible individual and emergent social behaviors. However, the work mainly focuses on showcasing the ability of LLMs to simulate social interactions rather than developing a systematic evaluation framework for these simulated interactions~\citep{zhou2023sotopia}. In contrast, our work focuses on benchmarking the social intelligence of language agents. We do this by analyzing their scores on the \believability and \goalcompletion dimensions (\S \ref{sec:framework:evaluation}) and by providing insights into how these agents compare to humans.

Using our method to simulate lifelong social interactions, we aim to answer the following research questions:

\textbf{RQ1} (\emph{Consistency}): Can the models maintain consistency over long-term social interactions, staying true to their character?

\textbf{RQ2} (\emph{Social Intelligence}): Are the models capable of using information from previous episodes to optimize their goals in the current interaction, thus mimicking human behavior?

\textbf{RQ3} (\emph{Memory Utilization}): Does equipping the models with an advanced memory improve their performance, and can they maintain this performance in harder social scenarios that require explicit use of memory?

We conduct two sets of experiments that vary the memory of previous episodes provided to the language agents. In the first approach, the entire prior interaction is provided as memory. In the second, a more advanced memory approach is implemented, where only specific knowledge gained in an episode (such as new strategies learned or information gained about the other character) is retained, while the rest of the conversation is filtered out to make the reasoning process easier for the language agents. Additionally, we test this advanced memory approach with hand-crafted scenarios, which are a more challenging version of the previously sampled scenarios. These scenarios require an explicit understanding of past conversations, allowing us to evaluate whether the language agents can match human performance.

For \textbf{RQ1}, our findings indicate that model consistency declines when the entire interaction history is used as memory. Regarding \textbf{RQ2}, the declining trend in \goalcompletion with this simple memory module suggests that these language agents lack social intelligence, whereas humans consistently perform well across both dimensions. In response to \textbf{RQ3}, model performance improves significantly with the advanced memory module. When tested on the harder scenarios, the agents maintain their consistency, but their performance on \goalcompletion declines significantly. Such a trend highlights that these models fall short of humans in social intelligence and in using past memories to achieve their social goals effectively.
\input{fig_tab_alg/lifelongsotopia}