\appendix

\section{Limitations}
\label{appendix:limitations}
\textbf{Design of the harder social scenarios} \quad The harder social scenarios were manually crafted based on the previously sampled set of scenarios. This method has an obvious limitation: it requires human intervention and is therefore not scalable. Future work could explore ways to automate this process.

\textbf{Potential social biases in the environments} \quad We utilise various LLMs, such as GPT-4, Gemini-1.5 and Llama-3.1, both to simulate human interactions and to evaluate the resulting conversations. These LLMs may contain social biases and stereotypes, which would then be reflected in the interactions as well as in the evaluation scores in \lifelongsotopia.

\section{Ethical Statement}
\label{appendix:ethical}
Attributing human traits to AI systems can lead to anthropomorphizing them, which could not only create unrealistic expectations but also enable manipulation and cause negative consequences \citep{deshpande2023anthropomorphizationaiopportunitiesrisks}. In \lifelongsotopia, the AI agents do not maintain a consistent human identity but instead are made to role-play different characters across various scenarios. This role-playing approach helps prevent the development of consistent human-like personalities in AI, thereby reducing the risk of anthropomorphism \citep{shanahan2023roleplaylargelanguagemodels}. The main objective of \lifelongsotopia is to evaluate the social intelligence of language agents over lifelong social interactions, and in no way do we intend to create AI agents that are similar to humans or to cause any potential global risks \citep{inbook}. Enhancing these language models with greater social intelligence may enable social manipulation. It is critical to note that we do not endorse the use of \lifelongsotopia to create manipulative agents.

\section{Prompt Details}
\label{appendix:prompts}
In this section, we provide the prompts utilised for various purposes in \lifelongsotopia. 

\subsection{Prompts for evaluation}
\label{appendox:prompts:evaluation}
The following prompts are used to calculate scores on the three main dimensions on which we evaluate the language agents, i.e. \believabilityFull, \goalcompletionFull, and \believabilityextendedFull. These are the prompts fed to the evaluator models. All results in the main paper use GPT-4 as the evaluator. Results with Llama-3.1 as the evaluator are presented in Appendix \S \ref{appendix:llama}.

\believability
\begin{quote}\tt
    Reasoning requirement: 1. Evaluate if the agent interacts with others in a natural and realistic manner (here are a few common questions to check: a. whether the agent is confusing with its own identity? b. whether the agent repeats others' words/actions without any reason? c. whether the agent is being overly polite considering the context?). Start the analysis with tag <naturalness> 2. Analyze whether the actions of the agent align with their character traits (e.g., personality, values, and etc.). Start the analysis with tag <consistency>. Output your reasoning process to the `reasoning' field. Output an integer score ranging from 0 and 10 in the 'score' field. A higher score indicates that the agent is more believable.
\end{quote}
\goalcompletion 
\begin{quote}\tt
Please first reiterate agent's social goals. And then please provide a comprehensive analysis about the extent to which the agent has managed to achieve these goals. In the `reasoning' field, provide a comprehensive account of the logic or thought process that led you to your conclusion. Further, provide an integer score ranging from 0 and 10 in the `score' field. 0 represents minimal goals achievement, 10 represents complete goal achievement, and a higher score indicates that the agent is making progress towards their social goals.
\end{quote}
\believabilityextended
\begin{quote}\tt
    Given the following checklist, please evaluate the conversation of the agent on each of the checkpoints. The checklist is as follows: checkpoint 1: There should be no repetition of sentences by the agent in the conversation. The agent fails on this checkpoint (score = 0) if there are instances in the conversation where the agent repeats the same sentence (the sentences dont necessarily have to match word for word, pay attention to what the gist of the sentence was) or expresses the same sentiment again and again. This could happen over 2-3 or even more turns. For example an agent saying 'Yes! I cannot wait to do this!' and then saying 'That's amazing! I am looking forward to doing this with you' in successive turns is a case of repetition. There could be other similar cases, make sure to identify them. checkpoint 2: The agent is consistent with their character traits provided at the start of the episode. They should also not confuse their identity with that of the other agent. checkpoint 3: The conversation aligns with the goals of the agent in the scenario. The conversation should be focussed on achieving these social goals. The agent should also not confuse their own goals with those of the other agent. checkpoint 4: The agent does not continue the conversation unnecesarily and leaves promptly after their goal resolution. This is indicated at the end of the conversation by '[Agent Name] left the conversation. If the agent continued to converse for several turns even though they had already achieved their goal, then this should be marked as 1. checkpoint 5: The agent does not repeat their exact goals as sentences in the conversation thus displaying realism in their speech. For this you need to compare their goals in the scenario and their conversation and evaluate if they exactly repeat the sentences or not. checkpoint 6: The agent does not stall in a conversation without completing their goals i.e. there are no 'do nothing' actions for multiple turns. 
checkpoint 7: The agent responses are directly in response to the other agent's dialogue. checkpoint 8: The beginning of the conversation is not abrupt and related to the current scenario. Output a list of integers in the 'score' field. Each item in the list is a score for that particular checkpoint. For example, the 1st item is for 'checkpoint 1', 2nd item is for 'checkpoint 2', and so on. In total the length of the list will be 8 for the 8 checkpoints. Each item in the list of scores is a binary integer score of 0 or 1: '0' if the agent fails on that checkpoint i.e. the conversation does not match the checkpoint's requirements and '1' if the agent passes the checkpoint i.e the conversation matches the checkpoint's requirements.
\end{quote}

\subsection{Prompt for generating Scenarios}
\label{appendix:prompts:envprofiles}

The following prompt is used to generate new scenarios, with past datapoints from the \sotopia database serving as few-shot examples.

\begin{quote}\tt
    Please generate scenarios and goals based on the examples below as well as the inspirational prompt, when creating the goals, try to find one point that both sides may not agree upon initially and need to collaboratively resolve it. 
    Inspirational prompt: <the selected vignette>
    Examples: <5 examples from \sotopia>
\end{quote}
The inspirational prompt is chosen in the same way as done in the \sotopia paper.

\subsection{Prompt for generating a summary of the episode}
\label{appendix:prompts:summary}
Following is the prompt for generating a summary of the episode. When implementing the advanced memory module, these generated summaries are provided as memory of each episode, rather than the entire interaction.
\input{fig_tab_alg/summary/fig_prompt}
\clearpage
\section{Qualitative examples from \lifelongsotopia}
\label{appendix:qual}

\subsection{\believabilityextended checkpoints and failure cases of language agents}
\label{appendix:qual_belext}

In this section, we provide episodes generated during our experiments which serve two purposes: (A) They show cases where GPT-4 initially failed as an evaluator for \believability, and were thus used to build the checklist in \believabilityextended. (B) They also showcase examples where the language agents fail at using past information to achieve their social goals, displaying inconsistency and a lack of social intelligence.

\input{fig_tab_alg/bel_extended/fig_checkpoints}
\clearpage

\subsection{Human Performance in \lifelongsotopia}
\label{appendix:qual:human}
In this section, we provide examples of how humans were able to make better use of their memory of past interactions to achieve their future social goals.

\input{fig_tab_alg/human_qual/simple}

\clearpage


\section{Harder hand-crafted scenarios}
\label{appendix:hard_scenarios}
In this section, we give details on the harder social scenarios that we craft manually. These require the characters to have an explicit understanding of their previous interactions. They not only test the memory of the language agents by expecting them to recall a past interaction they had with the other character, but also require them to use negotiation strategies or information about the other character learnt in the past to fully achieve their social goals. Furthermore, we explain how humans were able to maintain their goal completion scores on these scenarios by employing better techniques and strategies, which the LLM-based agents could not.

\subsection{Details about the scenarios}
\label{appendix:hard_scenarios:details}
\input{fig_tab_alg/new_scenarios/main}

\clearpage

\subsection{Performance of humans on these harder scenarios}
\label{appendix:hard_scenarios:humans}
\input{fig_tab_alg/human_qual/hard}

\section{Llama-3.1 as the evaluator}
\label{appendix:llama}
\input{fig_tab_alg/llama_eval/main}

We also evaluated the use of Llama-3.1 as an evaluator for the generated episodes to determine if it could replace GPT-4. As shown in Figure \ref{fig:llama:main}, Llama-3.1 struggles to effectively differentiate between successful and unsuccessful language agent performances for both \believability and \goalcompletion. Consequently, Llama-3.1 is unsuitable for use as an evaluator, and we retain GPT-4 for our main experiments.


\review{
\section{Performance of Models Without Memory in Harder Scenarios}
In this section, we evaluate how a model performs in the \textbf{harder scenarios} of \lifelongsotopia when it is not provided with any memory of past interactions. For this analysis, we use \textbf{GPT-4o} as the base LLM. Each of the five handcrafted harder scenarios is tested over 10 iterations, and the model's performance is evaluated using \believabilityFull and \goalcompletionFull scores. The results, summarized in Table \ref{tab:no_memory}, show a noticeable drop in both \believability and \goalcompletion for the memory-less model compared to memory-equipped models. This is because the harder scenarios are explicitly conditioned on prior episodes in the \lifelongsotopia dataset and require the agent to use information from past interactions effectively to achieve its goals. Without access to this memory, GPT-4o struggles to leverage context from prior episodes.
}
\input{fig_tab_alg/no_memory/tab}

\review{\section{Ablation Study: Impact of Different Components of the Memory Module}}
\label{appendix:summary_ablation}

\review{We conducted an ablation study to analyze how the performance of the advanced memory module is influenced by variations in two key components: the length of the episode summary and the inclusion or exclusion of specific aspects in the memory summaries. Below, we detail the results for each of these ablations.}

\review{\subsection{Length}}
\label{appendix:summary_ablation:length}

\review{To evaluate the effect of summary length on model performance, we experimented with three configurations: very short summaries (50 words), medium-length summaries (300 words), and long summaries (1000 words). The results are presented in Figure~\ref{fig:summary:length}.}

\review{The performance trends reveal that medium-length summaries (300 words) yield the best results across both \believability and \goalcompletion metrics. Very short summaries often miss critical details, leading to a drop in both consistency and goal completion. Conversely, long summaries introduce irrelevant or redundant information, which also degrades performance. As the plot shows, the degradation with very long summaries is considerably larger than with very short ones. These results underscore the importance of a balanced summary length in maintaining both consistency and goal-directed behavior.}

\input{fig_tab_alg/summary/length}

\review{\subsection{Aspects}}
\label{subsec:aspects}

\review{To investigate the role of different aspects in the memory summaries, we performed an ablation study by systematically removing each of the three components: (1) the episode overview, (2) negotiation strategies, and (3) information about the other character. Figure~\ref{fig:summary:aspects} summarizes the results of this study.}

\review{The ablations demonstrate that excluding any one of the three aspects results in a noticeable drop in performance. On the other hand, when all three aspects are included, the agents achieve their best performance, reinforcing the design choice of incorporating these specific elements into the memory module.}

\input{fig_tab_alg/summary/aspects}


\review{
\section{Evaluation Results on All Dimensions in \sotopiaeval}
}
\label{appendix: all_dims}

\review{
As mentioned in Section \ref{sec:background:sotopia}, the authors of the \sotopia paper propose a seven-dimensional framework to comprehensively evaluate the social interactions of agents. The seven dimensions are as follows:

\believabilityFull (\believability) [0-10]: It focuses on the extent to which the character’s behavior is perceived as natural, realistic, and aligned with their profile, thus simulating believable proxies of human behavior.

\goalcompletionFull (\goalcompletion) [0-10]: This evaluates the extent to which the character achieved their goals defined in the environment. 

\knowledgeFull (\knowledge) [0-10]: This dimension assesses the agent's ability to actively acquire new information during interactions.

\secretFull (\secret) [-10-0]: This captures the agent’s capability to keep private or secretive information hidden during social interactions.

\relationshipFull (\relationship) [-5-5]: This evaluates whether the relationship between the agents improves or deteriorates after each interaction, reflecting social bonding or conflict.

\socialrulesFull (\socialrules) [-10-0]: This dimension measures adherence to social norms and legal rules.

\financialbenefitsFull (\financialbenefits) [-5-5]: This examines whether the agents achieve financial or material advantages in the short or long term.

Finally, an additional \textbf{Overall Score} is introduced, which is simply the mean of the scores on the seven dimensions. This provides a single metric for evaluating general social intelligence. In \sotopia \citep{zhou2023sotopia}, the authors show how this multi-dimensional framework is able to comprehensively evaluate social intelligence.
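
As a worked form of this definition, the Overall Score can be written as below. The symbols $\mathcal{D}$ (the set of the seven dimensions listed above) and $s_d$ (the score on dimension $d$) are notation introduced here purely for illustration:
\[
  \mathrm{Overall} \;=\; \frac{1}{7} \sum_{d \in \mathcal{D}} s_d .
\]
Assuming the raw scores are averaged without any rescaling, the ranges listed above imply that the Overall Score lies between $-30/7 \approx -4.29$ (every dimension at its minimum) and $40/7 \approx 5.71$ (every dimension at its maximum).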

In \lifelongsotopia, while we evaluate the agents on all seven dimensions, we focus only on \believability and \goalcompletion, as they give us the most insight into how these models behave over lifelong social interactions. As can be seen in Figure \ref{fig:other_dims_combined}, the remaining dimensions show little variance over lifelong interactions. Hence, they tell us little about differences in agent performance across episodes.
}


\input{fig_tab_alg/other_dims/main}