\section{Related Work}
\label{sec:related_work}

\textbf{Social Intelligence in LLMs}\quad Social intelligence refers to the capacity to effectively navigate and manage social interactions and includes key competencies such as social perception, social knowledge, social memory, social reasoning, social creativity, and social interaction~\citep{mathur2024advancingsocialintelligenceai}.

Evaluating social intelligence in large language models (LLMs) presents unique challenges. Most evaluations have concentrated on isolated tasks that assess logic, problem-solving, or academic intelligence, while overlooking real-world social dynamics~\citep{xu2024academicallyintelligentllmsnecessarily}.

Recent studies have begun to assess social intelligence in LLMs through various methods. For instance, EmoBench~\citep{sabour2024emobenchevaluatingemotionalintelligence} introduced a benchmark to evaluate emotional intelligence in LLMs, focusing on emotional understanding and application. Their results revealed that while LLMs can apply emotional concepts, they struggle significantly with emotional understanding, indicating a gap between current LLM capabilities and average human performance in this area. Similarly, InterIntent~\citep{liu2024interintentinvestigatingsocialintelligence} used social deduction games to assess how well LLMs comprehend and manage player intentions in dynamic, interactive contexts. Furthermore, SocialBench~\citep{chen2024socialbenchsocialityevaluationroleplaying} introduced a benchmark for role-playing agents to assess sociality at both individual and group interaction levels.

However, there has been little to no exploration of how LLMs manage long-term social interactions that unfold over extended contexts, such as those lasting hours, days, or even longer~\citep{mathur2024advancingsocialintelligenceai}. Our work seeks to address this gap by specifically evaluating the social intelligence of language models over long contexts using multi-episode chaining in the \sotopia environment.

\textbf{Evaluation of Long-context LLMs}\quad Recent techniques have extended the context length of LLMs from the standard 4096 tokens to 128k or even 1M tokens~\citep{dao2022flashattentionfastmemoryefficientexact, lou2024sparserfastermoreefficient, xiao2024efficientstreaminglanguagemodels, liu2024worldmodelmillionlengthvideo}. Evaluating these systems presents a unique challenge because manually annotating outputs for such long inputs is difficult. Several benchmarks, including Long-Range Arena~\citep{tay2020longrangearenabenchmark}, LongBench~\citep{bai2023longbench}, and L-Eval~\citep{an2023levalinstitutingstandardizedevaluation}, have emerged to address this issue.

Despite these improvements, studies reveal that long-context LLMs still struggle with certain tasks. For example, Lost in the Middle~\citep{liu2024lost} showed that these models often miss key information buried in the middle of long inputs. Similarly, LongICLBench~\citep{li2024longcontextllmsstrugglelong} demonstrated that models face challenges in handling long in-context learning tasks. RULER~\citep{hsieh2024rulerwhatsrealcontext} introduced a variant of the Needle in a Haystack test~\citep{gka23}, revealing performance declines for very long contexts.

\textbf{Lifelong ML}\quad Lifelong learning, also called continual learning, is an ML paradigm that aims to replicate the human ability to learn and accumulate knowledge over time without forgetting previously learned information, while also using past knowledge to enhance the learning of new tasks with minimal effort~\citep{ke2023continuallearningnaturallanguage}. A lifelong learning system can continuously learn numerous tasks from multiple domains throughout its lifetime. Consequently, such a system is capable of both retaining past information and using the acquired knowledge to support the learning of new tasks~\citep{chen2018lifelong}. Our benchmark, \lifelongsotopia, is designed to evaluate the social intelligence of state-of-the-art LLM-based agents and assess their performance in long-term or lifelong social interactions.