\section{Conclusion}

In this paper, we propose \lifelongsotopia, a benchmark for evaluating the social intelligence of LLM-based agents over lifelong social interactions. We find that when language agents are given their entire interaction history as memory, both their believability and their goal completion decline consistently, indicating inconsistent behavior and a lack of long-term social intelligence. Although the agents perform significantly better when equipped with a more advanced memory method, their goal completion still declines steeply on harder social scenarios that require explicit use of knowledge gained from previous interactions. In contrast, humans maintain their performance throughout, employing various techniques to do so. This suggests a significant gap between the social abilities of humans and those of current state-of-the-art LLMs, highlighting the need for further research to improve the social intelligence of these models. Our findings also demonstrate that \lifelongsotopia provides a robust platform for evaluating language agents over long-term social interactions. The limitations and ethical considerations of our work are discussed in Appendix \S\ref{appendix:limitations} and \S\ref{appendix:ethical}, respectively.