Mimicking Evolution with Reinforcement Learning

João Abrantes; James John Butterworth; Arnaldo J. Abrantes; Frans A Oliehoek

21 May 2021 (modified: 05 May 2023), NeurIPS 2021 Submitted, Readers: Everyone

Keywords: reinforcement learning, evolution, multi-agent, alife, open-endedness

TL;DR: We propose a reward function that is aligned with the fitness function. This means that when an agent is learning to maximise this reward, it is also learning to maximise the survival and reproduction of its genes.

Abstract: In nature, there are two processes driving the development of the brain: evolution and learning. Evolution acts slowly, across generations, and, amongst other things, it defines what agents learn by changing their internal reward function. Learning acts fast, within one’s lifetime, and it quickly updates agents’ policies to maximise the evolved reward function. Although previous work has emulated both of these processes working in tandem, optimising the reward function to serve the aims of the evolutionary process is very computationally expensive. This work proposes a fixed reward function, the evolutionary reward, that aims to maximise the number of current (and future) genetically similar agents. Furthermore, we propose a way to approximate the joint action value by averaging the action values of other agents weighted by their genetic similarity. In a finite environment with limited resources, this technique drives improved survival mechanisms and reproductive success. Because this reward function is fixed, we avoid the computationally intense process of optimising it. We demonstrate the viability of our evolutionary reward by testing it in two bio-inspired, open-ended environments and monitoring a number of metrics such as population size and life expectancy. We compare our technique with a state-of-the-art evolutionary algorithm, CMA-ES, and show that our approach is superior at producing agents that maximise the number of their genes across time.
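
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalisation by the sum of the weights, the zero-weight fallback, and the use of a similarity coefficient in [0, 1] are all assumptions made here for illustration, since the abstract does not specify them.

```python
from typing import Sequence


def kinship_weighted_value(action_values: Sequence[float],
                           kinship: Sequence[float]) -> float:
    """Average other agents' action values, weighted by genetic similarity.

    action_values[j] is agent j's action value; kinship[j] is the
    (hypothetical) similarity coefficient between the focal agent and
    agent j. Normalising by the total weight is an assumption.
    """
    if len(action_values) != len(kinship):
        raise ValueError("need one kinship coefficient per action value")
    total_weight = sum(kinship)
    if total_weight == 0:
        # No genetically similar agents: return 0.0 (assumed convention).
        return 0.0
    weighted_sum = sum(k * q for k, q in zip(kinship, action_values))
    return weighted_sum / total_weight


def kin_count(kinship_to_alive: Sequence[float]) -> float:
    """Similarity-weighted number of living agents.

    A stand-in for the quantity the evolutionary reward aims to
    maximise, assuming one similarity coefficient per living agent.
    """
    return sum(kinship_to_alive)
```

For example, `kinship_weighted_value([2.0, 4.0], [1.0, 0.0])` ignores the unrelated second agent and returns the first agent's value, while `kin_count([1.0, 0.5, 0.0])` counts a half-related agent as half an individual.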

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: pdf
