Original article can be found here (source): Artificial Intelligence on Medium

The ReinLife environment is built on a NumPy matrix of size n × m, where each cell is rendered at 24 by 24 pixels. Each location within the matrix can be occupied by at most one entity.
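A minimal sketch of that grid representation, assuming entities are stored as integer ids with 0 meaning "empty" (the ids, function name, and grid dimensions below are illustrative, not ReinLife's actual API):

```python
import numpy as np

# Assumed layout: an n x m matrix where each cell holds at most one
# entity id; 0 denotes an empty cell.
n, m = 30, 60
grid = np.zeros((n, m), dtype=int)

def place(grid, pos, entity_id):
    """Place an entity only if the target cell is unoccupied."""
    if grid[pos] == 0:
        grid[pos] = entity_id
        return True
    return False

place(grid, (5, 5), 1)   # succeeds: the cell was empty
place(grid, (5, 5), 2)   # fails: the cell is already occupied
```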


Agents are entities or organisms in the simulation that can move, attack, reproduce and act independently.

The Agents

Each agent has the following characteristics:

Health — Starts at 200 and decreases by 10 each step

Age — Increases by 1 each step and cannot exceed 50

Gene — Represents kinship with other agents


An agent can occupy any unoccupied space and, from that position, can move up, down, left, or right; entities cannot move diagonally. The environment has no walls, which means that an entity moving left from the leftmost column of the NumPy matrix reappears in the rightmost column. In other words, the environment is a fully-connected (toroidal) world.
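The wrap-around movement above can be sketched with a modulo operation (a minimal illustration; the action names are assumptions):

```python
def move(pos, action, n, m):
    """Move one cell on a toroidal (wrap-around) n x m grid.

    Only the four cardinal directions are allowed; no diagonals.
    """
    deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = deltas[action]
    r, c = pos
    # The modulo wraps an out-of-bounds coordinate to the opposite edge.
    return ((r + dr) % n, (c + dc) % m)

# Moving left from the leftmost column wraps to the rightmost column:
move((0, 0), "left", 10, 10)  # -> (0, 9)
```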


An agent can attack in one of four directions:

An attacking agent stands still for that step. However, since attacking is resolved before anything else, the targeted agent cannot move away. When an agent successfully attacks another agent, the target dies and the attacker gains health. Moreover, a successful attacker's border turns red.
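A rough sketch of that order of operations, under stated assumptions: attacks resolve before movement, the attacker stands still, and the health bonus is an assumed constant (the article does not give the amount):

```python
# Assumption: the attacker's health gain per kill is not specified in
# the article; 100 is a placeholder value.
ATTACK_HEALTH_GAIN = 100

def resolve_turn(agents, attacks, moves):
    """agents: {id: {"health": int, "pos": (r, c)}};
    attacks: {attacker_id: target_id}; moves: {id: new_pos}."""
    # 1. Attacks resolve first: the target dies, the attacker gains health.
    for attacker, target in attacks.items():
        if attacker in agents and target in agents:
            agents[attacker]["health"] += ATTACK_HEALTH_GAIN
            del agents[target]
    # 2. Movement happens afterwards; attackers stood still this turn,
    #    and dead agents no longer move.
    for agent_id, new_pos in moves.items():
        if agent_id in agents and agent_id not in attacks:
            agents[agent_id]["pos"] = new_pos
    return agents
```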


When a new entity is reproduced by existing agents, it inherits its brain from its parents.

When a new entity is produced (spawned by the environment), it instead inherits its brain from one of the best agents seen so far. A list of the 10 best agents is tracked throughout the simulation.
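One way to track such a top-10 list is a bounded min-heap. This is a sketch of the idea, not ReinLife's actual implementation; the class and method names are assumptions:

```python
import heapq
import random

class BestAgentTracker:
    """Keep the `capacity` highest-fitness brains seen so far."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.best = []       # min-heap of (fitness, counter, brain)
        self._counter = 0    # tie-breaker so brains are never compared

    def update(self, fitness, brain):
        entry = (fitness, self._counter, brain)
        self._counter += 1
        if len(self.best) < self.capacity:
            heapq.heappush(self.best, entry)
        elif fitness > self.best[0][0]:
            # Replace the current worst of the best.
            heapq.heapreplace(self.best, entry)

    def sample_brain(self):
        """A newly produced entity copies a brain from the best list."""
        return random.choice(self.best)[2]
```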


The field of view of each agent is a square surrounding the agent. Since the world is fully-connected, the agent's view wraps across the edges of the grid, so it can effectively see "through" them.

The input for the neural network can be seen in the image below:

There are three grids of 7×7 (the example shows 5×5), each of which captures a specific observation of the environment. Together with 6 additional values, this gives 3 × (7 × 7) + 6 = 153 input values in total.
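Assembling that input vector amounts to flattening and concatenating the grids and the scalar features. A minimal sketch, assuming the three grid channels and the six scalars hold agent observations (which observations go in which channel is not specified here):

```python
import numpy as np

fov = 7  # the agent's 7x7 field of view

# Three observation channels (contents assumed, e.g. health, kinship,
# entity type) plus 6 scalar features about the agent itself.
grids = [np.zeros((fov, fov)) for _ in range(3)]
scalars = np.zeros(6)

# Flatten each 7x7 grid and append the scalars: 3 * 49 + 6 = 153 values.
observation = np.concatenate([g.ravel() for g in grids] + [scalars])
observation.shape  # -> (153,)
```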


The reward structure is tricky, as you want to minimize how much you steer the entity towards certain behavior. For that reason, I've adopted a simple and straightforward fitness measure, namely:

$$ r_i^t = \frac{1}{n} \sum_{j=1}^{n} \delta_{g_i g_j} $$

where $r_i^t$ is the reward given to agent i at time t. The δ is the Kronecker delta, which is one if the gene of agent i, $g_i$, equals the gene of agent j, $g_j$, and zero otherwise, and n is the total number of agents alive at time t. Thus, the reward essentially counts how many living agents share a gene with agent i at time t and divides by the total number of agents alive.

Thus, an agent’s behavior is only steered towards making sure its gene lives on for as long as possible.
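The fitness measure described above reduces to a few lines of NumPy (a sketch; `genes` is an assumed array of the gene of every living agent):

```python
import numpy as np

def reward(genes, i):
    """Fraction of living agents sharing agent i's gene (including i).

    genes: sequence with one gene per living agent; i: agent index.
    """
    genes = np.asarray(genes)
    # The comparison plays the role of the Kronecker delta.
    return np.sum(genes == genes[i]) / len(genes)

reward([1, 1, 2, 2, 2], 0)  # -> 0.4 (two of five agents carry gene 1)
```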


Currently, the following algorithms are implemented that can be used as brains:

  • Deep Q Network (DQN)
  • Prioritized Experience Replay Deep Q Network (PER-DQN)
  • Double Dueling Deep Q Network (D3QN)
  • Prioritized Experience Replay Double Dueling Deep Q Network (PER-D3QN)
  • Proximal Policy Optimization (PPO)

3. Results

The animation at the top of the page is an example of learned behavior. The ReinLife package allows you to choose one or more of the algorithms above and have them battle it out.

To give you an example of how that might look, I took two copies of a PER-D3QN brain and let them battle it out against each other. In this setup there are only two genes (two colors), and each gene is supplied with its own brain.