In the first tutorial, I introduced Q-learning, the most basic reinforcement learning method, to solve the CartPole problem. Because of its computational limitations, it works only in simple environments, where the number of states and possible actions is relatively small. Calculating, storing, and updating Q-values for every state-action pair in a more complex environment is either impossible or highly inefficient. This is where the Deep Q-Network comes into play.

Background Information

Deep Q-Learning was introduced in 2013 in the paper Playing Atari with Deep Reinforcement Learning by the DeepMind team. An earlier, similar approach was TD-Gammon in 1992. That algorithm reached a superhuman level of play in backgammon, but the method did not carry over to games like chess, Go, or checkers. DeepMind was able to surpass human performance in 3 out of 7 Atari games, using raw images as input and the same hyperparameters for all games. This was a breakthrough in the area of more general learning.

The basic idea of DQN is to combine Q-learning with deep learning. We get rid of the Q-table and use a neural network instead to approximate the action-value function Q(s, a). The state is passed to the network, and as output we receive the estimated Q-values for each action.

DQN Architecture

To train the network, we need a target value, also known as the ground truth. The question is: how do we evaluate the loss function without a labeled dataset?

Well, we create the target values on the fly using the Bellman equation.

Bellman equation and Loss function L
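
In the usual textbook notation, which is assumed here, the target y and the loss L can be written as below. The symbols θ (weights of the network being trained), θ⁻ (weights used to compute the target) and γ (discount factor) are notation I am assuming; with a single network, θ⁻ is simply θ, and for a terminal state the target reduces to y = r.

```latex
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),
\qquad
L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[ \big( y - Q(s, a; \theta) \big)^{2} \right]
```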

This method is called bootstrapping: we are trying to estimate something based on another estimate. Essentially, we estimate the current action value Q(s, a) by using an estimate of the future value Q(s’, a’). The problem arises when one network is used to predict both values. It is similar to a dog chasing its own tail. The weights are updated to move the predictions closer to the target Q-values, but the target values move as well, because they come from the same network.

The solution was presented in the DeepMind paper Human-level control through deep reinforcement learning. The idea is to use a separate network to predict the target values. Every C time steps, the weights of the policy network are copied to the target network. This makes the algorithm more stable, since the policy network is no longer chasing a target that shifts with every update.
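
A minimal sketch of that idea, assuming the weights are plain Python lists so the example stays dependency-free (a real implementation would copy the framework's weight tensors). The names `maybe_sync_target`, `td_target` and `sync_every` (standing in for C) are hypothetical and not taken from the original code.

```python
import copy


def maybe_sync_target(step, policy_weights, target_weights, sync_every):
    """Return the weights the target network should hold after `step`.

    Every `sync_every` steps the policy weights are copied over;
    otherwise the target weights are returned unchanged.
    """
    if step % sync_every == 0:
        return copy.deepcopy(policy_weights)
    return target_weights


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target r + gamma * max_a' Q_target(s', a'); just r if terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Toy usage: the "networks" are just lists of floats.
policy = [0.5, -0.2]
target = [0.0, 0.0]
for step in range(1, 11):
    policy = [w + 0.1 for w in policy]  # stand-in for one gradient step
    target = maybe_sync_target(step, policy, target, sync_every=5)
```

Between two syncs the target weights stay frozen, so the targets produced by `td_target` do not shift with every gradient step of the policy network.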

To train the neural network we need four values: state (S), action (A), reward (R), and future state (S’). These values are stored in a replay memory and then randomly sampled for training. This process is called experience replay, and DeepMind adopted it for DQN as well.
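
A replay memory can be sketched with the standard library alone. The class name `ReplayMemory`, the `Transition` tuple, and the extra `done` flag (to mark terminal transitions) are assumptions for illustration, not the original code.

```python
import random
from collections import deque, namedtuple

# One stored experience; `done` is an assumed extra field for terminal states.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayMemory:
    """Fixed-size buffer that drops the oldest experience when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the
        # correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy usage: integers stand in for real CartPole observations.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = memory.sample(2)
```

Training typically starts only once the memory holds at least one full batch.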

First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. To perform experience replay we store the agent’s experiences e_t = (s_t, a_t, r_t, s_{t+1})
The results of using experience replay and target network[3]

The Deep Q-Learning training process with experience replay and target network


Implementation details

1. Environment

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.[4]

2. Network

The architecture is based on fully connected (dense) layers with the ReLU activation function. The output layer is a fully connected linear layer with two outputs, one for each action.
As in many reinforcement learning papers, I used the RMSProp optimizer.
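
To make the shape of the network concrete, here is a dependency-free sketch of the forward pass only. The hidden layer sizes (24 and 24), the weight initialization, and the helper names `dense`, `init_layer` and `q_values` are assumptions; in practice a deep learning framework would handle this, together with the RMSProp training step that is not shown.

```python
import random


def dense(inputs, weights, biases, relu):
    """Fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    outputs = []
    for j in range(len(biases)):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(max(0.0, z) if relu else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def q_values(state, layers):
    """Forward pass: ReLU on the hidden layers, linear output layer."""
    x = state
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        x = dense(x, weights, biases, relu=not is_output)
    return x


rng = random.Random(0)
# 4 CartPole state variables -> 24 -> 24 -> 2 actions (hidden sizes assumed).
layers = [init_layer(4, 24, rng), init_layer(24, 24, rng), init_layer(24, 2, rng)]
q = q_values([0.01, -0.02, 0.03, 0.04], layers)
```

The output `q` holds one estimated Q-value per action, and the greedy action is the index of the larger one.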

3. Hyperparameters

4. Code

A version with plots is available on GitHub.

  • defining models
  • experience replay
  • epsilon with coefficient and with a, b, c parameters to control the shape of the function
  • choosing an action
  • training function
  • training loop
  • testing loop
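
As a rough illustration of two of the pieces listed above, here is a sketch of epsilon decay with a multiplicative coefficient and of epsilon-greedy action selection. The function names and the default values (`decay=0.995`, `epsilon_min=0.01`) are assumptions, and the a, b, c variant of the epsilon function is not reproduced here.

```python
import random


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Multiply epsilon by a coefficient, but never go below the minimum."""
    return max(epsilon_min, epsilon * decay)


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Toy usage: after enough steps epsilon settles at its minimum value.
epsilon = 1.0
for _ in range(2000):
    epsilon = decay_epsilon(epsilon)

greedy_action = choose_action([0.1, 0.9], epsilon=0.0)
```

With `epsilon=0.0` the choice is always greedy, which is the natural setting for a testing loop.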


The plot on the left shows the epsilon value decayed on every iteration within an episode. The plot on the right shows the epsilon function defined by three parameters to achieve a step-function shape.
Reaching the maximum score in an episode is closely tied to the epsilon value. Once the randomness of the actions is reduced, the neural network starts to learn. A minimum epsilon value is kept to prevent the memorization of stochastic state transitions, i.e. overfitting.

Training graphs
Testing on 100 episodes using the model from the right-hand plot above

The training process took roughly 4 hours on an Intel Core i5-10210U CPU, and the model seems to solve the environment.


The problem described here uses low-dimensional input, unlike most breakthrough models, which take raw images as input and extract all the features themselves. Nevertheless, it is a good playground for understanding how beautiful and powerful the idea of Deep Q-Learning is. To reduce the training time, I would go further and try different shapes of the epsilon function. Another important hyperparameter is the target model update frequency. The periodic copy can be replaced by a soft update, where we do not update the target network all at once, but frequently and by a small amount[5]. Also, prioritizing experiences from the replay memory can improve the effectiveness of the training process[6].
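
The soft update mentioned above is commonly written as target = (1 - tau) * target + tau * policy. A minimal sketch, assuming the weights are flat lists of floats and using a hypothetical `soft_update` helper with an assumed mixing factor `tau`:

```python
def soft_update(target_weights, policy_weights, tau=0.01):
    """Move each target weight a small fraction `tau` toward the policy weight."""
    return [
        (1.0 - tau) * t + tau * p
        for t, p in zip(target_weights, policy_weights)
    ]


# Toy usage with a large tau so the effect is easy to see.
target = [0.0, 0.0]
policy = [1.0, -1.0]
for _ in range(3):
    target = soft_update(target, policy, tau=0.5)
```

In this scheme the update runs at every training step with a small `tau`, instead of copying all weights every C steps.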








Solving Open AI’s CartPole using Reinforcement Learning Part-2 was originally published in Analytics Vidhya on Medium.