A method that accelerates the training of deep neural networks by reducing Internal Covariate Shift

Internal Covariate Shift?

Covariate shift means that the distribution of the inputs changes while the conditional distribution of the outputs given the inputs stays the same. It is one kind of non-stationarity. For example, if you train a classifier only on images of black cats, it will perform poorly when it is shown images of cats of other colors.
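Covariate shift can be sketched with a toy regression problem. This example is an illustration and not from the original post: the rule y = x² never changes, a straight line is fitted on inputs from [0, 1], and the same line is then evaluated on inputs from [2, 3]. The function names (fit_line, mse, target) are made up for this sketch.

```python
# Toy illustration of covariate shift: p(y | x) is fixed (y = x**2),
# only the distribution of the inputs x changes between train and test.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def mse(a, b, xs, ys):
    """Mean squared error of the line on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def target(x):
    return x ** 2  # the input-output rule never changes

train_x = [i / 10 for i in range(11)]        # inputs in [0, 1]
shifted_x = [2 + i / 10 for i in range(11)]  # inputs in [2, 3]

a, b = fit_line(train_x, [target(x) for x in train_x])
train_error = mse(a, b, train_x, [target(x) for x in train_x])
shifted_error = mse(a, b, shifted_x, [target(x) for x in shifted_x])

print("error on training inputs:", train_error)
print("error on shifted inputs: ", shifted_error)
```

The line fits well where it was trained and badly on the shifted inputs, even though the relationship between x and y is identical in both ranges.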

In a deep neural network, as the parameters of the earlier layers change, the distribution of each layer’s inputs (the activations of the previous layers) also changes. This is called “internal covariate shift”. The image above illustrates the phenomenon: each person has different levels of inputs and outputs. To keep training stable, we set a considerably lower learning rate to limit the effect of internal covariate shift, but this leads to a longer training time.
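A small sketch of this effect, assuming a single ReLU unit with hypothetical weights: the inputs stay exactly the same, yet the activations passed to the next layer change their mean and spread once the unit's parameters are updated.

```python
import statistics

def relu(z):
    return max(0.0, z)

def layer_output(inputs, weight, bias):
    """Activations of a one-unit ReLU layer; these feed the next layer."""
    return [relu(weight * x + bias) for x in inputs]

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# The same inputs, before and after a (hypothetical) parameter update.
before = layer_output(inputs, weight=0.5, bias=0.0)
after = layer_output(inputs, weight=2.0, bias=1.0)

before_mean, before_std = statistics.mean(before), statistics.pstdev(before)
after_mean, after_std = statistics.mean(after), statistics.pstdev(after)

print("before update: mean", before_mean, "std", before_std)
print("after update:  mean", after_mean, "std", after_std)
```

The next layer has to keep adapting to this moving distribution, which is what a small learning rate tries to dampen.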

Batch Normalization

To reduce internal covariate shift, we use batch normalization. First, we calculate the mean 𝜇 and the standard deviation 𝜎 of each layer’s outputs (before the non-linearity) over the current batch. Note that batch normalization does not work well with small batches, because the batch statistics serve as estimates of the statistics of the whole training set, and a small batch gives a noisy estimate. (Calculating 𝜇 and 𝜎 over all of the training data during training is impractical.)
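The remark about small batches can be illustrated with a short experiment. This is a sketch on synthetic Gaussian data with a fixed seed. The batch sizes 2 and 64 and the helper name spread_of_batch_means are arbitrary choices for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic "layer outputs" with true mean 0 and true standard deviation 1.
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def spread_of_batch_means(values, batch_size, n_batches=200):
    """Standard deviation of the per-batch means: how noisy the estimate of mu is."""
    means = []
    for _ in range(n_batches):
        batch = random.sample(values, batch_size)
        means.append(statistics.mean(batch))
    return statistics.pstdev(means)

spread_small = spread_of_batch_means(data, batch_size=2)
spread_large = spread_of_batch_means(data, batch_size=64)

print("spread of batch means, batch size 2: ", spread_small)
print("spread of batch means, batch size 64:", spread_large)
```

The estimate of 𝜇 from batches of size 2 fluctuates far more than the estimate from batches of size 64, so the normalization itself becomes noisy when batches are small.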

Then we normalize each layer’s outputs with these statistics and feed the result into the non-linearity. (The model can also learn a scale 𝛾 and a shift 𝛽, which do not depend on the input data, in case a mean of 0 and a standard deviation of 1 are not what we want for that layer.)
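The normalization step, including the learned 𝛾 and 𝛽, can be sketched in plain Python for a single feature. The small constant eps is an assumption added here to avoid division by zero, and the function name batch_norm_forward is illustrative.

```python
import math

def batch_norm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n  # biased (population) variance
    normalized = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]

pre_activations = [2.0, 4.0, 6.0, 8.0]

# Default gamma and beta: output has mean ~0 and standard deviation ~1.
plain = batch_norm_forward(pre_activations)
# gamma=2, beta=3: output has mean ~3 and standard deviation ~2.
rescaled = batch_norm_forward(pre_activations, gamma=2.0, beta=3.0)

print(plain)
print(rescaled)
```

With 𝛾 = 1 and 𝛽 = 0 the layer outputs are simply standardized. Other values let the network move away from that default when it helps.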

At the testing stage, we may not have batches from which to calculate 𝜇 and 𝜎, so the practical solution is to keep a moving average of the per-batch 𝜇 and 𝜎 during training and use it at test time.
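One possible way to keep such moving averages is sketched below for a single feature. The class name RunningBatchNorm, the momentum of 0.9, and the choice to track the variance rather than 𝜎 itself are assumptions of this sketch, not details from the post.

```python
import math

class RunningBatchNorm:
    """Single-feature batch norm that tracks moving averages for test time."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum  # weight kept from the old running value
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def train_step(self, batch):
        n = len(batch)
        mu = sum(batch) / n
        var = sum((x - mu) ** 2 for x in batch) / n
        # Exponential moving average of the batch statistics.
        m = self.momentum
        self.running_mean = m * self.running_mean + (1 - m) * mu
        self.running_var = m * self.running_var + (1 - m) * var
        # During training the current batch statistics are used.
        return [(x - mu) / math.sqrt(var + self.eps) for x in batch]

    def predict(self, x):
        # At test time a single example is normalized with the stored averages.
        return (x - self.running_mean) / math.sqrt(self.running_var + self.eps)

bn = RunningBatchNorm()
for _ in range(200):
    bn.train_step([2.0, 4.0, 6.0, 8.0])  # every batch has mean 5 and variance 5

print(bn.running_mean, bn.running_var)
print(bn.predict(5.0))
```

After enough training steps the stored averages approach the batch statistics, so a single test example can be normalized without needing a batch.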

Benefits of batch normalization

  1. Faster training, because we can set a larger learning rate without worrying as much about internal covariate shift.
  2. Fewer vanishing gradients, especially for activation functions with saturation regions (sigmoid, tanh, etc.).
  3. Less sensitivity to initialization.
  4. Lower risk of overfitting, because normalizing the data makes the model less affected by noise at test time.
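The point about saturating activations can be made concrete with a small sketch. The pre-activation values are hypothetical and chosen to sit deep in the sigmoid's flat region. The helper names are illustrative.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid, s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def normalize(batch, eps=1e-5):
    """Standardize a batch of values to mean ~0 and standard deviation ~1."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    return [(x - mu) / math.sqrt(var + eps) for x in batch]

# Hypothetical pre-activations that sit deep in the sigmoid's saturated region.
pre_activations = [6.0, 7.0, 8.0, 9.0, 10.0]

grad_raw = [sigmoid_grad(z) for z in pre_activations]
grad_normalized = [sigmoid_grad(z) for z in normalize(pre_activations)]

mean_grad_raw = sum(grad_raw) / len(grad_raw)
mean_grad_normalized = sum(grad_normalized) / len(grad_normalized)

print("mean sigmoid gradient without normalization:", mean_grad_raw)
print("mean sigmoid gradient with normalization:   ", mean_grad_normalized)
```

Normalizing moves the values back toward zero, where the sigmoid is steepest, so the gradient passed to earlier layers is much larger.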

Implementation in TensorFlow

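import tensorflow as tf  # written against the TensorFlow 1.x graph-mode API

def batch_normalization_layer(x, n_out, phase_train):
    # x is assumed to be a 4-D tensor with n_out channels last;
    # phase_train is a boolean tensor that is True during training.
    beta = tf.Variable(tf.constant(0.0, shape=[n_out]))   # learned shift
    gamma = tf.Variable(tf.constant(1.0, shape=[n_out]))  # learned scale
    # Per-channel mean and variance over the first three axes
    batch_mean, batch_var = tf.nn.moments(x, [0, 1, 2])
    ema = tf.train.ExponentialMovingAverage(decay=0.5)

    def mean_var_with_update():
        # Update the moving averages, then return the current batch statistics
        ema_apply_op = ema.apply([batch_mean, batch_var])
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    # Training: use batch statistics; testing: use the moving averages
    mean, var = tf.cond(phase_train,
                        mean_var_with_update,
                        lambda: (ema.average(batch_mean), ema.average(batch_var)))
    return tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)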




Source: Deep Learning on Medium