Building a neural network involves several required steps, and one of them is initializing its parameters. Initialization is an essential part of training a neural network (NN). Here is a common training process for neural networks:

1. Initialize parameters

2. Choose an optimization algorithm

3. Repeat these steps:

3.1 Forward propagate an input

3.2 Compute the cost function

3.3 Compute the gradient of the cost with respect to the parameters using backpropagation

3.4 Update each parameter using the gradients, according to the optimization algorithm
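
The steps above can be sketched as a short PyTorch loop. This is only an illustration under stated assumptions: the toy regression data, the single Linear layer, the SGD optimizer, and the learning rate are all chosen here for the example and are not prescribed by the process itself.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the sketch reproducible

# Toy regression data (assumed for illustration): y = 3x + 1 plus a little noise
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1 + 0.05 * torch.randn_like(x)

# 1. Initialize parameters (nn.Linear does this inside its constructor)
model = nn.Linear(1, 1)

# 2. Choose an optimization algorithm
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):          # 3. Repeat these steps
    pred = model(x)              # 3.1 forward propagate an input
    loss = loss_fn(pred, y)      # 3.2 compute the cost function
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss.backward()              # 3.3 compute gradients using backpropagation
    optimizer.step()             # 3.4 update each parameter using the gradients
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Step 1 happens implicitly here; the rest of this article is about what to put in its place.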

There are several methods for initializing weights:

Setting all weights to 0 or 1:

Initializing the weights of an NN with zeros or any other constant leads the neurons to learn the same features during training. Imagine a model with two hidden layers where the biases are initialized to 0 and the weights to some constant A. Within each layer, every neuron then computes the same output and receives the same gradient, so all neurons keep learning the same thing.

When all weights are set to 0, the activations in each hidden layer are also identical. Backpropagation then has no way to tell the neurons apart: every weight in a layer gets the same update, so the layer behaves like a single neuron and the loss is difficult to minimize. Initializing all weights to 1 leads to the same problem.
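
The symmetry problem can be checked directly. The sketch below assumes a small network (4 inputs, 3 hidden units, 1 output), random data, and the constant 0.5, all made up for illustration. It fills every weight with the same constant and compares the gradient each hidden neuron receives.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small network, assumed for illustration: 4 inputs, 3 hidden units, 1 output
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))

# Constant initialization: every weight is 0.5, every bias is 0
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.constant_(layer.weight, 0.5)
        nn.init.zeros_(layer.bias)

x = torch.randn(8, 4)  # random inputs
y = torch.randn(8, 1)  # random targets
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient of the first layer's weights: shape (3, 4), one row per hidden neuron
hidden_grad = model[0].weight.grad
rows_identical = (
    torch.allclose(hidden_grad[0], hidden_grad[1])
    and torch.allclose(hidden_grad[0], hidden_grad[2])
)

print(hidden_grad)
print("all hidden neurons receive the same gradient:", rows_identical)
```

Because the rows match, a gradient step leaves the three hidden neurons identical to each other, and this repeats at every later step.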

In PyTorch, the torch.nn.init module is used to initialize the weights of layers, for example to change the initialization method of a Linear layer.
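
A minimal sketch of how this might look is given below. The layer sizes, the range [-0.1, 0.1], and the helper name init_weights are assumptions made for the example; nn.init.uniform_, nn.init.zeros_, and Module.apply are existing PyTorch calls.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=5, out_features=3)

# Functions in nn.init whose names end with "_" modify the tensor in place
nn.init.uniform_(layer.weight, a=-0.1, b=0.1)  # weights drawn from U(-0.1, 0.1)
nn.init.zeros_(layer.bias)                      # biases set to 0


def init_weights(module):
    """Hypothetical helper: re-initialize every Linear layer it is given."""
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, a=-0.1, b=0.1)
        nn.init.zeros_(module.bias)


model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1))
model.apply(init_weights)  # apply() calls the function on every submodule

print(layer.weight)
```

Using apply() keeps the initialization logic in one place, which is convenient when a model contains many layers of the same type.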

Uniform Distribution

Another option is to draw the weights randomly from a uniform distribution, in which every number in the chosen range has an equal probability of being picked. In PyTorch, the Linear layer is initialized this way by default, using nn.init.kaiming_uniform_.
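
The default can be inspected on a freshly created layer. The sketch below assumes a recent PyTorch version, where the default call to kaiming_uniform_ for Linear weights works out to a uniform range of roughly [-1/sqrt(fan_in), 1/sqrt(fan_in)]; the layer size is arbitrary.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=256, out_features=128)  # default initialization

fan_in = layer.in_features
bound = 1 / math.sqrt(fan_in)  # assumed limit of the default uniform range

w = layer.weight.detach()
print(f"min={w.min().item():.4f}  max={w.max().item():.4f}  bound={bound:.4f}")
print(f"mean={w.mean().item():.4f}")
```

The weights should fill the range on both sides of zero, with a mean close to 0.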

[Figure: results after 2 epochs]

General Rule

The main idea of the general rule is to initialize the weights close to 0 without making them too small: they are drawn from the range [-y, y], where y = 1/sqrt(n) and n is the number of inputs to the neuron.
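
One way to apply the rule in PyTorch is sketched below. It assumes the weights are drawn uniformly from [-y, y] and the biases are set to 0; the helper name general_rule_init and the layer sizes are made up for the example.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)


def general_rule_init(module):
    """Hypothetical helper: draw weights from U(-y, y) with y = 1/sqrt(n)."""
    if isinstance(module, nn.Linear):
        n = module.in_features        # number of inputs to each neuron
        y = 1.0 / math.sqrt(n)
        nn.init.uniform_(module.weight, -y, y)
        nn.init.zeros_(module.bias)   # assumption: biases start at 0


model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(general_rule_init)

for layer in model:
    if isinstance(layer, nn.Linear):
        y = 1.0 / math.sqrt(layer.in_features)
        max_w = layer.weight.abs().max().item()
        print(f"inputs={layer.in_features}  y={y:.4f}  max |w|={max_w:.4f}")
```

Layers with more inputs get a narrower range, which keeps the sum over many weighted inputs from growing with the layer width.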

Xavier Initialization

Xavier initialization (Glorot & Bengio, 2010) is a heuristic that keeps the variance of a layer's output the same as the variance of its input. This keeps the variance of the signal roughly constant throughout the network.

PyTorch offers uniformly and normally distributed initializations for the Xavier heuristic: nn.init.xavier_uniform_ and nn.init.xavier_normal_.

The gain value depends on the type of nonlinearity used in the layer and can be obtained with the torch.nn.init.calculate_gain() function in PyTorch. Both Xavier functions default to gain=1; for ReLU, calculate_gain('relu') returns sqrt(2).
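
A short sketch of both Xavier variants follows, assuming a tanh layer with arbitrary sizes. The formulas in the comments are the ones given in the PyTorch documentation for xavier_uniform_ and xavier_normal_; the variable names are chosen for the example.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=300, out_features=100)
fan_in, fan_out = layer.in_features, layer.out_features

# Gain for the nonlinearity that follows the layer (tanh assumed here)
gain = nn.init.calculate_gain("tanh")

# Uniform variant: weights in [-a, a] with a = gain * sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(layer.weight, gain=gain)
a = gain * math.sqrt(6.0 / (fan_in + fan_out))
max_abs = layer.weight.abs().max().item()

# Normal variant: mean 0 and std = gain * sqrt(2 / (fan_in + fan_out))
nn.init.xavier_normal_(layer.weight, gain=gain)
expected_std = gain * math.sqrt(2.0 / (fan_in + fan_out))
empirical_std = layer.weight.std().item()

print(f"gain={gain:.4f}")
print(f"uniform: max |w|={max_abs:.4f}  bound a={a:.4f}")
print(f"normal:  std={empirical_std:.4f}  expected={expected_std:.4f}")
```

Both variants target the same weight variance; they differ only in the shape of the distribution the weights are drawn from.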