The "Ada" in AdaGrad stands for "Adaptive". Adaptive here means that the learning rate is adjusted adaptively during training, and that each parameter gets its own individual adjustment, rather than all parameters sharing a single learning rate that is adjusted together.

In neural network training, the learning rate strongly affects the result. If it is too high, each step is too large: training easily diverges, the parameters jump around, and convergence is slow or impossible. If it is too low, learning is slow and inefficient. Gradually reducing the learning rate as training proceeds is therefore a natural remedy, known as learning rate decay.
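As a minimal sketch of plain learning rate decay (the schedule and constants here are assumptions for illustration, not from the original post), a common form divides the initial rate by a factor that grows with the iteration count:

```python
# Hypothetical decay schedule: eta_t = eta_0 / (1 + k * t)
eta0 = 0.5   # initial learning rate (assumed value)
k = 0.1      # decay coefficient (assumed value)

def decayed_lr(t, eta0=eta0, k=k):
    # learning rate after t iterations; shrinks monotonically
    return eta0 / (1 + k * t)

rates = [decayed_lr(t) for t in range(0, 50, 10)]
# every later rate is strictly smaller than the one before it
assert all(a > b for a, b in zip(rates, rates[1:]))
```

Note that this schedule shrinks the rate for all parameters at once; AdaGrad's refinement below is to give each parameter its own shrinking rate.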

AdaGrad takes this idea further by giving each parameter its own "customized" learning rate. Its update rule can be written as:

h ← h + (∂L/∂W) ⊙ (∂L/∂W)

W ← W − η · (1 / √h) ⊙ (∂L/∂W)

where L is the loss function, W is the weight matrix, η is the learning rate, and ⊙ denotes element-wise multiplication.

The second term of the first formula accumulates the squared gradients of the loss with respect to each parameter. Dividing the learning rate by √h in the second formula means that, as learning proceeds, the effective learning rate keeps shrinking, and it shrinks most for the parameters whose gradients have been largest. Beauty of Mathematics, Power of Beauty!
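The per-parameter effect of this rule can be seen in a tiny NumPy sketch (the gradient values are made up for illustration). With a constant gradient g, after n steps the effective step is g / √(n·g²) = 1/√n, so the step size is normalized regardless of the raw gradient magnitude and decays like 1/√n:

```python
import numpy as np

eta = 1.0                     # initial learning rate
w = np.array([1.0, 1.0])      # two parameters
h = np.zeros_like(w)          # accumulated squared gradients, one per parameter

grad = np.array([3.0, 0.1])   # parameter 0 receives much larger gradients
for _ in range(3):
    h += grad * grad                       # accumulate squared gradient
    w -= eta * grad / (np.sqrt(h) + 1e-7)  # per-parameter scaled step

# effective step after 3 updates: both are about 1/sqrt(3),
# even though the raw gradients differ by a factor of 30
step = eta * grad / (np.sqrt(h) + 1e-7)
```

This is why AdaGrad can start from a large initial rate: the accumulated history automatically tames the parameters with large gradients.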

The experiment below uses the same test function and plots as my other two blog posts on parameter optimization, so you can compare the convergence behavior of the four algorithms on this function.

With AdaGrad, the initial learning rate should be set relatively large; as learning proceeds, the learning rate adjusts and decreases on its own. The plot shows clearly that the closer the trajectory gets to the minimum, the smaller each step becomes. Compared with SGD and the momentum method, AdaGrad performs very well on this function.

```python
# AdaGrad.py
import numpy as np
import matplotlib.pyplot as plt


class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # running sum of squared gradients, one entry per parameter

    def update(self, params, grads):
        if self.h is None:  # first call: initialize h with zeros
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[int(key)] * grads[int(key)]
            params[key] -= self.lr * grads[int(key)] / (np.sqrt(self.h[key]) + 1e-7)
        return params


def numerical_gradient(f, x):
    h = 1e-4
    x = np.array(list(x.values()))  # convert the parameter dict to an ndarray
    grad = np.zeros_like(x)
    for idx in range(x.size):
        temp = x[idx]
        x[idx] = temp + h
        fxh1 = f(x)
        x[idx] = temp - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)  # central difference
        x[idx] = temp
    return grad


def func2(x):
    return x[0] ** 2 / 20 + x[1] ** 2


def adagrad_update(init_x, stepnum):
    x = init_x
    x_history = []
    for i in range(stepnum):
        x_history.append(np.array(list(x.values())))
        grad = numerical_gradient(func2, x)
        x = m.update(x, grad)
    return x, np.array(x_history)


init_x = {'0': -7.0, '1': 2.0}  # starting point
learning_rate = 0.9
m = AdaGrad(lr=learning_rate)
stepnum = 45
x, x_history = adagrad_update(init_x=init_x, stepnum=stepnum)

axis_range = 10
xs = np.arange(-axis_range, axis_range, 0.05)
ys = np.arange(-axis_range, axis_range, 0.05)
X, Y = np.meshgrid(xs, ys)
z = np.array([X, Y])

# contour of the objective function
plt.figure()
plt.contour(xs, ys, func2(z), np.arange(0, 10, 2), cmap='binary')
# all points visited by gradient descent
plt.plot(x_history[:, 0], x_history[:, 1], '+', color='blue')
# connect consecutive points
for i in range(x_history.shape[0] - 1):
    tmp = x_history[i:i + 2].T
    plt.plot(tmp[0], tmp[1], color='blue')
# mark the minimum at the origin
plt.plot(0, 0, 'o', color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.title('AdaGrad 0.05x^2 + y^2')
plt.show()
```