In neural network learning, the learning rate has a large effect on training. If the learning rate is too high, each step is too large: the parameters jump around, the training tends to diverge, and convergence is slow if it happens at all. If the learning rate is too small, learning is slow and inefficient. A natural remedy is to reduce the learning rate adaptively as training proceeds, which is known as learning rate decay.
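For instance, a simple inverse-time schedule shrinks the rate in proportion to 1/(1 + kt). The sketch below is only illustrative; the `decayed_lr` helper and its constants are made up for this example and are not part of the AdaGrad code later in the post.

```python
# Illustrative sketch of plain learning-rate decay (hypothetical helper, not AdaGrad):
# the step size lr_t = lr_0 / (1 + decay_rate * t) shrinks as training proceeds.
def decayed_lr(lr0, decay_rate, t):
    return lr0 / (1 + decay_rate * t)

for t in range(5):
    print(round(decayed_lr(0.9, 0.5, t), 4))  # 0.9, 0.6, 0.45, 0.36, 0.3
```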

Adagrad develops this idea further by giving each parameter its own "customized" learning rate. In the formulas below, L is the loss function and W is the weight matrix.
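Written out in the standard textbook form of AdaGrad (with η the learning rate and ⊙ denoting element-wise multiplication), the two update formulas are:

$$
h \leftarrow h + \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}
$$

$$
W \leftarrow W - \eta \, \frac{1}{\sqrt{h}} \, \frac{\partial L}{\partial W}
$$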
The second term of the first formula accumulates the squares of the gradients of the loss with respect to the weight parameters. Because the second formula divides the learning rate by the square root of h, the effective step size shrinks as learning proceeds, and it shrinks most for the parameters whose gradients have been large, which is exactly the per-parameter "customization". Beauty of Mathematics, Power of Beauty!

The test function and its plots below also appear in my other two posts on parameter optimization, so you can compare how the four algorithms converge on the same function.

With Adagrad, the initial learning rate needs to be set relatively large; as learning proceeds, the learning rate adjusts itself downward. The plot clearly shows that the closer the trajectory gets to the minimum, the smaller each step becomes. Compared with SGD and the momentum method, Adagrad converges very well on this function.

```python
# AdaGrad.py
import numpy as np
import matplotlib.pyplot as plt

class AdaGrad:
    """Adaptive per-parameter learning rate: divide lr by the root of the accumulated squared gradients."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # running sum of squared gradients, one entry per parameter

    def update(self, params, grads):
        # params is a dict keyed by the strings '0', '1', ...; grads is an ndarray, hence grads[int(key)]
        if self.h is None:  # first call
            self.h = {}
            for key, val in params.items():  # initialize dictionary variable h
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[int(key)] * grads[int(key)]  # accumulate the squared gradient
            params[key] -= self.lr * grads[int(key)] / (np.sqrt(self.h[key]) + 1e-7)

        return params

def numerical_gradient(f, init_x):
    """Central-difference numerical gradient of f at the point stored in the dict init_x."""
    h = 1e-4
    x = np.array(list(init_x.values()))  # convert to ndarray
    grad = np.zeros_like(x)

    for idx in range(x.size):
        temp = x[idx]
        x[idx] = temp + h
        fxh1 = f(x)

        x[idx] = temp - h
        fxh2 = f(x)

        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = temp  # restore the original value

    return grad

def func2(x):
    """f(x, y) = x^2 / 20 + y^2, an elongated bowl whose minimum is at the origin."""
    return (x[0] ** 2) / 20 + x[1] ** 2

def gradient_descent(f, init_x, lr, stepnum):
    """Minimize f with AdaGrad starting from init_x, recording every point visited."""
    optimizer = AdaGrad(lr=lr)
    x = init_x
    x_history = []

    for i in range(stepnum):
        x_history.append(np.array(list(x.copy().values())))  # record the current point
        grads = numerical_gradient(f, x)
        optimizer.update(x, grads)

    return x, np.array(x_history)

init_x = {}  # starting point
init_x['0'] = -7.0
init_x['1'] = 2.0
learning_rate = 0.9  # AdaGrad tolerates a relatively large initial learning rate
stepnum = 45

x, x_history = gradient_descent(func2, init_x, learning_rate, stepnum)

axis_range = 10
x = np.arange(-axis_range, axis_range, 0.05)
y = np.arange(-axis_range, axis_range, 0.05)
X, Y = np.meshgrid(x, y)
z = np.array([X, Y])

# contour plot of func2
plt.figure()
plt.contour(x, y, func2(z), np.arange(0, 10, 2), cmap='binary')
# Draw all points found by gradient descent
plt.plot(x_history[:, 0], x_history[:, 1], '+', color='blue')

# connect consecutive points with line segments
for i in range(x_history.shape[0] - 1):
    tmp = x_history[i:i + 2].T
    plt.plot(tmp[0], tmp[1], color='blue')
# mark the position of the minimum (the origin)
plt.plot(0, 0, 'o', color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```