The "Ada" in AdaGrad stands for "adaptive". Adaptive here means that the learning rate of the neural network is adjusted automatically during training, and that each parameter gets its own adjustment instead of all parameters sharing one learning rate that changes for all of them at once.

In neural network training, the learning rate has a large effect on the result. If it is too high, each step is too large and training tends to diverge or jump around instead of converging; if it is too small, learning is slow and inefficient. Gradually reducing the learning rate as training proceeds is therefore a natural solution, known as learning rate decay.
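As a rough illustration of learning rate decay with a single shared rate, here is a minimal sketch. The inverse-time schedule, the function name `decayed_lr`, and the `decay_rate` value are assumptions chosen for illustration; they do not come from this post.

```python
def decayed_lr(initial_lr, step, decay_rate=0.1):
    """Inverse-time decay: one shared learning rate that shrinks with the step count."""
    return initial_lr / (1.0 + decay_rate * step)


# Learning rate at steps 0, 10, 20, 30, 40 (hypothetical settings).
schedule = [decayed_lr(0.9, step) for step in range(0, 50, 10)]
```

Every parameter is scaled by the same value here, which is exactly the restriction that AdaGrad removes.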

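AdaGrad develops this idea further by giving each parameter its own "customized" learning rate:

$$h \leftarrow h + \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}$$

$$W \leftarrow W - \eta \, \frac{1}{\sqrt{h}} \, \frac{\partial L}{\partial W}$$

L is the loss function.
W is the weight matrix.
η is the learning rate, and ⊙ denotes element-wise multiplication.
The second term of the first formula is the element-wise square of the gradient of the loss with respect to the weights, so h accumulates the sum of squared gradients over all steps so far. Because the learning rate is divided by the square root of h, the effective learning rate shrinks as learning proceeds, and it shrinks most for the parameters that have received the largest gradients. The beauty of mathematics!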
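To make the accumulate-then-scale update concrete, here is a small numerical sketch for two parameters. The loss f(x, y) = x**2 / 20 + y**2, the starting point, and the names `effective_lrs` and `scaled_lr` are assumptions chosen for illustration; the small constant is placed inside the square root, as in the script later in this post.

```python
import numpy as np

lr = 0.9      # base learning rate (eta)
eps = 1e-7    # small constant that avoids division by zero

w = np.array([-7.0, 2.0])  # two parameters, updated together
h = np.zeros_like(w)       # running sum of squared gradients, one entry per parameter


def grad(w):
    # Analytic gradient of the assumed loss f(x, y) = x**2 / 20 + y**2.
    return np.array([w[0] / 10.0, 2.0 * w[1]])


effective_lrs = []
for step in range(3):
    g = grad(w)
    h += g * g                          # accumulate element-wise squared gradients
    scaled_lr = lr / np.sqrt(h + eps)   # per-parameter effective learning rate
    effective_lrs.append(scaled_lr.copy())
    w -= scaled_lr * g                  # each parameter takes its own step size

effective_lrs = np.array(effective_lrs)  # shape (3, 2): one row per step
```

Each column of `effective_lrs` decreases from row to row, and the column for the steep direction (the second parameter) starts out smaller than the one for the flat direction.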

The test function and its plot below are the same ones used in my other two blog posts on parameter optimization, so you can compare how the four algorithms converge on this function.

With AdaGrad, the initial learning rate needs to be set relatively large; as learning proceeds, the effective learning rate then decreases by itself. The plot shows clearly that the closer the point gets to the minimum, the smaller each step becomes. Compared with SGD and the momentum method, AdaGrad works very well on this function.
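For a side-by-side look at plain SGD and AdaGrad on this kind of function, the following sketch may help. It uses the analytic gradient instead of a numerical one, and the helper names `grad_f`, `run_sgd`, and `run_adagrad`, as well as giving SGD the same learning rate of 0.9, are assumptions made for illustration only.

```python
import numpy as np


def grad_f(p):
    # Analytic gradient of f(x, y) = x**2 / 20 + y**2.
    return np.array([p[0] / 10.0, 2.0 * p[1]])


def run_sgd(start, lr, steps):
    p = np.array(start, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p = p - lr * grad_f(p)  # same learning rate for every parameter
        path.append(p.copy())
    return np.array(path)


def run_adagrad(start, lr, steps, eps=1e-7):
    p = np.array(start, dtype=float)
    h = np.zeros_like(p)  # per-parameter sum of squared gradients
    path = [p.copy()]
    for _ in range(steps):
        g = grad_f(p)
        h += g * g
        p = p - lr * g / np.sqrt(h + eps)  # per-parameter scaled step
        path.append(p.copy())
    return np.array(path)


sgd_path = run_sgd([-7.0, 2.0], lr=0.9, steps=45)
ada_path = run_adagrad([-7.0, 2.0], lr=0.9, steps=45)

# Euclidean length of each AdaGrad step, in order.
ada_step_lengths = np.linalg.norm(np.diff(ada_path, axis=0), axis=1)
```

In this setup the y-coordinate of `sgd_path` flips sign at every step, while `ada_path` stays on one side of the valley and `ada_step_lengths` shrinks from one step to the next.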

import numpy as np
import matplotlib.pyplot as plt

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr  # base learning rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:  # First call
            self.h = {}
            for key, val in params.items():  # Initialize dictionary variable h
                self.h[key] = np.zeros_like(val)

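        for key in params.keys():
            g = grads[int(key)]  # keys are '0', '1', ... and index into the gradient array
            self.h[key] += g * g  # accumulate the squared gradient
            params[key] -= self.lr * g / np.sqrt(self.h[key] + 1e-7)  # 1e-7 avoids division by zero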

        return params

def numerical_gradient(f, x):
    h = 1e-4
    x = np.array(list(x.values()), dtype=float)  # Convert the parameter dict to an ndarray
    grad = np.zeros_like(x)

    for idx in range(x.size):
        temp = x[idx]
        x[idx] = temp + h
        fxh1 = f(x)

        x[idx] = temp - h
        fxh2 = f(x)

        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = temp

    return grad

def func2(x):
    return (x[0]**2) / 20 + x[1] ** 2

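def adagrad_update(init_x, stepnum):
    x = init_x  # the dict is updated in place by the optimizer
    x_history = [list(x.values())]  # record the starting point

    for i in range(stepnum):
        grad = numerical_gradient(func2, x)
        x = m.update(x, grad)  # m is the module-level AdaGrad instance
        x_history.append(list(x.values()))  # record the point after each step

    return x, np.array(x_history)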

init_x = {}  # starting point
init_x['0'] = -7.0
init_x['1'] = 2.0
learning_rate = 0.9  
m = AdaGrad(lr=learning_rate)
stepnum = 45  
x, x_history = adagrad_update(init_x=init_x, stepnum=stepnum)

axis_range = 10
x = np.arange(-axis_range, axis_range, 0.05)
y = np.arange(-axis_range, axis_range, 0.05)
X, Y = np.meshgrid(x, y)
z = np.array([X, Y])

# contour
plt.contour(x, y, func2(z), np.arange(0, 10, 2), cmap='binary')
# Draw all points found by gradient descent
plt.plot(x_history[:, 0], x_history[:, 1], '+', color='blue')

# Connect consecutive points with line segments
for i in range(x_history.shape[0] - 1):
    tmp = x_history[i:i+2]
    tmp = tmp.T
    plt.plot(tmp[0], tmp[1], color='blue')
# Marking Minimum Position
plt.plot(0, 0, 'o', color='r')
plt.title('AdaGrad  0.05x^2 + y^2')
plt.show()