- x(t) = x(t-1) - step * f'(x(t-1))
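The Nesterov-accelerated Adaptive Moment Estimation algorithm, or Nadam, is an extension of the Adaptive Moment Estimation (Adam) optimization algorithm that adds Nesterov's Accelerated Gradient (NAG), also called Nesterov momentum, which is an improved type of momentum. More broadly, Nadam is an extension of the gradient descent optimization algorithm. The algorithm was described by Timothy Dozat in the 2016 paper "Incorporating Nesterov Momentum into Adam", although a version of the paper was written up in 2015 as a Stanford project report of the same name.
Momentum adds an exponentially decaying moving average of the gradient (the first moment) to the gradient descent algorithm. This smooths out noisy objective functions and improves convergence. Adam is an extension of gradient descent that adds both a first and a second moment of the gradient and automatically adapts the learning rate for each parameter being optimized.
NAG is an extension of momentum in which the update uses the gradient evaluated at the projected (look-ahead) value of the parameter rather than at its actual current value. In some situations, this slows the search as it approaches the optimum instead of overshooting it.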
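The difference between classical momentum and Nesterov momentum can be easier to see in code. The sketch below is an illustration only, not part of the Nadam implementation in this tutorial: it assumes the one-dimensional function f(x) = x^2, a commonly used velocity formulation of both updates, and hypothetical function names and hyperparameter values chosen for the example.

```python
# Illustrative comparison of classical and Nesterov momentum on f(x) = x^2.
# The function names and hyperparameter values are assumptions for this sketch.

def gradient(x):
    # derivative of f(x) = x^2
    return 2.0 * x

def classical_momentum(x, n_iter, step, mu):
    v = 0.0
    for _ in range(n_iter):
        # the gradient is evaluated at the current position
        v = mu * v - step * gradient(x)
        x = x + v
    return x

def nesterov_momentum(x, n_iter, step, mu):
    v = 0.0
    for _ in range(n_iter):
        # the gradient is evaluated at the projected (look-ahead) position
        v = mu * v - step * gradient(x + mu * v)
        x = x + v
    return x

x_classical = classical_momentum(1.0, 200, 0.1, 0.9)
x_nesterov = nesterov_momentum(1.0, 200, 0.1, 0.9)
print('classical momentum: x = %.8f' % x_classical)
print('nesterov momentum:  x = %.8f' % x_nesterov)
```

The only difference between the two functions is the point at which the gradient is evaluated, which is the change that Nadam carries over into Adam.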
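The Nadam algorithm runs iteratively over time steps t, starting at t = 1, and each iteration computes a new set of parameter values x, e.g. going from x(t-1) to x(t). The algorithm is perhaps easiest to understand if we focus on updating a single parameter, which generalizes to updating all parameters via vector operations. First, the gradient (partial derivative) is calculated for the current time step.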
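- g(t) = f'(x(t-1))
Next, the first moment is updated using the gradient and the hyperparameter "mu".
- m(t) = mu * m(t-1) + (1 - mu) * g(t)
Then the second moment is updated using the "nu" hyperparameter.
- n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
Next, the first moment is bias-corrected using Nesterov momentum.
- mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
The second moment is then bias-corrected.
- nhat = nu * n(t) / (1 - nu)
Finally, the parameter is updated using the step size "alpha" and a small constant "eps" that avoids division by zero.
- x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat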
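- alpha: initial step size (learning rate); a typical value is 0.002.
- mu: decay factor for the first moment (beta1 in Adam); a typical value is 0.975.
- nu: decay factor for the second moment (beta2 in Adam); a typical value is 0.999.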
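To make the update equations concrete, the sketch below applies a single Nadam step to one parameter, following the equations above term by term. The function name nadam_step, the starting value x = 0.5, and the use of f(x) = x^2 (so that the gradient is 2 * x) are assumptions made for this illustration; the default hyperparameters are the typical values listed above.

```python
# One Nadam update for a single parameter, written directly from the equations.
# nadam_step and the starting point are hypothetical names/values for this sketch.
from math import sqrt

def nadam_step(x, m, n, g, alpha=0.002, mu=0.975, nu=0.999, eps=1e-8):
    # m(t) = mu * m(t-1) + (1 - mu) * g(t)
    m = mu * m + (1.0 - mu) * g
    # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
    n = nu * n + (1.0 - nu) * g ** 2
    # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
    mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
    # nhat = nu * n(t) / (1 - nu)
    nhat = nu * n / (1.0 - nu)
    # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
    x = x - alpha / (sqrt(nhat) + eps) * mhat
    return x, m, n

# first step from x = 0.5 on f(x) = x^2, with both moments starting at zero
x0 = 0.5
g1 = 2.0 * x0
x1, m1, n1 = nadam_step(x0, 0.0, 0.0, g1)
print('x(1) = %.6f, m(1) = %.6f, n(1) = %.6f' % (x1, m1, n1))
```

Returning the updated moments alongside the new parameter value lets the caller feed them into the next step, which is how the loop-based implementation later in the tutorial carries state between iterations.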
- # objective function
- def objective(x, y):
-     return x**2.0 + y**2.0
- # 3d plot of the test function
- from numpy import arange
- from numpy import meshgrid
- from matplotlib import pyplot
- # objective function
- def objective(x, y):
-     return x**2.0 + y**2.0
- # define range for input
- r_min, r_max = -1.0, 1.0
- # sample input range uniformly at 0.1 increments
- xaxis = arange(r_min, r_max, 0.1)
- yaxis = arange(r_min, r_max, 0.1)
- # create a mesh from the axis
- x, y = meshgrid(xaxis, yaxis)
- # compute targets
- results = objective(x, y)
- # create a surface plot with the jet color scheme
- figure = pyplot.figure()
- axis = figure.add_subplot(projection='3d')
- axis.plot_surface(x, y, results, cmap='jet')
- # show the plot
- pyplot.show()
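Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.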
- # contour plot of the test function
- from numpy import asarray
- from numpy import arange
- from numpy import meshgrid
- from matplotlib import pyplot
- # objective function
- def objective(x, y):
-     return x**2.0 + y**2.0
- # define range for input
- bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
- # sample input range uniformly at 0.1 increments
- xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
- yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
- # create a mesh from the axis
- x, y = meshgrid(xaxis, yaxis)
- # compute targets
- results = objective(x, y)
- # create a filled contour plot with 50 levels and jet color scheme
- pyplot.contourf(x, y, results, levels=50, cmap='jet')
- # show the plot
- pyplot.show()
The derivative of x^2 is x * 2 in each dimension.
- f(x)= x ^ 2
- f'(x)= x * 2
- # derivative of objective function
- def derivative(x, y):
-     return asarray([x * 2.0, y * 2.0])
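A quick way to gain confidence in a hand-derived gradient is to compare it with a numerical estimate. The sketch below is an optional check and not part of the original tutorial: numerical_derivative, the step h, and the test point (0.3, -0.7) are assumptions made for illustration, and it uses a central finite difference.

```python
# Optional sanity check: compare the analytic gradient with a central difference.
# numerical_derivative, h, and the test point are assumptions for this sketch.
from numpy import asarray
from numpy import allclose

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# central finite-difference estimate of the gradient
def numerical_derivative(x, y, h=1e-6):
    dx = (objective(x + h, y) - objective(x - h, y)) / (2.0 * h)
    dy = (objective(x, y + h) - objective(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

analytic = derivative(0.3, -0.7)
numeric = numerical_derivative(0.3, -0.7)
print('analytic: %s' % analytic)
print('numeric:  %s' % numeric)
print('match: %s' % allclose(analytic, numeric, atol=1e-6))
```

If the two results disagree by more than a small tolerance, the analytic derivative is the first place to look for a mistake.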
- # generate an initial point
- x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
- score = objective(x[0], x[1])
- # initialize decaying moving averages
- m = [0.0 for _ in range(bounds.shape[0])]
- n = [0.0 for _ in range(bounds.shape[0])]
Next, we run the algorithm for a fixed number of iterations, defined by the "n_iter" hyperparameter.
- ...
- # run iterations of gradient descent
- for t in range(n_iter):
-     ...
- ...
- # calculate gradient g(t)
- g = derivative(x[0], x[1])
- ...
- # build a solution one variable at a time
- for i in range(x.shape[0]):
-     # m(t) = mu * m(t-1) + (1 - mu) * g(t)
-     m[i] = mu * m[i] + (1.0 - mu) * g[i]
-     # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
-     n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
-     # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
-     mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
-     # nhat = nu * n(t) / (1 - nu)
-     nhat = nu * n[i] / (1.0 - nu)
-     # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
-     x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
- # evaluate candidate point
- score = objective(x[0], x[1])
- # report progress
- print('>%d f(%s) = %.5f' % (t, x, score))
- from math import sqrt
- from numpy import asarray
- from numpy.random import rand
- from numpy.random import seed
- # gradient descent algorithm with nadam
- def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
-     # generate an initial point
-     x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
-     score = objective(x[0], x[1])
-     # initialize decaying moving averages
-     m = [0.0 for _ in range(bounds.shape[0])]
-     n = [0.0 for _ in range(bounds.shape[0])]
-     # run the gradient descent
-     for t in range(n_iter):
-         # calculate gradient g(t)
-         g = derivative(x[0], x[1])
-         # build a solution one variable at a time
-         for i in range(bounds.shape[0]):
-             # m(t) = mu * m(t-1) + (1 - mu) * g(t)
-             m[i] = mu * m[i] + (1.0 - mu) * g[i]
-             # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
-             n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
-             # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
-             mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
-             # nhat = nu * n(t) / (1 - nu)
-             nhat = nu * n[i] / (1.0 - nu)
-             # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
-             x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
-         # evaluate candidate point
-         score = objective(x[0], x[1])
-         # report progress
-         print('>%d f(%s) = %.5f' % (t, x, score))
-     return [x, score]
- # seed the pseudo random number generator
- seed(1)
- # define range for input
- bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
- # define the total iterations
- n_iter = 50
- # step size
- alpha = 0.02
- # factor for average gradient
- mu = 0.8
- # factor for average squared gradient
- nu = 0.999
- # perform the gradient descent search with nadam
- best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
- # summarize the result
- print('Done!')
- print('f(%s) = %f' % (best, score))
- >40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001
- >41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001
- >42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001
- >43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001
- >44 f([-0.00011368 -0.00211861]) = 0.00000
- >45 f([-0.00011547 -0.00185511]) = 0.00000
- >46 f([-0.0001075 -0.00161122]) = 0.00000
- >47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000
- >48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000
- >49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000
- Done!
- f([-5.54299505e-05 -1.00116899e-03]) = 0.000001
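The per-variable loop in nadam() can also be expressed with NumPy vector operations, which is the generalization to all parameters mentioned earlier. The sketch below is one possible way to write it and makes some assumptions that differ from the tutorial code: the function name nadam_vectorized is hypothetical, the derivative takes the whole point as a single array, and the starting point is passed in explicitly instead of being drawn at random, so results will not match the log above.

```python
# A vectorized sketch of the same Nadam update, using NumPy arrays throughout.
# nadam_vectorized, the array-based derivative, and the fixed start are assumptions.
from numpy import asarray
from numpy import sqrt
from numpy import zeros_like

# gradient of f(x, y) = x^2 + y^2, taking the point as one array
def derivative(x):
    return 2.0 * x

def nadam_vectorized(derivative, x0, n_iter, alpha, mu, nu, eps=1e-8):
    x = asarray(x0, dtype=float).copy()
    # initialize decaying moving averages
    m = zeros_like(x)
    n = zeros_like(x)
    for _ in range(n_iter):
        # calculate gradient g(t) for all variables at once
        g = derivative(x)
        # m(t) = mu * m(t-1) + (1 - mu) * g(t)
        m = mu * m + (1.0 - mu) * g
        # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
        n = nu * n + (1.0 - nu) * g**2
        # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
        mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
        # nhat = nu * n(t) / (1 - nu)
        nhat = nu * n / (1.0 - nu)
        # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
        x = x - alpha / (sqrt(nhat) + eps) * mhat
    return x

start = asarray([0.5, -0.5])
best = nadam_vectorized(derivative, start, 50, 0.02, 0.8, 0.999)
print('f(%s) = %f' % (best, (best**2).sum()))
```

Because every operation is element-wise, each variable still receives its own adaptive step size, just as in the loop-based version.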
- from math import sqrt
- from numpy import asarray
- from numpy import arange
- from numpy import meshgrid
- from numpy.random import rand
- from numpy.random import seed
- from matplotlib import pyplot
- # gradient descent algorithm with nadam
- def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
-     solutions = list()
-     # generate an initial point
-     x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
-     score = objective(x[0], x[1])
-     # initialize decaying moving averages
-     m = [0.0 for _ in range(bounds.shape[0])]
-     n = [0.0 for _ in range(bounds.shape[0])]
-     # run the gradient descent
-     for t in range(n_iter):
-         # calculate gradient g(t)
-         g = derivative(x[0], x[1])
-         # build a solution one variable at a time
-         for i in range(bounds.shape[0]):
-             # m(t) = mu * m(t-1) + (1 - mu) * g(t)
-             m[i] = mu * m[i] + (1.0 - mu) * g[i]
-             # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
-             n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
-             # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
-             mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
-             # nhat = nu * n(t) / (1 - nu)
-             nhat = nu * n[i] / (1.0 - nu)
-             # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
-             x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
-         # evaluate candidate point
-         score = objective(x[0], x[1])
-         # store solution
-         solutions.append(x.copy())
-         # report progress
-         print('>%d f(%s) = %.5f' % (t, x, score))
-     return solutions
- # seed the pseudo random number generator
- seed(1)
- # define range for input
- bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
- # define the total iterations
- n_iter = 50
- # step size
- alpha = 0.02
- # factor for average gradient
- mu = 0.8
- # factor for average squared gradient
- nu = 0.999
- # perform the gradient descent search with nadam
- solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
- # sample input range uniformly at 0.1 increments
- xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
- yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
- # create a mesh from the axis
- x, y = meshgrid(xaxis, yaxis)
- # compute targets
- results = objective(x, y)
- # create a filled contour plot with 50 levels and jet color scheme
- pyplot.contourf(x, y, results, levels=50, cmap='jet')
- # plot the search path as white dots joined by lines
- solutions = asarray(solutions)
- pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
- # show the plot
- pyplot.show()
