Optimization

Gradient Descent

Batch Gradient Descent (BGD)

Let the data have $n$ observations, let the model have parameter $\theta$ and loss function $J(\theta)$, let the gradient at point $(x_i, y_i)$ be $\bigtriangledown_\theta J(\theta_t, x_i, y_i)$, and let the learning rate be $\eta$. Then we have the update formula:

$$\theta_{t+1}=\theta_t-\eta_t\sum_{i=1}^{n}\bigtriangledown_\theta J(\theta_t,x_i,y_i)$$

So every time we update the parameters we need to compute the gradient over the whole dataset, which makes this method slow for large datasets. But because each step uses the average gradient over all the data, the updates are stable and, for convex losses, converge to the global optimum.
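
As a minimal sketch (on an assumed toy linear-regression problem with squared-error loss; the names and hyperparameters are illustrative, not from the original post), batch gradient descent computes the gradient over the whole dataset before every update:

```python
import numpy as np

# Toy linear-regression data: y = 2*x + noise (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

theta, eta = np.zeros(1), 0.1
for t in range(200):
    # Gradient of the mean squared error over the *whole* dataset
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta -= eta * grad

print(theta)  # should be close to [2.0]
```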

Stochastic Gradient Descent (SGD)

Unlike BGD, each update uses only one observation to calculate the gradient:

$$\theta_{t+1}=\theta_t-\eta_t\bigtriangledown_\theta J(\theta_t,x_i,y_i)$$

SGD is fast, but the noisy updates can easily get stuck around a local optimum, which means the accuracy may decrease.
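
A minimal sketch on the same kind of toy problem (again, names and hyperparameters are illustrative), where each update uses a single randomly chosen observation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

theta, eta = np.zeros(1), 0.05
for t in range(2000):
    i = rng.integers(len(y))                   # one randomly chosen observation
    grad = 2.0 * X[i] * (X[i] @ theta - y[i])  # gradient of the single-sample loss
    theta -= eta * grad

print(theta)  # noisy, but close to [2.0]
```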

Mini-batch Gradient Descent (MBGD)

Combining BGD and SGD, it uses a subset of the data for each update. For each update we select a mini-batch of size $m$, where $m < n$, starting at observation $x$:

$$\theta_{t+1}=\theta_t-\eta_t\sum_{i=x}^{x+m-1}\bigtriangledown_\theta J(x_i,y_i,\theta_t)$$

When the dataset is small, we had better use BGD. For a large dataset, SGD or MBGD is usually preferred.
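
A minimal sketch of mini-batch gradient descent on the same illustrative toy problem, shuffling the data once per epoch and sweeping over it in batches of size $m$ (the batch size and epoch count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

theta, eta, m = np.zeros(1), 0.1, 16
for epoch in range(50):
    order = rng.permutation(len(y))        # shuffle once per epoch
    for start in range(0, len(y), m):
        batch = order[start:start + m]     # indices of the current mini-batch
        grad = 2.0 * X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
        theta -= eta * grad

print(theta)  # close to [2.0]
```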

Least Squares Method

Newton Method

For the target function $f(x)$, we write its Taylor expansion:

$$f(x) = f(x_0) + \bigtriangledown f(x_0)^T(x-x_0) + \frac{1}{2}(x-x_0)^T \bigtriangledown^2 f(x_0)(x-x_0) + \dots$$

Taking the gradient of both sides and ignoring the terms above the $\bigtriangledown^2$ term (third order and higher):

$$\bigtriangledown f(x) = \bigtriangledown f(x_0) + \bigtriangledown^2 f(x_0)(x-x_0)$$

We let $\bigtriangledown f(x) = 0$ and denote $\bigtriangledown^2 f(x_0)$ as the Hessian matrix $H$:

$$\bigtriangledown f(x_0) + H(x-x_0) = 0$$

$$x = x_0 - H^{-1} \bigtriangledown f(x_0)$$

So we start from $x_0$ and repeatedly compute $x_{k+1} = x_k - H_k^{-1} \bigtriangledown f(x_k)$ until $\bigtriangledown f(x_k)$ is almost equal to 0.
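
Below is a minimal NumPy sketch of this iteration on a simple convex quadratic (the test function, starting point, and tolerance are illustrative assumptions, not part of the derivation above):

```python
import numpy as np

# f(x) = x0^2 + 3*x1^2 + x0*x1, a simple convex test function (assumption)
def grad(x):
    return np.array([2 * x[0] + x[1], 6 * x[1] + x[0]])

def hessian(x):
    return np.array([[2.0, 1.0], [1.0, 6.0]])

x = np.array([5.0, -3.0])                    # starting point x_0
while np.linalg.norm(grad(x)) > 1e-8:
    # x_{k+1} = x_k - H_k^{-1} grad f(x_k); solve a linear system instead of inverting H
    x = x - np.linalg.solve(hessian(x), grad(x))

print(x)  # the minimum at (0, 0); one Newton step suffices for a quadratic
```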

Quasi-Newton Method

Calculating the Hessian matrix at every step costs a lot of computation, so there are many methods (such as BFGS and DFP) that build an approximation of the Hessian to run the loop instead.
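
As one illustration (my own addition, not something the post specifies), SciPy's BFGS implementation is a common quasi-Newton choice: it only needs the gradient and maintains its own approximation of the inverse Hessian.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2 + x[0] * x[1]

def grad(x):
    return np.array([2 * x[0] + x[1], 6 * x[1] + x[0]])

# BFGS builds an approximate inverse Hessian from gradient differences,
# so no explicit Hessian is required.
result = minimize(f, x0=np.array([5.0, -3.0]), jac=grad, method="BFGS")
print(result.x)  # close to (0, 0)
```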

AdaGrad

It adjusts the learning rate according to the gradient of each feature, avoiding a single shared learning rate that is hard to fit to all features.

Here is the update rule for the parameters:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\varepsilon I + diag(G_t)}} \cdot g_t$$

Where

  • $\theta$ is the parameter to be updated
  • $\eta$ is the initial learning rate
  • $\varepsilon$ is a small quantity used to avoid division by zero
  • $I$ is the identity matrix
  • $g_t$ is the gradient estimate at time step $t$:

$$g_t = \frac{1}{n} \sum_{i=1}^{n} \bigtriangledown_\theta J(x_i, y_i, \theta_t)$$

  • $G_t$ is the key matrix: the sum of the outer products of the gradients up to time step $t$:

    $$G_t = \sum_{\tau=1}^{t} g_{\tau} g_{\tau}^T$$

    Note that we could use the full $G_t$ matrix, but that is impractical, especially in high-dimensional problems. Hence we only keep the diagonal of $G_t$, whose inverse square root is much easier to compute.

Rewriting the AdaGrad update in matrix form, we take the inverse square root of the diagonal matrix $\varepsilon I + diag(G_t)$ and then multiply by $g_t$:

$$\theta_{t+1} = \theta_t - \eta \, (\varepsilon I + diag(G_t))^{-1/2} g_t$$

Comparing to SGD (stochastic gradient descent):

SGD uses the same learning rate for every dimension, while AdaGrad adapts the learning rate in each dimension using the squared gradients accumulated up to each iteration.

The AdaGrad algorithm performs best on sparse data because it decreases the learning rate faster for frequently updated parameters and more slowly for infrequent ones. Unfortunately, in some cases the effective learning rate decreases very fast because we accumulate the gradients from the very beginning of training. This can reach a point where the model no longer learns because the learning rate is almost zero.
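
Here is a minimal NumPy sketch of the diagonal AdaGrad update above, on an illustrative toy regression problem (all names and hyperparameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

theta = np.zeros(2)
eta, eps = 0.5, 1e-8
G_diag = np.zeros(2)                    # running sum of squared gradients (diagonal of G_t)

for t in range(500):
    g = 2.0 * X.T @ (X @ theta - y) / len(y)   # gradient estimate g_t
    G_diag += g * g                            # accumulate squared gradient per dimension
    theta -= eta / np.sqrt(eps + G_diag) * g   # per-dimension adaptive step

print(theta)  # close to [2.0, -1.0]
```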

Author: shixuan liu
Link: http://tedlsx.github.io/2019/10/17/optimization/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.