Line Search Methods

In line search methods, each iteration is given by \(x_{k+1} = x_k + \alpha_k p_k\), where \(p_k\) is the search direction and \(\alpha_k\) is the step length.

The search direction is often of the form \(p_k = -B_k^{-1} \nabla f_k\) where \(B_k\) is a symmetric and non-singular matrix. The form of \(p_k\) is dependent on algorithm choice.

The ideal step length would be the minimizer \(\alpha_k = \arg \min_{\alpha>0} f(x_k + \alpha p_k)\), but this is generally too expensive to compute exactly. Instead, an inexact line search condition such as the Wolfe conditions can be used:

\[\begin{split}f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f_k^T p_k\end{split}\]

with \(0 < c_1 < c_2 < 1\). Here, the first (sufficient decrease) condition ensures that \(\alpha_k\) gives a sufficient decrease in \(f\), whilst the second (curvature) condition rules out unacceptably short steps. [Nocedal]
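As a concrete illustration, the sufficient-decrease condition can be enforced by backtracking: start with a trial step and shrink it until the condition holds. The sketch below (in Python with NumPy) is a minimal version of this idea; the test function, the value of \(c_1\), and the shrink factor are illustrative choices, and the curvature condition is omitted for brevity.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, c1=1e-4, shrink=0.5):
    """Shrink alpha until the sufficient-decrease (first Wolfe) condition
    holds. The curvature condition is omitted; c1 and the shrink factor
    are common but illustrative choices."""
    alpha = 1.0
    fx = f(x)
    slope = grad_f(x) @ p          # directional derivative, < 0 for descent
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= shrink            # step too long: shrink it
    return alpha

# try it on f(x) = ||x||^2 with the steepest-descent direction
f = lambda v: v @ v
grad_f = lambda v: 2 * v
x = np.array([1.0, 2.0])
p = -grad_f(x)
alpha = backtracking_line_search(f, grad_f, x, p)
```

Because the slope \(\nabla f_k^T p_k\) is negative for a descent direction, the loop always terminates for sufficiently small \(\alpha\).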

Steepest Descent (steepest_descent)

A simple method in which the search direction \(p_k\) is set to \(-\nabla f_k\), i.e. the direction along which \(f\) decreases most rapidly.

  • Low storage requirements

  • Easy to compute

  • Slow convergence for nonlinear problems
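The whole method fits in a few lines. The sketch below uses a fixed step length as an illustrative simplification (in practice \(\alpha_k\) would come from a line search); the test function and tolerances are also illustrative.

```python
import numpy as np

def steepest_descent(grad_f, x0, alpha=0.05, tol=1e-8, max_iter=10000):
    """Minimal steepest-descent loop; the fixed step length alpha is an
    illustrative stand-in for a line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g          # p_k = -grad f_k
    return x

# minimise f(x, y) = x^2 + 10 y^2 (gradient [2x, 20y])
x_min = steepest_descent(lambda v: np.array([2 * v[0], 20 * v[1]]), [3.0, 1.0])
```

The poor scaling of this test function (curvature 10x larger in one coordinate) is exactly what makes steepest descent slow: the iterates zig-zag along the narrow valley.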


Conjugate Gradient (conjugate_gradient)

Conjugate Gradient methods have a faster convergence rate than Steepest Descent but avoid the high computational cost of methods where the inverse Hessian is calculated.

Given an iterate \(x_0\), evaluate \(f_0 = f(x_0), \nabla f_0 = \nabla f(x_0)\).

Set \(p_0 \leftarrow - \nabla f_0, k \leftarrow 0\)

Then while \(\nabla f_k \neq 0\):

Carry out a line search to compute the next iterate, then evaluate \(\nabla f_{k+1}\) and use this to determine the subsequent conjugate direction \(p_{k+1} = - \nabla f(x_{k+1}) + \beta_k p_k\)

Different variations of the Conjugate Gradient algorithm use different formulas for \(\beta_k\), for example:

Fletcher-Reeves: \(\beta_{k+1} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}\)

Polak-Ribiere: \(\beta_{k+1} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\|\nabla f_k\|^2}\)

[Nocedal] [Poczos]

  • Considered to be one of the best general purpose methods.

  • Faster convergence rate than Steepest Descent, and only requires evaluation of the objective function and its gradient - no matrix operations.

  • For the Fletcher-Reeves method it can be shown that if the method generates a bad direction and step, then the next direction and step are also likely to be bad. The Polak-Ribiere method does not suffer from this problem.

  • Generally, the Polak-Ribiere method is more efficient than the Fletcher-Reeves method, but it has the disadvantage of requiring one more vector of storage.
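The iteration above can be sketched as follows. This is an illustrative implementation, not a library routine: the backtracking step choice stands in for a proper Wolfe line search, and the clipping of \(\beta_k\) at zero (the common "PR+" variant) plus the restart safeguard are standard practical additions not spelled out in the text.

```python
import numpy as np

def conjugate_gradient_pr(f, grad_f, x0, tol=1e-8, max_iter=1000):
    """Nonlinear conjugate gradient with the Polak-Ribiere beta.

    The backtracking step choice and the PR+ restart safeguard are
    illustrative simplifications of a proper Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if g @ p >= 0:             # safeguard: restart with steepest descent
            p = -g
        # backtracking line search enforcing sufficient decrease only
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) > fx + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        # Polak-Ribiere beta, clipped at zero ("PR+")
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        p = -g_new + beta * p      # next conjugate direction
        x, g = x_new, g_new
    return x

# minimise f(x, y) = x^2 + 10 y^2
x_min = conjugate_gradient_pr(lambda v: v[0]**2 + 10 * v[1]**2,
                              lambda v: np.array([2 * v[0], 20 * v[1]]),
                              [3.0, 1.0])
```

Note that only vectors are stored and updated, in line with the low-memory appeal of the method.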


BFGS (bfgs)

The most popular quasi-Newton method, which uses an approximation to the Hessian rather than the true Hessian used in a Newton line search method.

Starting with an initial inverse Hessian approximation \(H_0\) and starting point \(x_0\):

While \(\| \nabla f_k \| > \epsilon\):

Compute the search direction \(p_k = -H_k \nabla f_k\)

Then find next iterate \(x_{k+1}\) by performing a line search.

Next, define \(s_k = x_{k+1}-x_k\) and \(y_k = \nabla f_{k+1} - \nabla f_k\), then compute

\[H_{k+1} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\]

with \(\rho_k = \frac{1}{y_k^T s_k}\)

  • Superlinear rate of convergence

  • Has self-correcting properties - if there is a bad estimate for \(H_k\), then it will tend to correct itself within a few iterations.

  • No need to compute the Jacobian or Hessian.

  • Newton’s method has quadratic convergence, but this rate is lost with BFGS.
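The loop above can be sketched as follows. This is an illustrative version: the backtracking step choice is a simplification of a Wolfe line search, so the update is skipped whenever the curvature condition \(y_k^T s_k > 0\) fails (a standard safeguard that keeps \(H_k\) positive definite); the test function is also illustrative.

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=500):
    """BFGS with the inverse-Hessian update from the text; backtracking
    stands in for a Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    I = np.eye(x.size)
    H = I.copy()                   # initial inverse-Hessian approximation
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                 # search direction
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) > fx + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:          # curvature condition keeps H positive definite
            rho = 1.0 / (y @ s)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# minimise f(x, y) = (x - 1)^2 + 2 (y + 2)^2, minimum at (1, -2)
x_min = bfgs(lambda v: (v[0] - 1)**2 + 2 * (v[1] + 2)**2,
             lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)]),
             [0.0, 0.0])
```

Since \(H_k\) stays positive definite, \(p_k = -H_k \nabla f_k\) is always a descent direction, which is what lets the backtracking loop terminate.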


Gauss Newton (gauss_newton)

A modified Newton’s method with line search for nonlinear least-squares problems, where \(f(x) = \frac{1}{2} \| r(x) \|^2\) for a residual vector \(r\). Instead of solving the standard Newton equations

\[\nabla^2 f(x_k)p = -\nabla f(x_k),\]

solve the system

\[J_k^T J_k p_k^{GN} = - J_k^T r_k\]

(where \(J_k\) is the Jacobian of the residual vector \(r_k\)) to obtain the search direction \(p_k^{GN}\). The next iterate is then set as \(x_{k+1} = x_k + p_k^{GN}\).

Here, the approximation of the Hessian \(\nabla^2 f_k \approx J_k^T J_k\) has been made, which helps to save on computation time as second derivatives are not calculated.

  • Calculation of second derivatives is not required.

  • If residuals or their second order partial derivatives are small, then \(J_k^T J_k\) is a close approximation to \(\nabla^2 f_k\) and convergence of Gauss-Newton is fast.

  • The search direction \(p_k^{GN}\) is always a descent direction as long as \(J_k\) has full rank and the gradient \(\nabla f_k\) is nonzero.

  • Without a good initial guess, or if the matrix \(J_k^T J_k\) is ill-conditioned, the Gauss-Newton algorithm can be very slow to converge to a solution.

  • If the residuals are large, the approximation \(\nabla^2 f_k \approx J_k^T J_k\) discards significant second-order information, and convergence may be slow or fail.

  • \(J_k\) must be full rank.

[Nocedal] [Floater]
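A minimal sketch of the iteration, applied to an exponential curve fit: each step solves the normal equations \(J_k^T J_k p_k^{GN} = -J_k^T r_k\) and takes a unit step, as in the text. The residual/Jacobian callables, the synthetic data, and the stopping rule are illustrative choices.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, tol=1e-10, max_iter=100):
    """Plain Gauss-Newton iteration: solve J^T J p = -J^T r each step.
    No line search is used; the step length is fixed at 1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        p = np.linalg.solve(J.T @ J, -J.T @ r)   # normal equations
        x = x + p
        if np.linalg.norm(p) < tol:
            break
    return x

# fit y = a * exp(b * t) to noiseless synthetic data; unknowns are (a, b)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * t)

def residual(x):
    a, b = x
    return a * np.exp(b * t) - y               # r(x): model minus data

def jacobian(x):
    a, b = x
    # columns are dr/da and dr/db
    return np.column_stack([np.exp(b * t), a * t * np.exp(b * t)])

params = gauss_newton(residual, jacobian, [1.5, 0.4])
```

Because the data here are noiseless, the residuals at the solution are zero, which is the favourable regime for Gauss-Newton; with large residuals or a poor starting point the plain iteration can diverge.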


[Nocedal] Jorge Nocedal, Stephen J. Wright (2006), Numerical Optimization

[Poczos] Barnabas Poczos, Ryan Tibshirani (2012), Lecture 10: Optimization, School of Computer Science, Carnegie Mellon University

[Floater] Michael S. Floater (2018), Lecture 13: Non-linear least squares and the Gauss-Newton method, University of Oslo