Line Search Methods

In line search methods, each iteration is given by \(x_{k+1} = x_k + \alpha_k p_k\), where \(p_k\) is the search direction and \(\alpha_k\) is the step length.

The search direction is often of the form \(p_k = -B_k^{-1} \nabla f_k\) where \(B_k\) is a symmetric and non-singular matrix. The form of \(p_k\) is dependent on algorithm choice.

The ideal step length would be the minimiser \(\alpha_k = \arg \min_{\alpha>0} f(x_k + \alpha p_k)\), but this is generally too expensive to calculate. Instead, an inexact line search condition such as the Wolfe conditions can be used:

\[\begin{split}f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f_k^T p_k\end{split}\]

where \(0<c_1<c_2<1\). Here, the first condition ensures that \(\alpha_k\) gives a sufficient decrease in \(f\), whilst the second condition rules out unacceptably short steps. [Nocedal]
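
As a minimal illustration, the Python sketch below checks whether a candidate step length satisfies both Wolfe conditions. The function name satisfies_wolfe, the values \(c_1 = 10^{-4}\) and \(c_2 = 0.9\) (typical defaults rather than values prescribed above), and the use of the Rosenbrock function are all illustrative assumptions:

    import numpy as np

    def satisfies_wolfe(f, grad, x_k, p_k, alpha, c1=1e-4, c2=0.9):
        """Return True if the step length alpha satisfies both Wolfe conditions."""
        g_k = grad(x_k)
        x_new = x_k + alpha * p_k
        # Sufficient decrease (Armijo) condition
        sufficient_decrease = f(x_new) <= f(x_k) + c1 * alpha * (g_k @ p_k)
        # Curvature condition: rules out unacceptably short steps
        curvature = grad(x_new) @ p_k >= c2 * (g_k @ p_k)
        return sufficient_decrease and curvature

    # Example: a steepest-descent direction on the Rosenbrock function
    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                               200 * (x[1] - x[0]**2)])
    x_k = np.array([-1.2, 1.0])
    p_k = -grad(x_k)
    print(satisfies_wolfe(f, grad, x_k, p_k, alpha=1e-4))  # True for this point and direction
    print(satisfies_wolfe(f, grad, x_k, p_k, alpha=1.0))   # False: the full step overshoots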

Steepest Descent (steepest_descent)

Simple method where search direction \(p_k\) is set to be \(-\nabla f_k\), i.e. the direction along which \(f\) decreases most rapidly.
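
A minimal Python sketch of the iteration is given below. For brevity it uses a simple backtracking (Armijo) line search rather than a full Wolfe line search; the constants and the quadratic test problem are illustrative assumptions:

    import numpy as np

    def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10000):
        """Minimise f by stepping along -grad(x) with a backtracking line search."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            p = -g                                  # steepest-descent direction
            alpha, c1, shrink = 1.0, 1e-4, 0.5
            # Backtrack until the sufficient-decrease condition holds
            while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
                alpha *= shrink
            x = x + alpha * p
        return x

    # Example: minimise the quadratic f(x) = 0.5 x^T A x - b^T x
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(steepest_descent(lambda x: 0.5 * x @ A @ x - b @ x,
                           lambda x: A @ x - b,
                           x0=[0.0, 0.0]))          # approaches the solution of A x = b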

Advantages:
  • Low storage requirements

  • Easy to compute

Disadvantages:
  • Slow convergence for nonlinear problems

[Nocedal]

Conjugate Gradient (conjugate_gradient)

Conjugate Gradient methods have a faster convergence rate than Steepest Descent but avoid the high computational cost of methods where the inverse Hessian is calculated.

Given an iterate \(x_0\), evaluate \(f_0 = f(x_0), \nabla f_0 = \nabla f(x_0)\).

Set \(p_0 \leftarrow - \nabla f_0, k \leftarrow 0\)

Then while \(\nabla f_k \neq 0\):

Carry out a line search to compute the next iterate \(x_{k+1} = x_k + \alpha_k p_k\), then evaluate \(\nabla f_{k+1}\) and use this to determine the subsequent conjugate direction \(p_{k+1} = - \nabla f_{k+1} + \beta_{k+1} p_k\).

Different variations of the Conjugate Gradient algorithm use different formulas for \(\beta_k\), for example:

Fletcher-Reeves: \(\beta_{k+1} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}\)

Polak-Ribière: \(\beta_{k+1} = \frac{ \nabla f_{k+1}^T ( \nabla f_{k+1} - \nabla f_k)}{\|\nabla f_k\|^2}\)

[Nocedal] [Poczos]
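
The Python sketch below follows the iteration described above with either choice of \(\beta_{k+1}\). The backtracking line search and the restart safeguard (reverting to the steepest-descent direction whenever the new direction fails to be a descent direction) are simplifications assumed here for brevity, not part of the algorithm as stated:

    import numpy as np

    def conjugate_gradient(f, grad, x0, beta_rule="PR", tol=1e-6, max_iter=2000):
        """Nonlinear conjugate gradient: 'FR' = Fletcher-Reeves, 'PR' = Polak-Ribiere."""
        x = np.asarray(x0, dtype=float)
        g = grad(x)
        p = -g
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            # Backtracking line search on the sufficient-decrease condition
            alpha, c1, shrink = 1.0, 1e-4, 0.5
            while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
                alpha *= shrink
            x = x + alpha * p
            g_new = grad(x)
            if beta_rule == "FR":
                beta = (g_new @ g_new) / (g @ g)         # Fletcher-Reeves
            else:
                beta = (g_new @ (g_new - g)) / (g @ g)   # Polak-Ribiere
            p = -g_new + beta * p
            if g_new @ p >= 0:                           # safeguard: restart with steepest descent
                p = -g_new
            g = g_new
        return x

    # Example: the Rosenbrock function
    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                               200 * (x[1] - x[0]**2)])
    print(conjugate_gradient(f, grad, x0=[-1.2, 1.0]))   # should approach the minimiser (1, 1)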

Advantages:
  • Considered to be one of the best general purpose methods.

  • Faster convergence rate than Steepest Descent, and only evaluations of the objective function and its gradient are required; no matrix operations are needed.

Disadvantages:
  • For the Fletcher-Reeves method it can be shown that if the method generates a bad direction and step, then the next direction and step are also likely to be bad. This is not the case with the Polak-Ribière method.

  • Generally, the Polak-Ribière method is more efficient than the Fletcher-Reeves method, but it has the disadvantage of requiring one more vector of storage.

[Nocedal]

BFGS (bfgs)

The most popular quasi-Newton method, which maintains an approximation to the inverse Hessian rather than computing the true Hessian used in a Newton line search method.

Starting with an initial inverse Hessian approximation \(H_0\) and starting point \(x_0\):

While \(\| \nabla f_k \| > \epsilon\):

Compute the search direction \(p_k = -H_k \nabla f_k\)

Then find next iterate \(x_{k+1}\) by performing a line search.

Next, define \(s_k = x_{k+1}-x_k\) and \(y_k = \nabla f_{k+1} - \nabla f_k\), then compute

\[H_{k+1} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\]

with \(\rho_k = \frac{1}{y_k^T s_k}\)
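
A compact Python sketch of the iteration is shown below. The backtracking line search is an assumed simplification (in practice a line search satisfying the Wolfe conditions is used so that \(y_k^T s_k > 0\) is guaranteed); here the update is simply skipped whenever that curvature condition fails:

    import numpy as np

    def bfgs(f, grad, x0, tol=1e-6, max_iter=500):
        """BFGS with the inverse Hessian update H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
        x = np.asarray(x0, dtype=float)
        n = x.size
        I = np.eye(n)
        H = np.eye(n)                           # initial inverse Hessian approximation H_0
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            p = -H @ g                          # search direction
            # Backtracking line search on the sufficient-decrease condition
            alpha, c1, shrink = 1.0, 1e-4, 0.5
            while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
                alpha *= shrink
            x_new = x + alpha * p
            g_new = grad(x_new)
            s, y = x_new - x, g_new - g
            if y @ s > 1e-12:                   # skip the update if the curvature condition fails
                rho = 1.0 / (y @ s)
                H = ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
                     + rho * np.outer(s, s))
            x, g = x_new, g_new
        return x

    # Example: the Rosenbrock function
    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                               200 * (x[1] - x[0]**2)])
    print(bfgs(f, grad, x0=[-1.2, 1.0]))        # should approach the minimiser (1, 1)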

Advantages:
  • Superlinear rate of convergence

  • Has self-correcting properties - if there is a bad estimate for \(H_k\), then it will tend to correct itself within a few iterations.

  • No need to compute the Jacobian or Hessian.

Disadvantages:
  • Newton’s method has quadratic convergence but this is lost with BFGS.

[Nocedal]

Gauss Newton (gauss_newton)

A modification of Newton's method with line search, applied to nonlinear least-squares problems of the form \(f(x) = \frac{1}{2}\|r(x)\|^2\), where \(r\) is the vector of residuals. Instead of solving the standard Newton equations

\[\nabla^2 f(x_k)p = -\nabla f(x_k),\]

solve the system

\[J_k^T J_k p_k^{GN} = - J_k^T r_k\]

(where \(J_k\) is the Jacobian of the residual vector \(r_k\)) to obtain the search direction \(p_k^{GN}\). The next iterate is then set as \(x_{k+1} = x_k + p_k^{GN}\).

Here, the approximation of the Hessian \(\nabla^2 f_k \approx J_k^T J_k\) has been made, which helps to save on computation time as second derivatives are not calculated.
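
The following Python sketch applies the iteration to a small made-up exponential fitting problem. Solving the system via a linear least-squares solve (rather than forming \(J_k^T J_k\) explicitly) is an implementation choice assumed here, and the full step \(x_{k+1} = x_k + p_k^{GN}\) follows the update given above:

    import numpy as np

    def gauss_newton(residual, jacobian, x0, tol=1e-8, max_iter=100):
        """Gauss-Newton: solve J^T J p = -J^T r for the search direction at each iterate."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            r = residual(x)
            J = jacobian(x)
            if np.linalg.norm(J.T @ r) < tol:   # gradient of 0.5*||r||^2 is J^T r
                break
            # Equivalent to the normal equations J^T J p = -J^T r when J has full rank
            p, *_ = np.linalg.lstsq(J, -r, rcond=None)
            x = x + p                           # full Gauss-Newton step
        return x

    # Example: fit y = a * exp(b * t) to synthetic, noise-free data
    t = np.linspace(0.0, 2.0, 20)
    y = 2.0 * np.exp(-1.0 * t)
    residual = lambda x: x[0] * np.exp(x[1] * t) - y
    jacobian = lambda x: np.column_stack((np.exp(x[1] * t),
                                          x[0] * t * np.exp(x[1] * t)))
    print(gauss_newton(residual, jacobian, x0=[1.0, -0.5]))   # should recover (2, -1)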

Advantages:
  • Calculation of second derivatives is not required.

  • If residuals or their second order partial derivatives are small, then \(J_k^T J_k\) is a close approximation to \(\nabla^2 f_k\) and convergence of Gauss-Newton is fast.

  • The search direction \(p_k^{GN}\) is always a descent direction as long as \(J_k\) has full rank and the gradient \(\nabla f_k\) is nonzero.

Disadvantages:
  • Without a good initial guess, or if the matrix \(J_k^T J_k\) is ill-conditioned, the Gauss Newton Algorithm is very slow to converge to a solution.

  • If the residuals are large, then the approximation \(\nabla^2 f_k \approx J_k^T J_k\) discards significant second-order information and convergence can be slow.

  • \(J_k\) must be full rank.

[Nocedal] [Floater]

Nocedal

Jorge Nocedal, Stephen J. Wright (2006), Numerical Optimization

Poczos

Barnabas Poczos, Ryan Tibshirani (2012), Lecture 10: Optimization, School of Computer Science, Carnegie Mellon University

Floater

Michael S. Floater (2018), Lecture 13: Non-linear least squares and the Gauss-Newton method, University of Oslo