# Line Search Methods

In line search methods, each iteration is given by $$x_{k+1} = x_k + \alpha_k p_k$$, where $$p_k$$ is the search direction and $$\alpha_k$$ is the step length.

The search direction often takes the form $$p_k = -B_k^{-1} \nabla f_k$$, where $$B_k$$ is a symmetric, non-singular matrix. The form of $$p_k$$ depends on the choice of algorithm.

The ideal step length would be the minimizer of $$f(x_k + \alpha p_k)$$ over $$\alpha > 0$$, but this is generally too expensive to calculate. Instead, an inexact line search condition such as the Wolfe conditions can be used:

$\begin{split}f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f_k^T p_k\end{split}$

with $$0<c_1<c_2<1$$. Here, the first condition ensures that $$\alpha_k$$ gives a sufficient decrease in $$f$$, whilst the second (curvature) condition rules out unacceptably short steps. [Nocedal]
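As an illustration, here is a minimal NumPy sketch of a check for these conditions, assuming hypothetical callables `f` and `grad` for the objective and its gradient (the default values of `c1` and `c2` below are common choices, not fixed by the method):

```python
import numpy as np

def satisfies_wolfe(f, grad, x_k, p_k, alpha, c1=1e-4, c2=0.9):
    """Check both Wolfe conditions for a candidate step length alpha.

    f and grad are assumed callables for the objective and its gradient;
    c1=1e-4 and c2=0.9 are common default choices.
    """
    g_k = grad(x_k)
    # Sufficient decrease condition
    sufficient_decrease = f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * (g_k @ p_k)
    # Curvature condition: rules out unacceptably short steps
    curvature = grad(x_k + alpha * p_k) @ p_k >= c2 * (g_k @ p_k)
    return sufficient_decrease and curvature
```

A line search routine would try candidate values of $$\alpha$$ until this check passes.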

## Steepest Descent (steepest_descent)

A simple method where the search direction $$p_k$$ is set to $$-\nabla f_k$$, i.e. the direction along which $$f$$ decreases most rapidly.
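A minimal sketch of the resulting iteration, assuming a hypothetical `line_search` helper that returns a step length satisfying, e.g., the Wolfe conditions above:

```python
import numpy as np

def steepest_descent(f, grad, x0, line_search, tol=1e-8, max_iter=1000):
    """Steepest descent sketch. `line_search` is an assumed helper
    returning a suitable step length alpha for the given direction."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -g                          # direction of steepest descent
        alpha = line_search(f, grad, x, p)
        x = x + alpha * p
    return x
```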

• Low storage requirements

• Easy to compute

• Slow convergence for nonlinear problems

[Nocedal]

## Conjugate Gradient (conjugate_gradient)

Conjugate Gradient methods have a faster convergence rate than Steepest Descent but avoid the high computational cost of methods where the inverse Hessian is calculated.

Given an iterate $$x_0$$, evaluate $$f_0 = f(x_0), \nabla f_0 = \nabla f(x_0)$$.

Set $$p_0 \leftarrow - \nabla f_0, k \leftarrow 0$$

Then while $$\nabla f_k \neq 0$$:

Carry out a line search to compute the next iterate, then evaluate $$\nabla f_{k+1}$$ and use this to determine the subsequent conjugate direction $$p_{k+1} = - \nabla f_{k+1} + \beta_{k+1} p_k$$.

Different variations of the Conjugate Gradient algorithm use different formulas for $$\beta_k$$, for example:

Fletcher-Reeves: $$\beta_{k+1} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}$$

Polak-Ribiere: $$\beta_{k+1} = \frac{ \nabla f_{k+1}^T ( \nabla f_{k+1} - \nabla f_k)}{\|\nabla f_k\|^2}$$
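A sketch of the full iteration using the Polak-Ribiere formula, with the same hypothetical `line_search` helper as before:

```python
import numpy as np

def nonlinear_cg(f, grad, x0, line_search, tol=1e-8, max_iter=1000):
    """Nonlinear conjugate gradient sketch with the Polak-Ribiere beta."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                    # p_0 = -grad f_0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = line_search(f, grad, x, p)
        x = x + alpha * p
        g_new = grad(x)
        beta = g_new @ (g_new - g) / (g @ g)  # Polak-Ribiere formula
        p = -g_new + beta * p                 # next conjugate direction
        g = g_new
    return x
```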

• Considered to be one of the best general purpose methods.

• Faster convergence rate than Steepest Descent, and only requires evaluation of the objective function and its gradient; no matrix operations are needed.

• For the Fletcher-Reeves method it can be shown that if the method generates a bad direction and step, then the next direction and step are also likely to be bad. This is not the case with the Polak-Ribiere method.

• Generally, the Polak-Ribiere method is more efficient than the Fletcher-Reeves method, but it has the disadvantage of requiring one more vector of storage.

[Nocedal]

## BFGS (bfgs)

The most popular quasi-Newton method. Rather than computing the true Hessian, as in a Newton line search method, BFGS maintains an approximation $$H_k$$ to the inverse Hessian, updated at each iteration.

Starting with an initial inverse Hessian approximation $$H_0$$ and starting point $$x_0$$:

While $$\| \nabla f_k \| > \epsilon$$:

Compute the search direction $$p_k = -H_k \nabla f_k$$

Then find the next iterate $$x_{k+1} = x_k + \alpha_k p_k$$ by performing a line search.

Next, define $$s_k = x_{k+1}-x_k$$ and $$y_k = \nabla f_{k+1} - \nabla f_k$$, then compute

$H_{k+1} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$

with $$\rho_k = \frac{1}{y_k^T s_k}$$
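Putting these steps together, a minimal BFGS sketch (with the same hypothetical `line_search` helper as before, and the common choice $$H_0 = I$$ assumed as the starting approximation):

```python
import numpy as np

def bfgs(f, grad, x0, line_search, tol=1e-8, max_iter=500):
    """BFGS sketch maintaining H, the inverse Hessian approximation."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    I = np.eye(n)
    H = I.copy()                        # H_0 = I, a common starting choice
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                      # search direction p_k = -H_k grad f_k
        alpha = line_search(f, grad, x, p)
        x_new = x + alpha * p
        g_new = grad(x_new)
        s = x_new - x                   # s_k = x_{k+1} - x_k
        y = g_new - g                   # y_k = grad f_{k+1} - grad f_k
        rho = 1.0 / (y @ s)
        # H_{k+1} = (I - rho s y^T) H_k (I - rho y s^T) + rho s s^T
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x
```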

• Superlinear rate of convergence

• Has self-correcting properties: if there is a bad estimate for $$H_k$$, it will tend to correct itself within a few iterations.

• No need to compute the Hessian; only gradient evaluations are required.

• Newton’s method has quadratic convergence but this is lost with BFGS.

[Nocedal]

## Gauss Newton (gauss_newton)

A modification of Newton's method with line search, for nonlinear least squares problems where $$f(x) = \frac{1}{2}\|r(x)\|^2$$ for a vector of residuals $$r(x)$$. Instead of solving the standard Newton equations

$\nabla^2 f(x_k)p = -\nabla f(x_k),$

solve the system

$J_k^T J_k p_k^{GN} = - J_k^T r_k$

(where $$r_k = r(x_k)$$ and $$J_k$$ is the Jacobian of $$r$$ at $$x_k$$) to obtain the search direction $$p_k^{GN}$$. The next iterate is then set as $$x_{k+1} = x_k + p_k^{GN}$$.

Here, the approximation of the Hessian $$\nabla^2 f_k \approx J_k^T J_k$$ has been made, which helps to save on computation time as second derivatives are not calculated.
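A minimal Gauss-Newton sketch, assuming hypothetical callables `residual` and `jacobian` that return $$r_k$$ and $$J_k$$ respectively (taking the full step, as described above):

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, tol=1e-8, max_iter=100):
    """Gauss-Newton sketch for minimizing 0.5 * ||r(x)||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        g = J.T @ r                     # gradient of 0.5 * ||r||^2
        if np.linalg.norm(g) < tol:
            break
        # Solve J^T J p = -J^T r for the Gauss-Newton direction
        p = np.linalg.solve(J.T @ J, -g)
        x = x + p                       # full step x_{k+1} = x_k + p
    return x
```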

• Calculation of second derivatives is not required.

• If residuals or their second order partial derivatives are small, then $$J_k^T J_k$$ is a close approximation to $$\nabla^2 f_k$$ and convergence of Gauss-Newton is fast.

• The search direction $$p_k^{GN}$$ is always a descent direction as long as $$J_k$$ has full rank and the gradient $$\nabla f_k$$ is nonzero.

• Without a good initial guess, or if the matrix $$J_k^T J_k$$ is ill-conditioned, the Gauss-Newton algorithm can be very slow to converge to a solution.

• If the residuals are large, the second-order term neglected in the approximation $$\nabla^2 f_k \approx J_k^T J_k$$ is significant, and convergence may be slow or fail.

• $$J_k$$ must have full rank.

[Nocedal] Jorge Nocedal, Stephen J. Wright (2006), Numerical Optimization.

[Poczos] Barnabas Poczos, Ryan Tibshirani (2012), Lecture 10: Optimization, School of Computer Science, Carnegie Mellon University.

[Floater] Michael S. Floater (2018), Lecture 13: Non-linear least squares and the Gauss-Newton method, University of Oslo.