The BFGS algorithm is one of the most widely used quasi-Newton algorithms. This note briefly explores why the update of the approximation of the inverse Hessian has the form that it has. I've done so because I have found that most texts lack this intuition and instead just state the update without any reasoning.
In the original papers (there were four separate ones) the approach was to look at the following minimization problem
\[ \underset{H_{k+1}\in\mathbb{R}^{n \times n}}{\text{minimize}} \|H_{k+1}^{-1} - H_{k}^{-1}\|_W\\ \text{subject to } H_{k+1}^{-1}\left(\nabla f(x_{k+1}) - \nabla f(x_k)\right) = \left(x_{k+1} - x_k\right)\ \text{ and }\ H_{k+1}^\top = H_{k+1} \]In short, they looked for the symmetric matrix that minimizes some weighted-norm distance to the previous approximation of the inverse Hessian while being constrained to satisfy the secant condition (described in the next section). In this note we will take a different approach, namely that of rank-2 updates.
A key component in the derivation of the BFGS update is the secant condition, which states that
\[ H_{k+1} s_k = y_k \]where \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\) and \(H_{k+1}\) is to be determined by the condition.
Now define a quadratic approximation of \(f\) around \(x_{k+1}\) as
\[ \tilde{f}(x) = f(x_{k+1}) + \nabla f(x_{k+1})^\top (x-x_{k+1}) + \frac{ 1 }{ 2 } (x - x_{k+1})^\top H_{k+1}(x - x_{k+1}), \]then
by construction we have that \(\nabla \tilde{f}(x_{k+1}) = \nabla f(x_{k+1})\)
while the secant condition ensures that also \(\nabla \tilde{f}(x_k) = \nabla f(x_k)\).
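To see this explicitly, differentiate the quadratic model and insert the secant condition: \[ \nabla \tilde{f}(x_k) = \nabla f(x_{k+1}) + H_{k+1}(x_k - x_{k+1}) = \nabla f(x_{k+1}) - H_{k+1}s_k = \nabla f(x_{k+1}) - y_k = \nabla f(x_k). \]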
In practical terms the secant condition states how to construct a "Hessian" at the point \(x_{k+1}\) such that the quadratic approximation of \(f\) with origin at \(x_{k+1}\), using said "Hessian", has a gradient equal to the gradient of \(f\) at \(x_k\) (and at \(x_{k+1}\) by construction). Visually, in one dimension the secant condition simply says that \(H_{k+1}\) is the slope of the secant line through \((x_k, \nabla f(x_k))\) and \((x_{k+1}, \nabla f(x_{k+1}))\), which is also where the condition gets its name.
The main idea behind the BFGS algorithm is a simple (symmetric) rank-2 update of a proposed Hessian (note that because \(H_k\) is a proposed Hessian it must be symmetric)
\[\begin{aligned} H_{k + 1} &= H_k + \alpha uu^\top + \beta vv^\top \\ &= H_k + \begin{bmatrix} \alpha u & \beta v \end{bmatrix}\begin{bmatrix} u^\top \\ v^\top \end{bmatrix}. \end{aligned}\]Given that \(H_{k+1}\) is simply \(H_k\) with a rank-\(2\) update we know from the Sherman-Morrison-Woodbury identity that it is possible to efficiently compute the inverse of \(H_{k+1}\) if the inverse of \(H_{k}\) is already known
\[ \left( H_{k + 1} \right)^{-1} = H_k^{-1} - H_{k}^{-1}\begin{bmatrix} \alpha u & \beta v \end{bmatrix} \left(I + \begin{bmatrix} u^\top \\ v^\top \end{bmatrix} H_{k}^{-1} \begin{bmatrix} \alpha u & \beta v \end{bmatrix} \right)^{-1}\begin{bmatrix} u^\top \\ v^\top \end{bmatrix}H_k^{-1}, \]note that we have here used that the inverses of the proposed Hessians are also symmetric.
In short, the above states that in practice we only need to compute a full inverse on the first iteration, while at any later iteration we can settle for updating it through the inversion of a \(2\times 2\) matrix and a few matrix-vector products. While the idea is simple, we still need to determine suitable \(\alpha, \beta, u\) and \(v\).
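To make this concrete, here is a minimal NumPy sketch of my own (the function and variable names are not from any particular library) of the generic rank-2 update of the inverse via the Sherman-Morrison-Woodbury identity:

```python
import numpy as np

def rank2_inverse_update(Hinv, alpha, u, beta, v):
    """Given Hinv = H_k^{-1}, return (H_k + alpha*u u^T + beta*v v^T)^{-1}
    via the Sherman-Morrison-Woodbury identity: only a 2x2 solve and a few
    matrix-vector products are needed."""
    U = np.column_stack([alpha * u, beta * v])   # n x 2 block [alpha*u, beta*v]
    W = np.column_stack([u, v])                  # n x 2 block whose transpose stacks u^T, v^T
    HinvU = Hinv @ U                             # n x 2
    C = np.eye(2) + W.T @ HinvU                  # the small 2 x 2 matrix to invert
    return Hinv - HinvU @ np.linalg.solve(C, W.T @ Hinv)
```

Note that the only work beyond a handful of matrix-vector and thin-matrix products is the solve with the \(2\times 2\) matrix.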
We start by imposing the secant condition
\[\begin{aligned} H_{k+1}s_k = H_ks_k + \alpha uu^\top s_k + \beta vv^\top s_k = y_k. \end{aligned}\]Since the middle expression has to be equal to \(y_k\) we can assume that one of the terms \(\alpha uu^\top s_k\) or \(\beta vv^\top s_k\) should be equal to \(y_k\) while the other should be equal to \(-H_ks_k\) (in order to counteract the \(+H_ks_k\) term). An obvious idea could therefore be to set \(u = y_k\) and \(v = H_ks_k\) and determine \(\alpha\) and \(\beta\) such that the expression is equal to \(y_k\). We start by inserting the choices of \(u\) and \(v\)
\[\begin{aligned} H_{k+1}s_k = H_ks_k + \underbrace{\alpha y_ky_k^\top s_k}_{\text{$\alpha$-term}} + \underbrace{\beta H_ks_ks_k^\top H_ks_k}_{\text{$\beta$-term}} = y_k. \end{aligned}\]From this we see that we need the \(\alpha\)-term to be equal to \(y_k\) and the \(\beta\)-term to be equal to \(-H_ks_k\). This can be achieved by setting
\[ \alpha = \frac{1}{y_k^\top s_k}, \quad \text{and} \quad \beta = -\frac{1}{s_k^\top H_ks_k}. \]From this it follows that \(H_{k+1}\) will have the form
\[ H_{k+1} = H_k + \frac{y_ky_k^\top }{y_k^\top s_k} - \frac{H_ks_ks_k^\top H_k}{s_k^\top H_ks_k}, \]which is also how it is stated in most texts. Note, however, that in practice we will never use the above relation but instead invert \(H_{k+1}\) directly using the Woodbury-based update from earlier, so that we spare the memory cost of actually storing \(H_k\) and \(H_{k+1}\).
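For reference, the direct update is trivial to write down (again a sketch of my own, mainly useful for testing the inverse update against; an actual implementation would avoid storing \(H_k\) altogether):

```python
import numpy as np

def bfgs_direct_update(H, s, y):
    """Direct BFGS update of the Hessian approximation H_k (reference only)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
```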
While inverting \(H_{k+1}\) through the rank-2 Woodbury update is easily stated, it is not completely obvious how to achieve it, since we need \(H_k\) when computing \(\beta\). However, in the following section we will show, through some tedious computations, that we actually do not need to compute \(\beta\) directly at all in order to use the inverse update described above.
Earlier we stated how the computation of the inverse could be reduced to a simple inversion of a \(2\times 2\) matrix and a few matrix-vector products. However, we later set \(v=H_ks_k\), meaning that we also needed to store \(H_k\) explicitly. Thankfully, if we insert our chosen values of \(\alpha,\beta, u\) and \(v\) into the expression for the inverse, we get something that does not explicitly depend on \(H_k\). The calculations are tedious and not necessarily informative; however, given that these are notes, I have chosen to include them anyway.
First, inserting these choices, we find that
\[\begin{aligned} \left( H_{k + 1} \right)^{-1} &= H_k^{-1} - \begin{bmatrix} \alpha H_{k}^{-1}y_k & \beta s_k \end{bmatrix} \left(I + \begin{bmatrix} y_k^\top \\ s_k^\top H_k \end{bmatrix}\begin{bmatrix} \alpha H_k^{-1}y_k & \beta s_k \end{bmatrix} \right)^{-1}\begin{bmatrix} y_k^\top H_k^{-1} \\ s_k^\top \end{bmatrix} \end{aligned}\]Now, for ease of notation, introduce \(\gamma_k = H_k^{-1}y_k\)
\[\begin{aligned} \left( H_{k + 1} \right)^{-1} &= H_k^{-1} - \begin{bmatrix} \alpha \gamma_k & \beta s_k \end{bmatrix} \left(I + \begin{bmatrix} \alpha y_k^\top \gamma_k & \beta y_k^\top s_k\\ \alpha s_k^\top y_k & \beta s_k^\top H_ks_k\end{bmatrix} \right)^{-1}\begin{bmatrix} \gamma_k^\top \\ s_k^\top \end{bmatrix} \\ &= H_k^{-1} - \begin{bmatrix} \alpha \gamma_k & \beta s_k \end{bmatrix} \left(I + \begin{bmatrix} \alpha y_k^\top \gamma_k & \beta \alpha^{-1}\\ 1 & -1\end{bmatrix} \right)^{-1}\begin{bmatrix} \gamma_k^\top \\ s_k^\top \end{bmatrix}, \end{aligned}\]where we used that \(\alpha s_k^\top y_k = 1\) and \(\beta s_k^\top H_ks_k = -1\). One could be tempted to stop here, as the expression does not appear to depend on \(H_k\). However, one needs to remember that \(\beta\) is defined using \(H_k\). We must therefore continue
\[\begin{aligned} \left( H_{k + 1} \right)^{-1} &= H_k^{-1} - \begin{bmatrix} \alpha \gamma_k & \beta s_k \end{bmatrix} \left(\begin{bmatrix} 1 + \alpha y_k^\top \gamma_k & \beta \alpha^{-1}\\ 1 & 0\end{bmatrix} \right)^{-1}\begin{bmatrix} \gamma_k^\top \\ s_k^\top \end{bmatrix} \\ &= H_k^{-1} - \begin{bmatrix} \alpha \gamma_k & \beta s_k \end{bmatrix} \frac{-\alpha}{\beta}\begin{bmatrix} 0 & -\beta \alpha^{-1}\\ -1 & 1 + \alpha y_k^\top \gamma_k\end{bmatrix} \begin{bmatrix} \gamma_k^\top \\ s_k^\top \end{bmatrix} \\ &= H_k^{-1} + \frac{\alpha}{\beta}\begin{bmatrix} \alpha \gamma_k & \beta s_k \end{bmatrix} \begin{bmatrix} -\beta \alpha^{-1}s_k^\top \\ -\gamma_k^\top + s_k^\top + \alpha y_k^\top \gamma_ks_k^\top \end{bmatrix}\\ &=H_k^{-1} + \frac{\alpha}{\beta}\left( -\beta \gamma_ks_k^\top - \beta s_k\gamma_k^\top + \beta s_ks_k^\top + \beta\alpha s_ky_k^\top \gamma_ks_k^\top \right)\\ &= H_k^{-1} + \left( -\alpha \gamma_ks_k^\top - \alpha s_k\gamma_k^\top + \alpha s_ks_k^\top + \alpha^2 s_ky_k^\top \gamma_ks_k^\top \right)\\ &= H_k^{-1} + (\alpha + \alpha^2y_k^\top \gamma_k)s_ks_k^\top - \alpha(\gamma_ks_k^\top + s_k\gamma_k^\top ). \end{aligned}\]As promised, we see that the update of the inverse does not explicitly rely on \(H_k\) (but only on \(H_k^{-1}\)). The expression is usually stated with \(\gamma_k = H_k^{-1}y_k\) reinserted. However, I find that a little misleading, given that a good implementation would compute a temporary variable like \(\gamma_k\) instead of performing multiple matrix-vector products. On a last note, I want to stress that here we can also clearly see that the update will remain symmetric if the initial matrix \(H_0\) is symmetric. The reason is simply that \((\alpha + \alpha^2y_k^\top \gamma_k)\) is just a scalar and \((\gamma_ks_k^\top + s_k\gamma_k^\top )\) is the sum of a matrix and its transpose, which is clearly symmetric.
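Translated into code, the final expression gives an update that touches only \(H_k^{-1}\). The sketch below (my own naming, assuming NumPy) computes \(\gamma_k\) once as a temporary, as argued above:

```python
import numpy as np

def bfgs_inverse_update(Hinv, s, y):
    """Update the inverse Hessian approximation without ever forming H_k."""
    alpha = 1.0 / (y @ s)
    gamma = Hinv @ y                        # gamma_k = H_k^{-1} y_k, the only matrix-vector product
    coeff = alpha + alpha**2 * (y @ gamma)  # scalar coefficient of s s^T
    return (Hinv
            + coeff * np.outer(s, s)
            - alpha * (np.outer(gamma, s) + np.outer(s, gamma)))
```

One matrix-vector product and a few outer products per iteration is all that is required, and the result stays symmetric whenever the input is.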