and why its easier than people make it out to be
In many engineering applications we are interested in solving equality constrained optimization problems where the optimization variable ($\theta$) implicitly defines another variable ($u$) that then appears in the objective function
Most methods for solving equality constrained optimization problems requires the computation of gradient of the objective function with respect to the optimization variable $\theta$. This is where the problem comes in: Naively computing the gradient requires the computation of the sensitivities of the implicitly defined variable $u$ with respect to the optimization variable $\theta$ $\left(\text{i.e.}\ \frac{\mathrm{d}u}{\mathrm{d}\theta} \in \mathbb{R}^{n_u \times n_\theta}\right)$. This result in a computational bottleneck as forming the sensitivities scales as $\mathcal{O}(n_un_\theta)$, meaning that adding a new parameter adds $n_u$ additional sensitivities (and one adding one more variable would add $n_\theta$ additional sensitivities).
To resolve this computational bottleneck we can make use the adjoint method. The first step in the derivation of the adjoint method is to introduce the Lagrangian of the objective function, i.e.
\[\begin{equation} \mathcal{L}(u,\theta,\lambda) = L(u,\theta) + \lambda^\top f(u,\theta). \end{equation}\]It is easy to see that whenever $u$ satisfy the equality constraint $f(u,\theta)=0$, then the Lagrangian is equal to the original objective function. This in turn mean that the two gradients $\nabla_\theta L(u,\theta,\lambda)$ and $\nabla_\theta \mathcal{L}(u,\theta,\lambda)$ are equal whenever $u$ satisfy the equality constraint $f(u,\theta) = 0$. The reason why the introduction of the additional term in the Lagrangian is useful is that $\lambda$ can be set to be anything. In particular, we aim to set in a way so that we can avoid computing the otherwise computational expensive sensitivities $\frac{\mathrm{d}u}{\mathrm{d}\theta}$. We start by computing the total derivative of the Lagrangian with respect to $\theta$
Now we collect the terms that depend on the undesired $\frac{\mathrm{d}u}{\mathrm{d}\theta}$ result in
\[\begin{equation} \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\partial L}{\partial\theta} + \lambda^\top\frac{\partial f}{\partial\theta} + \left(\frac{\partial L}{\partial u} + \lambda^\top\frac{\partial f}{\partial u}\right)\frac{\mathrm{d}u}{\mathrm{d}\theta} . \end{equation}\]As we can choose $\lambda$ freely a natural idea is to set it in a way such that the term in front of the undesired term vanishes. This means that we can choose $\lambda$ as the solution to the equation
\[\begin{equation} \frac{\partial L}{\partial u} + \lambda^\top\frac{\partial f}{\partial u} = 0 \quad \Rightarrow \quad \lambda^\top = -\frac{\partial L}{\partial u}\left(\frac{\partial f}{\partial u}\right)^{-1}. \end{equation}\]Inserting $\lambda^\top$ back into the equation we find that the gradient of the Lagrangian with respect to $\theta$ is given by
\[\begin{equation} \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\partial L}{\partial\theta} \underbrace{- \frac{\partial L}{\partial u}\left(\frac{\partial f}{\partial u}\right)^{-1}}_{\lambda^\top}\frac{\partial f}{\partial\theta} \left(= \frac{\mathrm{d}L}{\mathrm{d}\theta}\right). \end{equation}\]To conclude: The adjoint method is a simple way to avoid the computational bottleneck of computing the sensitivities $\frac{\mathrm{d}u}{\mathrm{d}\theta}$ by cleverly computing the so-called adjoint variable $\lambda$ in a way that eliminates the need to compute the problematic sensitivities.
In order to illustrate the adjoint method we consider a simple linearly constrained problem of the form (inspiration from problem 38 in
where $A(\theta) \in \mathbb{R}^{n\times n}$ is a symmetric tridiagonal matrix that depends on the parameters $\theta \in \mathbb{R}^{2n-1}$ as
\[A = \begin{bmatrix} \theta_1 & \theta_{n+1} & & & 0 \\ \theta_{n+1} & \theta_2 & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & \theta_{n-1} & \theta_{2n-1} \\ 0 & & & \theta_{2n-1} & \theta_n \end{bmatrix}.\]Now in order to compute the gradient of interested we start by computing the adjoint variable $\lambda$ as
\[\lambda^\top = -\frac{\partial L}{\partial u}\left(\frac{\partial f}{\partial u}\right)^{-1} = -2\left(c^\top u\right)c^\top A(\theta)^{-1}.\]Note that in practice we do not form $A(\theta)^{-1}$ explicitly but rather compute $\lambda$ by solving the linear system $ A(\theta)^\top\lambda = -2c\left(c^\top u\right)$. Using the adjoint variable we can compute the gradient of $L$ with respect to $\theta$ as
\[\frac{\mathrm{d}L}{\mathrm{d}\theta} = \underbrace{\frac{\partial L}{\partial\theta}}_{=0} + \lambda^\top\frac{\partial f}{\partial\theta} = \lambda^\top\frac{\partial A}{\partial\theta}u.\]In the concrete case of the derivatives of $A$ with respect to $\theta_i$ we have that
\[\frac{\partial A}{\partial \theta_i} = \begin{bmatrix} \delta_{i,1} & \delta_{i,n+1} & & & 0 \\ \delta_{i,n+1} & \delta_{i,2} & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & \delta_{i,n-1} & \delta_{i,2n-1} \\ 0 & & & \delta_{i,2n-1} & \delta_{i,n} \end{bmatrix}.\]Using this it follows that the $i$th component of the gradient can be computed as
\[\lambda^\top\frac{\partial A}{\partial \theta_i}u = \begin{cases} \lambda_i u_i, \quad &i \leq n, \\ \lambda_{i+1-n}u_{i-n} + \lambda_{i-n} u_{i+1-n}, \quad &i > n. \end{cases}\]using LinearAlgebra
n = 1000 # Number of linear constraints
c = rand(n) # Objective vector
b = rand(n) # Linear equality vector
θ = rand(2n-1) # Parameters of the equality constraint
# Parametrizing linear equality constraint matrix
A(θ) = SymTridiagonal(θ[1:n], θ[n+1:end])
# Objective function
f(θ) = (c' * (A(θ) \ b))^2
# Partial derivatives
partial(i,n,λ,u) = i <= n ? λ[i]*u[i] : λ[i+1-n]*u[i-n] + λ[i-n]*u[i+1-n]
# Gradient computation
function ∇f(θ)
n = length(b)
M = A(θ) # Defining constraint matrix
u = M \ b # Computing solution
λ = M \ (-2*c*(c'*u)) # Adjoint variables
return [partial(i,n,λ,u) for i = 1:2n-1]
end
# Testing against ForwardDiff
using Test, ForwardDiff
@test ForwardDiff.gradient(f,θ) ≈ ∇f(θ)