Backpropagation is the Adjoint Method

Backpropagation is often introduced as something developed within the field of machine learning. In reality, however, backpropagation is just a special case of the adjoint method in which the inversion of the constraint Jacobian can be done efficiently due to its lower-triangular structure. This note is in two parts. The first part reviews the adjoint method, while the second part describes how backpropagation is a special case of the adjoint method with a structure that results in scalable computational algorithms.

A key insight is that while the adjoint method is seemingly limited to equality-constrained problems, it turns out that all computer programs are trivially equality-constrained problems, with equalities coming from each assignment (every line of code is essentially "$x = f(\dots)$", which can be rewritten as "$x - f(\dots) = 0$"). In this sense, as shown in a later section, optimizing a neural network is actually an equality-constrained optimization problem with constraints coming from each layer of the network. This is opposed to how optimizing a neural network is usually presented, namely as an unconstrained optimization problem.
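
To make this concrete, here is an illustrative two-line program written side by side with its equivalent equality constraints (the names $f_1$, $f_2$ are placeholders, not functions from this note):

```latex
\begin{aligned}
\texttt{u1 = f1(theta)} &\quad\Longleftrightarrow\quad u_1 - f_1(\theta) = 0,\\
\texttt{u2 = f2(u1)}    &\quad\Longleftrightarrow\quad u_2 - f_2(u_1) = 0.
\end{aligned}
```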

The Adjoint Method

In many engineering applications we are interested in solving equality-constrained optimization problems where the optimization variable ($\theta \in \mathbb{R}^p$) implicitly defines another variable ($u \in \mathbb{R}^n$) that then appears in the objective function (a famous example of this is Topology Optimization). Written out, this means solving optimization problems of the following form

$$
\min_{\theta}\; f(u, \theta) \quad \text{subject to} \quad g(u, \theta) = 0,
$$

where the equality constraint $g(u, \theta) = 0$ implicitly defines $u$ as a function of $\theta$.

Most methods for solving equality-constrained optimization problems require the computation of the gradient of the objective function with respect to the optimization variable $\theta$. This is where the problem comes in: naively computing the gradient requires the sensitivities $\frac{\partial u}{\partial \theta}$ of the implicitly defined variable $u$ with respect to the optimization variable $\theta$. This results in a computational bottleneck, as forming the sensitivities scales as $O(np)$, meaning that adding a new parameter adds $n$ additional sensitivities (and similarly, adding one more variable adds $p$ additional sensitivities).

To resolve this computational bottleneck the adjoint method was introduced. The first step in the derivation of the adjoint method is to introduce the Lagrangian of the objective function, i.e.

$$
\mathcal{L}(u, \theta, \lambda) = f(u, \theta) + \lambda^\top g(u, \theta).
$$

It is easy to see that whenever $(u, \theta)$ satisfy the equality constraint $g(u, \theta) = 0$, the Lagrangian is equal to the original objective function. This in turn means that the two gradients $\nabla_\theta \mathcal{L}$ and $\nabla_\theta f$ are equal whenever $(u, \theta)$ satisfy the equality constraint. The reason why the introduction of the additional term $\lambda^\top g(u, \theta)$ in the Lagrangian is useful is that $\lambda$ can be set to be anything. In particular, we aim to set $\lambda$ in a way that avoids computing the otherwise computationally expensive sensitivities $\frac{\partial u}{\partial \theta}$. We start by computing the total derivative of the Lagrangian with respect to $\theta$ (the total derivative is the transpose of the gradient, i.e. $\frac{\mathrm{d}f}{\mathrm{d}\theta} = (\nabla_\theta f)^\top$, so that multiplying a change in $\theta$ by the total derivative gives the corresponding change in the function $f$):

$$
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\partial f}{\partial \theta} + \frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta} + \lambda^\top\left(\frac{\partial g}{\partial \theta} + \frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta}\right).
$$

Now we collect the terms that depend on the undesired sensitivities $\frac{\partial u}{\partial \theta}$, resulting in

$$
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\partial f}{\partial \theta} + \lambda^\top\frac{\partial g}{\partial \theta} + \left(\frac{\partial f}{\partial u} + \lambda^\top\frac{\partial g}{\partial u}\right)\frac{\partial u}{\partial \theta}.
$$

As we can choose $\lambda$ freely, a natural idea is to set it in a way such that the term in front of the undesired sensitivities vanishes. This means that we can choose $\lambda$ as the solution to the equation

$$
\left(\frac{\partial g}{\partial u}\right)^\top \lambda = -\left(\frac{\partial f}{\partial u}\right)^\top.
$$

Inserting $\lambda$ back into the equation we find that the gradient of the Lagrangian with respect to $\theta$ is given by

$$
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\partial f}{\partial \theta} + \lambda^\top\frac{\partial g}{\partial \theta}.
$$

To conclude: the adjoint method is a simple way to avoid the computational bottleneck of computing the sensitivities $\frac{\partial u}{\partial \theta}$ by cleverly computing the so-called adjoint variable $\lambda$ in a way that eliminates the need for the problematic sensitivities.
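
As a minimal sketch of the recipe, consider a made-up toy problem (for illustration only, not from the sections below): minimize $f(u) = u^\top u$ subject to $\operatorname{Diag}(\theta)u = b$, for which the adjoint gradient can be checked against the closed-form answer.

```julia
using LinearAlgebra

# Toy problem: minimize f(u) = u'u subject to Diagonal(θ)*u = b
θ = [1.0, 2.0, 3.0]
b = [1.0, 1.0, 1.0]

A = Diagonal(θ)
u = A \ b                                   # forward solve for the implicit variable
λ = A' \ (-2u)                              # adjoint solve: (∂g/∂u)'λ = -(∂f/∂u)'
g_adj = [λ[i]*u[i] for i in eachindex(θ)]   # df/dθᵢ = λ'(∂A/∂θᵢ)u = λᵢuᵢ

# Closed form for comparison: f(θ) = Σᵢ bᵢ²/θᵢ², so df/dθᵢ = -2bᵢ²/θᵢ³
g_ref = [-2*b[i]^2/θ[i]^3 for i in eachindex(θ)]
```

Note that the adjoint method costs a single extra linear solve, independent of the number of parameters, which is exactly the point.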

Example: Linearly constrained problem with structure

In order to illustrate the adjoint method we consider a simple linearly constrained problem of the form (inspired by problem 38 in [1])

$$
\min_{\theta \in \mathbb{R}^{2n-1}}\; f(u) = (c^\top u)^2 \quad \text{subject to} \quad A(\theta)u = b,
$$

where $A(\theta) \in \mathbb{R}^{n \times n}$ is a symmetric tridiagonal matrix that depends on the parameters $\theta$ as

$$
A(\theta) = \begin{bmatrix}
\theta_1 & \theta_{n+1} & & \\
\theta_{n+1} & \theta_2 & \ddots & \\
& \ddots & \ddots & \theta_{2n-1}\\
& & \theta_{2n-1} & \theta_n
\end{bmatrix},
$$

i.e. $\theta_1, \dots, \theta_n$ make up the diagonal and $\theta_{n+1}, \dots, \theta_{2n-1}$ the off-diagonals.

Now, in order to compute the gradient of interest we start by computing the adjoint variable $\lambda$ from

$$
A(\theta)^\top \lambda = -\left(\frac{\partial f}{\partial u}\right)^\top = -2c\,(c^\top u),
$$

where $u = A(\theta)^{-1}b$ and we used that $g(u, \theta) = A(\theta)u - b$, so that $\frac{\partial g}{\partial u} = A(\theta)$.

Note that in practice we do not form $A(\theta)^{-1}$ explicitly but rather compute $\lambda$ by solving the linear system $A(\theta)\lambda = -2c\,(c^\top u)$ (recall that $A(\theta)$ is symmetric). Using the adjoint variable we can compute the gradient of $f$ with respect to $\theta_i$ as

$$
\frac{\mathrm{d}f}{\mathrm{d}\theta_i} = \lambda^\top \frac{\partial A(\theta)}{\partial \theta_i}\, u.
$$

In the concrete case of the derivatives of $A(\theta)$ with respect to $\theta_i$ we have that

$$
\frac{\partial A(\theta)}{\partial \theta_i} =
\begin{cases}
e_i e_i^\top, & i \le n,\\
e_{i-n} e_{i-n+1}^\top + e_{i-n+1} e_{i-n}^\top, & i > n,
\end{cases}
$$

where $e_i$ denotes the $i$'th canonical basis vector.

Using this it follows that the $i$'th component of the gradient can be computed as

$$
\frac{\mathrm{d}f}{\mathrm{d}\theta_i} =
\begin{cases}
\lambda_i u_i, & i \le n,\\
\lambda_{i-n} u_{i-n+1} + \lambda_{i-n+1} u_{i-n}, & i > n.
\end{cases}
$$

Below is a Julia implementation of the above example along with a test against ForwardDiff.jl to verify the correctness of the implementation.

using LinearAlgebra 
n = 1000 # Number of linear constraints
c = rand(n) # Objective vector
b = rand(n) # Linear equality vector
θ = rand(2n-1) # Parameters of the equality constraint

# Parametrizing linear equality constraint matrix
A(θ) = SymTridiagonal(θ[1:n], θ[n+1:end])

# Objective function
f(θ) = (c' * (A(θ) \ b))^2

# Partial derivatives
partial(i,n,λ,u) = i <= n ? λ[i]*u[i] :
    λ[i+1-n]*u[i-n] + λ[i-n]*u[i+1-n]

# Gradient computation
function ∇f(θ)
    n = length(b)
    M = A(θ)              # Defining constraint matrix
    u = M \ b             # Computing solution (forward solve)
    λ = M \ (-2*c*(c'*u)) # Adjoint variables (adjoint solve)
    return [partial(i,n,λ,u) for i = 1:2n-1]
end

# Testing against ForwardDiff
using Test, ForwardDiff
@test ForwardDiff.gradient(f,θ) ≈ ∇f(θ)

Backpropagation and the Adjoint Method

Section 5.5 of [2] includes an example of how the adjoint method and backpropagation of neural networks are similar. In neural networks we are often interested in minimizing a loss function of the form

$$
\min_{p}\; f(u_N(u_0; p)) + R(p),
$$

with $p$ denoting the parameters of the network. The notation "$u_N(u_0; p)$" is used in order to highlight that $u_0$ is a constant input (i.e. most often the "data"), and $R(p)$ is some regularization function (e.g. $\|p\|_2^2$). Furthermore, $u_N$ is a neural network with $N$ layers that, given the constant input $u_0$, maps the parameters $p$ to an output. Now, the above objective function does not include any equality constraints. However, one should realize that an $N$-layer neural network is nothing more than a series of compositions of functions

$$
u_N(u_0; p) = \Phi_N(\Phi_{N-1}(\cdots \Phi_1(u_0; p_1) \cdots; p_{N-1}); p_N).
$$

Now, using the notation that $u_k$ is the output of the $k$'th layer of the network, we can write the propagation through the network as a large nonlinear system of equations that has to be satisfied:

$$
g(u; p) = \begin{bmatrix}
u_1 - \Phi_1(u_0; p_1)\\
u_2 - \Phi_2(u_1; p_2)\\
\vdots\\
u_N - \Phi_N(u_{N-1}; p_N)
\end{bmatrix} = 0.
$$

Using this notation (and removing the explicit dependence on $u_0$) we see that optimizing a neural network is really nothing more than solving the following equality-constrained optimization problem

$$
\min_{p}\; f(u) + R(p) \quad \text{subject to} \quad g(u; p) = 0.
$$

We can now use the adjoint method to compute the gradient of $f$ with respect to $p$. As a reminder this means that

$$
\frac{\mathrm{d}f}{\mathrm{d}p} = \frac{\partial f}{\partial p} + \lambda^\top\frac{\partial g}{\partial p}, \qquad \left(\frac{\partial g}{\partial u}\right)^\top \lambda = -\left(\frac{\partial f}{\partial u}\right)^\top,
$$

where $\lambda$ is referred to as the adjoint variable. For simplicity, we assume that $u_N$ is just a scalar, that is $u_N = e_M^\top u$ where $e_M$ is the $M$'th canonical basis vector. In this case we have that

$$
\frac{\partial f}{\partial u} = f'(u_N)\, e_M^\top.
$$

Now what is left to compute is $\frac{\partial g}{\partial u}$ and $\frac{\partial g}{\partial p}$. Starting with $\frac{\partial g}{\partial u}$, we note that because the network only propagates the $u_k$'s forwards, the resulting partial derivative is a lower block triangular matrix of the form

$$
\frac{\partial g}{\partial u} = \begin{bmatrix}
I & & & \\
-\frac{\partial \Phi_2}{\partial u_1} & I & & \\
& \ddots & \ddots & \\
& & -\frac{\partial \Phi_N}{\partial u_{N-1}} & I
\end{bmatrix}.
$$

Now is a good place to stop and think about a key concept of backpropagation: the computational graph should be a Directed Acyclic Graph (DAG). This is indeed equivalent to stating that $\frac{\partial g}{\partial u}$ should be a lower triangular matrix. This is an important property, as the adjoint method requires us to invert $\left(\frac{\partial g}{\partial u}\right)^\top$, which can be done cheaply for a triangular matrix. Even more important is that the diagonal blocks are the identity, meaning that forward/backward substitutions can be done without any matrix inversions at all.
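
To illustrate this point numerically (a standalone sketch with a random strictly lower triangular $S$, unrelated to the network code below): solving $(I - S)^\top \lambda = r$ requires only a single backward sweep and no inversions, because the diagonal is the identity.

```julia
using LinearAlgebra

# Unit lower triangular system L = I - S, with S strictly lower triangular
n = 5
S = [i > j ? randn() : 0.0 for i in 1:n, j in 1:n]
L = I - S
r = randn(n)

# Solve L'λ = r by a backward sweep: λᵢ = rᵢ + Σ_{j>i} Sⱼᵢ λⱼ
λ = copy(r)
for i in n-1:-1:1
    for j in i+1:n
        λ[i] += S[j,i]*λ[j]
    end
end
```

The `backsub` function in the implementation below performs the same sweep blockwise.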

What is left is to compute $\frac{\partial g}{\partial p}$, which is standardly computed layer by layer as

$$
\frac{\partial g}{\partial p} = -\begin{bmatrix}
\frac{\partial \Phi_1}{\partial p}\\
\frac{\partial \Phi_2}{\partial p}\\
\vdots\\
\frac{\partial \Phi_N}{\partial p}
\end{bmatrix}.
$$

Now, in the case where the layers do not share any parameters, the matrix will have a block diagonal structure of the form

$$
\frac{\partial g}{\partial p} = -\begin{bmatrix}
\frac{\partial \Phi_1}{\partial p_1} & & \\
& \ddots & \\
& & \frac{\partial \Phi_N}{\partial p_N}
\end{bmatrix},
$$

where $p_k$ are the parameters of layer $k$.

Putting everything together we find that the gradient of $f$ with respect to $p$ is given by

$$
\frac{\mathrm{d}f}{\mathrm{d}p} = \lambda^\top \frac{\partial g}{\partial p}, \qquad \left(\frac{\partial g}{\partial u}\right)^\top \lambda = -f'(u_N)\, e_M,
$$

where we used that $f$ has no direct dependence on $p$ (the regularization term $R(p)$ simply contributes $\nabla R(p)^\top$ and is omitted here).

As noted previously, both $\frac{\partial g}{\partial u}$ and $\frac{\partial g}{\partial p}$ are structured, meaning that the above can be computed efficiently.

How to compute the elements of $\frac{\partial g}{\partial u}$ and $\frac{\partial g}{\partial p}$

For a standard linear layer, $\Phi_k(u_{k-1}; p_k) = h(W_k u_{k-1} + b_k)$, the activation function $h$ is usually applied element-wise. In practice this means that we really should look at the rows of $\Phi_k$ separately. For the $i$'th row the gradients are easily seen to be

$$
\frac{\partial \Phi_{k,i}}{\partial u_{k-1}} = h'\!\left(w_{k,i}^\top u_{k-1} + b_{k,i}\right) w_{k,i}^\top,
$$

where $w_{k,i}^\top$ denotes the $i$'th row of $W_k$.

Using the above we see that

$$
\frac{\partial \Phi_k}{\partial u_{k-1}} = D_k W_k, \qquad D_k = \operatorname{Diag}\!\left(h'(W_k u_{k-1} + b_k)\right),
$$

where $h'$ is applied element-wise. Now for $\frac{\partial \Phi_k}{\partial p_k}$ we have to pick a way to vectorize the parameters. Here we choose

$$
p_k = \begin{bmatrix} \operatorname{vec}(W_k)\\ b_k \end{bmatrix},
$$

where $\operatorname{vec}(W_k)$ vectorizes $W_k$ by stacking its columns. Using the vectorization we can write the sought-after sensitivity as

$$
\frac{\partial \Phi_k}{\partial p_k} = D_k \begin{bmatrix} u_{k-1}^\top \otimes I & I \end{bmatrix} = D_k\left(\begin{bmatrix} u_{k-1}\\ 1 \end{bmatrix}^\top \otimes I\right).
$$

While the Kronecker product above does result in a sparse matrix, it is even more structured than that. Using the standard property of Kronecker products, i.e. $(B^\top \otimes A)v = \operatorname{vec}\!\left(A\operatorname{vec}^{-1}(v)B\right)$, where $\operatorname{vec}^{-1}$ is the inverse operation of $\operatorname{vec}$, multiplication with $\frac{\partial \Phi_k}{\partial p_k}$ can be performed without ever forming the Kronecker matrix explicitly.
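
As a quick numerical sanity check (sizes chosen arbitrarily for illustration), the action of the block $\begin{bmatrix} u^\top \otimes I & I \end{bmatrix}$ on $[\operatorname{vec}(W); b]$ is just $Wu + b$, so the Kronecker matrix never needs to be materialized:

```julia
using LinearAlgebra

W, b, u = randn(4,3), randn(4), randn(3)
p = [vec(W); b]   # layer parameters vectorized as [vec(W); b]

# Dense version of the sensitivity block [u' ⊗ I  I] (formed only for the check)
B = [kron(u', Matrix{Float64}(I,4,4)) Matrix{Float64}(I,4,4)]

lhs = B*p        # multiplication with the Kronecker-structured matrix
rhs = W*u + b    # the same product computed without forming B
```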

A final remark here is that the diagonal matrices that go into the elements of $\frac{\partial g}{\partial u}$ and $\frac{\partial g}{\partial p}$ are the same. That is, if we define

$$
D = \operatorname{blkdiag}(D_1, \dots, D_N), \qquad K = \operatorname{blkdiag}(K_1, \dots, K_N), \qquad K_k = \begin{bmatrix} u_{k-1}\\ 1 \end{bmatrix}^\top \otimes I,
$$

then it follows that

$$
\frac{\partial g}{\partial u} = I - D\operatorname{blkdiag}_{-1}(W_2, \dots, W_N), \qquad \frac{\partial g}{\partial p} = -DK,
$$

with the slightly unusual notation $\operatorname{blkdiag}_{-1}(W_2, \dots, W_N)$ meaning that the blocks are placed on the first lower block diagonal.
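
The identity $\frac{\partial \Phi_k}{\partial u_{k-1}} = D_k W_k$ can itself be sanity-checked with central finite differences (a standalone sketch; the sizes are arbitrary and the activation $h(x) = e^{-x}$ matches the implementation below):

```julia
using LinearAlgebra

h(x)  = exp(-x)   # activation function
∇h(x) = -exp(-x)  # its derivative

W, b, u = randn(4,3), randn(4), randn(3)
J = Diagonal(∇h.(W*u + b))*W   # the claimed Jacobian Dₖ*Wₖ

# Central finite differences, one input direction at a time
e = 1e-6
J_fd = hcat([(h.(W*(u + e*δ) + b) - h.(W*(u - e*δ) + b))/(2e)
             for δ in eachcol(Matrix{Float64}(I,3,3))]...)
```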

Below is a Julia implementation of the above backpropagation/adjoint method for a fully connected feedforward neural network with scalar output. The implementation uses BlockDiagonalMatrices.jl, BlockBandedMatrices.jl, and Kronecker.jl in order to efficiently represent the involved matrices.

# For BlockDiagonalMatrices.jl
# add https://github.com/mipals/BlockDiagonalMatrices.jl.git
using BlockDiagonalMatrices
using BlockBandedMatrices
using LinearAlgebra
using SparseArrays
using ForwardDiff
using Kronecker
using Test

h(x) = exp(-x) # activation function
∇h(x) = -exp(-x) # derivative of activation function

# Forward pass including the derivatives
function forward_pass(u0,θ)
    x0 = u0
    diags = empty([first(θ)[2]])
    krons = empty([first(θ)[2]' ⊗ I(2)])
    for (W,b) in θ
        push!(krons, [x0; 1]' ⊗ I(length(b))) # Lazy Kronecker product using Kronecker.jl
        tmp = W*x0 + b # Can be used for both forward pass and derivative pass
        x0 = h.(tmp)
        push!(diags, ∇h.(tmp))
    end
    return krons, diags, x0
end
# Block back substitution: solving L'y = b (i.e. y = L⁻ᵀb) without inversions
function backsub(dblks,wblks,b)
    y = convert.(eltype(wblks[1]), copy(b))
    j0 = length(b)
    i0 = length(b) - size(wblks[end],1)
    @views for (D,blk) in zip(reverse(dblks), reverse(Transpose.(wblks)))
        i1,j1 = size(blk)
        tmp = D .* y[j0-j1+1:j0]
        mul!(y[i0-i1+1:i0], blk, tmp, 1, 1) # yᵢ += Wᵀ(D .* yⱼ)
        j0 -= j1
        i0 -= i1
    end
    return y
end
# Helper function that packs a vector θ into Ws and bs
function pack_θ(θ, Ws_sizes, bs_sizes)
    We = empty([ones(eltype(θ),1,1)])
    be = empty([ones(eltype(θ),1)])
    i = 1
    for (W_size,b_size) in zip(Ws_sizes, bs_sizes)
        j = i + prod(W_size)
        push!(We, reshape(θ[i:j-1], W_size...))
        i = j + b_size
        push!(be, θ[j:i-1])
    end
    return We, be
end
# Evaluating f using the forward pass
function eval_f(θ, Ws_sizes, bs_sizes, u0)
    We,be = pack_θ(θ, Ws_sizes, bs_sizes)
    _,_,uN = forward_pass(u0, zip(We,be))
    return uN
end
# Gradient computation using the adjoint method
function ∇f(θ, Ws_sizes, bs_sizes, u0; y=0.0)
    # First pack the parameter vector into matrices
    We,be = pack_θ(θ, Ws_sizes, bs_sizes)
    # Forward pass includes derivative information
    krons, ddiags, uN = forward_pass(u0, zip(We,be))
    # We here use that ∂g/∂u = I - D*blkdiag₋₁(W₂,…,W_N) and ∂g/∂p = -D*K
    D = Diagonal(vcat(ddiags...))
    K = BlockDiagonal(krons)
    g = zeros(eltype(θ), sum(w_sizes -> w_sizes[1], Ws_sizes))
    g[end] = 2*(uN[1] - y) # uN is a length-1 vector so extract the scalar
    # The final step is to evaluate the gradient from the right
    grad_adjoint = (backsub(ddiags,We[2:end],g')*D)*K
    return vec(grad_adjoint) # gradient of (u_N - y)² as a plain vector
end
# Forward difference to test the adjoint gradient implementation
function fd(θ, Ws_sizes, bs_sizes, u0, i; e=1e-6, y=2.0)
    f0 = eval_f(θ, Ws_sizes, bs_sizes, u0)
    θ[i] += e
    f1 = eval_f(θ, Ws_sizes, bs_sizes, u0)
    θ[i] -= e
    return sum(((f1[1] - y)^2 - (f0[1] - y)^2)/e)
end

# Setting up parameters
layer_sizes = [50,40,30,20,10,1]
N = length(layer_sizes) - 1
init(sizes...) = 0.01*randn(sizes...)
Ws = [init(layer_sizes[i+1],layer_sizes[i]) for i=1:N]
bs = [init(layer_sizes[i+1]) for i = 1:N]
u0 = init(layer_sizes[1],1)[:]
θ = zip(Ws,bs)

Ws_sizes = size.(Ws)
bs_sizes = length.(bs)

# First we compute the forward pass
θvec = vcat([[W[:]; b] for (W,b) in θ]...)

## Testing the gradient
y = 3.0 # We aim to have the final output be 3.0
grad_adjoint = ∇f(θvec, Ws_sizes, bs_sizes, u0;y=y)
idx = 4000
@test fd(θvec, Ws_sizes, bs_sizes, u0, idx; e=1e-5, y=y) ≈ grad_adjoint[idx] atol=1e-6

# Optimizing to get the output y
for iter = 1:1000
    grad = ∇f(θvec, Ws_sizes, bs_sizes, u0; y=y) # y is the output value we want
    global θvec -= 0.001*grad # `global` needed when run at the top level of a script
end
@test eval_f(θvec,Ws_sizes,bs_sizes,u0)[1] ≈ y
@test ∇f(θvec, Ws_sizes, bs_sizes, u0; y=y) ≈ zeros(length(θvec)) atol=1e-10

Bibliography

  • [1] P. Bright, A. Edelman, and S. G. Johnson, “Matrix Calculus (for Machine Learning and Beyond).” [Online]. Available: https://arxiv.org/abs/2501.14787
  • [2] A. Edelman, E. Akyürek, and Y. Wang, “Backpropagation through Back Substitution with a Backslash,” SIAM Journal on Matrix Analysis and Applications, vol. 45, no. 1, pp. 429–449, 2024, doi: 10.1137/22M1532871.