The Polar Express Algorithm
In this note we explain how the Polar Express algorithm can be used to approximate the polar transform used in the Muon optimizer. Furthermore we show how similar ideas can be used to project matrices onto the cone of positive semidefinite matrices using only matrix-matrix products. Such methods are of interest because matrix-matrix products are very efficient to compute on modern hardware such as GPUs.
The Singular Value Transformation
The singular value transformation is defined for any odd function $f : \mathbb{R} \to \mathbb{R}$, as

$$f^{\mathrm{SV}}(A) = U f(\Sigma) V^\top,$$

where $A = U \Sigma V^\top$ is the singular value decomposition of $A$ and $f$ is applied elementwise to the diagonal of $\Sigma$.
The requirement that $f$ is odd means that the singular value transformation and the usual matrix function coincide when the matrix is square symmetric. To see this we start by using the eigendecomposition of $A$ as $A = Q \Lambda Q^\top$ and write $\Lambda = |\Lambda| \operatorname{sign}(\Lambda)$, so that $A = Q\,|\Lambda|\,\bigl(Q \operatorname{sign}(\Lambda)\bigr)^\top$ is a singular value decomposition of $A$. Since $f(|\lambda|)\operatorname{sign}(\lambda) = f(\lambda)$ for odd $f$, the following holds

$$f^{\mathrm{SV}}(A) = Q f(|\Lambda|) \operatorname{sign}(\Lambda) Q^\top = Q f(\Lambda) Q^\top = f(A),$$

where the sign function is defined as the odd function

$$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ -1 & \text{if } x < 0. \end{cases}$$
Approximating the Polar Normalization
Polar normalization corresponds to the singular value transformation when $f$ itself is the sign function, i.e.

$$\operatorname{polar}(A) = \operatorname{sign}^{\mathrm{SV}}(A) = U \operatorname{sign}(\Sigma) V^\top = U V^\top,$$

where the last equality holds when $A$ has full rank.
Recently there has been increased interest in polar normalization, as it is a key ingredient in the recently proposed Muon algorithm for optimizing large language models [1]. In short, the steps of the Muon algorithm are [2]

$$M_t = \beta M_{t-1} + G_t, \qquad W_t = W_{t-1} - \eta \operatorname{polar}(M_t),$$

where $G_t$ is the gradient, $M_t$ the momentum buffer, and $\eta$ the learning rate.
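As a concrete reference point, the polar transform in the update above can be computed exactly (though expensively) via the SVD. A minimal NumPy sketch of this baseline, with illustrative values for the momentum and learning-rate hyperparameters:

```python
import numpy as np

def polar_via_svd(M):
    """Exact polar transform polar(M) = U V^T via the (expensive) SVD.
    This is the quantity the polynomial methods below approximate."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_step(W, M, G, beta=0.95, lr=0.02):
    """One Muon-style update: momentum accumulation followed by a
    polar-normalized weight update (beta and lr are illustrative)."""
    M = beta * M + G
    W = W - lr * polar_via_svd(M)
    return W, M

# The polar transform replaces all singular values with 1, so the
# result has orthonormal columns (for a tall matrix).
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 3))
Q = polar_via_svd(G)
print("orthonormal columns:", np.allclose(Q.T @ Q, np.eye(3)))
```

The orthogonalized update $UV^\top$ weights every direction in the gradient's row and column space equally, which is the intended effect of the polar step.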
A distinct disadvantage is that the polar step in the naive implementation requires computing an SVD, which is computationally expensive and not very GPU friendly (it is primarily memory bound rather than compute bound). Recently the Polar Express algorithm has been proposed as a method for approximating the matrix sign function using just matrix-matrix products [3]. The key idea is rather simple: approximate the sign function using a composition of polynomials, e.g.

$$\operatorname{sign}(x) \approx (p_k \circ p_{k-1} \circ \cdots \circ p_1)(x).$$
Since the composition consists only of polynomials, applying it to a matrix requires only matrix additions and matrix-matrix multiplications.
Example: Newton-Schulz
The idea of approximating functions using repeated polynomial application is not new. An example is the Newton-Schulz algorithm, which is based on repeatedly applying the same polynomial, i.e. $x_{k+1} = p(x_k)$, in order to approximate the sign function. For order 3 and 5 the polynomials look as follows

$$p_3(x) = \tfrac{1}{2}\left(3x - x^3\right), \qquad p_5(x) = \tfrac{1}{8}\left(15x - 10x^3 + 3x^5\right).$$
While the Newton-Schulz method converges for all starting values with $0 < |x| < \sqrt{3}$ (in the order-3 case), the convergence is only fast for values close to $1$. Conversely, for values close to zero the convergence is slow. This slow convergence is exactly what the Polar Express method, explained shortly, aims to fix.
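To see how slow this is, note that near zero the order-3 Newton-Schulz polynomial $p(x) = \tfrac{1}{2}(3x - x^3)$ acts like $x \mapsto \tfrac{3}{2}x$, so each step only multiplies a small singular value by about $1.5$. A small numerical illustration:

```python
# Track how a small singular value grows under repeated application
# of the degree-3 Newton-Schulz polynomial p(x) = (3x - x^3) / 2.
p = lambda x: 1.5 * x - 0.5 * x**3

x = 1e-3  # a small singular value after normalization
for step in range(1, 9):
    x = p(x)
    print(f"step {step}: {x:.6f}")
# The value grows only geometrically (factor ~1.5 per step), so many
# iterations are needed before it approaches the fixed point at 1.
```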
A note here is that for non-square matrices we cannot simply form matrix powers such as $X^3$, since the dimensions do not match. Instead, the odd powers are evaluated using alternating products with the transpose, e.g. for the order-3 polynomial

$$X_{k+1} = \tfrac{1}{2}\left(3 X_k - X_k X_k^\top X_k\right),$$

which is equivalent to applying $p_3$ to the singular values of $X_k$.
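Putting this together, a minimal NumPy sketch of the Newton-Schulz iteration for approximating $\operatorname{polar}(X)$, assuming an initial normalization by the spectral norm so the singular values start in $(0, 1]$:

```python
import numpy as np

def newton_schulz_polar(X, steps=20):
    """Approximate polar(X) = U V^T by iterating the order-3
    Newton-Schulz polynomial, using only matrix-matrix products."""
    # Normalize so all singular values lie in (0, 1], inside the
    # convergence region (0, sqrt(3)).
    X = X / np.linalg.norm(X, ord=2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
Q = newton_schulz_polar(A)
# Compare against the exact polar transform from the SVD.
U, _, Vt = np.linalg.svd(A, full_matrices=False)
print("error vs U V^T:", np.abs(Q - U @ Vt).max())
```

In practice the spectral norm is itself replaced by a cheap upper bound such as the Frobenius norm, since only matrix-matrix products are assumed to be available.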
The Polar Express Algorithm
The basic question that the Polar Express algorithm aims to solve is as follows: given a composition of polynomials $p_k \circ \cdots \circ p_1$ of fixed degrees, can we find the set of coefficients that minimizes the maximum error over a given interval $[\ell, 1]$? That is, we want to solve

$$\min_{p_1, \dots, p_k} \; \max_{x \in [\ell, 1]} \; \bigl| \operatorname{sign}(x) - (p_k \circ \cdots \circ p_1)(x) \bigr|.$$
In [3] they show that the above is solved by a greedy approach, with the caveat that convergence is only guaranteed if the smallest singular value is larger than $\ell$, which we do not know in advance. A comment here is that the main focus in the paper is on float16, where the machine precision is $2^{-10} \approx 10^{-3}$. Therefore, they suggest setting $\ell = 10^{-3}$ with the upper end of the interval normalized to $1$. In addition, they provide a few modifications in order to make the method numerically robust. These additions are skipped here, but can be found in the original paper.
The first step is then to find a polynomial approximation of the sign function on the interval $[\ell, 1]$ for some small $\ell > 0$. That is, we solve

$$p_1 = \arg\min_{\deg p \le d} \; \max_{x \in [\ell, 1]} \; \bigl| 1 - p(x) \bigr|,$$

using that $\operatorname{sign}(x) = 1$ on $[\ell, 1]$. The image of $[\ell, 1]$ under $p_1$ gives a new, tighter interval, on which the next polynomial $p_2$ is optimized, and so on greedily.
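After discretizing the interval on a grid, each inner minimax problem can be solved as a small linear program: minimize $t$ subject to $|1 - p(x_i)| \le t$ at every grid point. A hedged sketch using SciPy's `linprog` and a modest interval for illustration (the construction in [3] is more careful about grid choice and numerical robustness):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_odd_poly(lo, hi, n_grid=400):
    """Best odd degree-5 polynomial a*x + b*x^3 + c*x^5 approximating
    sign(x) = 1 on [lo, hi] in the (discretized) minimax sense."""
    x = np.linspace(lo, hi, n_grid)
    B = np.stack([x, x**3, x**5], axis=1)  # odd monomial basis
    ones = np.ones(n_grid)
    # Variables [a, b, c, t]; constraints p(x)-t <= 1 and -p(x)-t <= -1.
    A_ub = np.block([[B, -ones[:, None]], [-B, -ones[:, None]]])
    b_ub = np.concatenate([ones, -ones])
    res = linprog(c=[0, 0, 0, 1], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3 + [(0, None)])
    a, b, c, t = res.x
    return (a, b, c), t

# One greedy step: optimize on [l, 1], then track the image interval.
(a, b, c), err = minimax_odd_poly(0.3, 1.0)
p = lambda x: a * x + b * x**3 + c * x**5
grid = np.linspace(0.3, 1.0, 1000)
print("max error:", err)
print("image interval:", (p(grid).min(), p(grid).max()))
```

The printed image interval is then the domain for the next polynomial in the greedy composition.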
Projection onto the semidefinite cone
The positive semidefinite cone is the set of all real symmetric positive semidefinite matrices. That is

$$\mathcal{S}_+^n = \bigl\{ X \in \mathbb{R}^{n \times n} : X = X^\top, \; v^\top X v \ge 0 \text{ for all } v \in \mathbb{R}^n \bigr\}.$$
Semidefinite programming (SDP) problems are optimization problems in which matrix variables are constrained to be positive semidefinite. As such, algorithms that aim to solve SDPs need to ensure that the solution satisfies the positive semidefinite constraint. In interior point methods this is done by only taking steps for which the next iterate remains inside the cone, while first-order methods most often rely on projecting back onto the set after taking a step, i.e.

$$X_{k+1} = \Pi_{\mathcal{S}_+^n}\bigl(X_k - \eta \nabla f(X_k)\bigr),$$

where the projection $\Pi_{\mathcal{S}_+^n}(A) = Q \max(\Lambda, 0)\, Q^\top$ is computed from the eigendecomposition $A = Q \Lambda Q^\top$.
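For reference, the exact projection clips negative eigenvalues to zero and therefore requires a full eigendecomposition, which is precisely the factorization the polynomial methods below avoid. A minimal sketch:

```python
import numpy as np

def project_psd(A):
    """Exact projection onto the PSD cone: clip negative eigenvalues
    to zero. Requires a (factorization-based) eigendecomposition."""
    A = (A + A.T) / 2                      # symmetrize
    lam, Q = np.linalg.eigh(A)
    # Q * lam scales column j of Q by lam[j], giving Q diag(lam) Q^T.
    return (Q * np.maximum(lam, 0.0)) @ Q.T

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
S = (S + S.T) / 2
P = project_psd(S)
print("smallest eigenvalue:", np.linalg.eigvalsh(P).min())
```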
While the idea of projecting back onto the semidefinite cone using just matrix-matrix products is not new (a similar fixed-point approach was suggested in [4]), the main takeaway is that the ideas of Polar Express can be used to find a faster converging method [5]. An initial idea would be to directly find polynomials that approximate the ReLU function, i.e. solving

$$\min_{p_1, \dots, p_k} \; \max_{x \in [-1, 1]} \; \bigl| \max(x, 0) - (p_k \circ \cdots \circ p_1)(x) \bigr|,$$
where $k$ is the total number of polynomials in the composition and $d_i$ is the degree of the $i$'th polynomial, both fixed in advance as part of a matrix-matrix multiplication budget. Unfortunately, directly solving the above is challenging, and instead a two-stage approach is proposed. The idea is to realize that we can write the ReLU function as

$$\operatorname{relu}(x) = \max(x, 0) = \frac{x + x \operatorname{sign}(x)}{2}.$$
The first step is to compute optimal coefficients for approximating the sign function using the Polar Express approach. The second step is then to refine the coefficients of the polynomials found in the first step by minimizing the error to the ReLU function on the entire interval, i.e. minimizing the following loss function

$$\mathcal{L}(p_1, \dots, p_k) = \max_{x \in [-1, 1]} \left| \max(x, 0) - \frac{x + x\,(p_k \circ \cdots \circ p_1)(x)}{2} \right|.$$
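Combining the identity $\max(x, 0) = \bigl(x + x \operatorname{sign}(x)\bigr)/2$ with a polynomial sign approximation gives a factorization-free approximate projection. A sketch using the plain order-3 Newton-Schulz iteration as the sign approximation (the method of [5] instead uses the optimized and refined Polar Express coefficients):

```python
import numpy as np

def sign_newton_schulz(A, steps=25):
    """Approximate the matrix sign of symmetric A with the degree-3
    Newton-Schulz polynomial p(X) = (3X - X^3) / 2."""
    X = A / np.linalg.norm(A, ord=2)  # scale eigenvalues into [-1, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X @ X
    return X

def project_psd_polynomial(A):
    """Approximate PSD projection via relu(x) = (x + x*sign(x)) / 2
    applied to the eigenvalues, using only matrix-matrix products."""
    A = (A + A.T) / 2
    return (A + A @ sign_newton_schulz(A)) / 2

rng = np.random.default_rng(1)
S = rng.standard_normal((6, 6))
S = (S + S.T) / 2
P = project_psd_polynomial(S)
print("smallest eigenvalue:", np.linalg.eigvalsh(P).min())
```

Note that an eigenvalue with a poor sign approximation must itself be small in magnitude, so the absolute projection error stays small even where the sign iteration has not fully converged.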
Bibliography
- [1] K. Jordan et al., “Muon: An optimizer for hidden layers in neural networks.” [Online]. Available: https://kellerjordan.github.io/posts/muon/
- [2] J. Bernstein, “Modular Manifolds,” Thinking Machines Lab: Connectionism, 2025, doi: 10.64434/tml.20250926.
- [3] N. Amsel, D. Persson, C. Musco, and R. M. Gower, “The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm.” [Online]. Available: https://arxiv.org/abs/2505.16932
- [4] J. B. Francisco and D. S. Gonçalves, “A fixed-point method for approximate projection onto the positive semidefinite cone,” Linear Algebra and its Applications, vol. 523, pp. 59–78, 2017, doi: 10.1016/j.laa.2017.02.014.
- [5] S. Kang, H. Han, A. Groudiev, and H. Yang, “Factorization-free Orthogonal Projection onto the Positive Semidefinite Cone with Composite Polynomial Filtering.” [Online]. Available: https://arxiv.org/abs/2507.09165