Schatten Norm Approximation
Randomized Numerical Linear Algebra is a gorgeous field that embraces chaos and whose results often seem wonderfully magical. My very first post on this subject explains a beautiful algorithm to approximate Schatten norms of a matrix.
In Numerical Linear Algebra, we are often interested in understanding the effect a linear map $A$ has on a (real) vector space. A map can rotate, shear, or dilate the space; flip our orientation; expand the space or compress it. Whatever it does, to say anything intelligent about its behavior, we need to be able to quantify its effect too!
That’s where matrix norms come into the picture. You may be familiar with induced norms such as $\|A\|_2$ (maximum singular value), $\|A\|_1$ (maximum absolute column sum), and $\|A\|_\infty$ (maximum absolute row sum), which capture a lot of information. $\|A\|_2$, for example, tells you the largest amount of “stretching” $A$ is capable of as it transforms your space. To see this, it might help to go back to the definition of these norms:

$$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}.$$
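For concreteness, here is a quick NumPy check of these characterizations (the matrix and the probe vector below are arbitrary; they are only there to illustrate the definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

spectral = np.linalg.norm(A, 2)       # largest singular value
col_sums = np.linalg.norm(A, 1)       # maximum absolute column sum
row_sums = np.linalg.norm(A, np.inf)  # maximum absolute row sum
print(spectral, col_sums, row_sums)

# ||A||_2 bounds the "stretch" ||Ax||_2 / ||x||_2 for every nonzero x.
x = rng.standard_normal(3)
assert np.linalg.norm(A @ x) / np.linalg.norm(x) <= spectral + 1e-12
```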
In this post, we’ll work with a different kind of matrix norm: the Schatten $p$-norms, defined as follows:

$$\|A\|_p = \left( \sum_{i=1}^{r} \sigma_i^p(A) \right)^{1/p},$$

where $r$ is the rank of $A$ and $\sigma_i(A)$ is its $i$-th singular value. When $p = 2$, for example, this is the familiar Frobenius norm; $p = \infty$ gives the spectral norm, which coincides with the induced $2$-norm above. From here on, when I write $\|A\|_p$, I’m referring to the Schatten $p$-norm of $A$, and not the induced norm.
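If we could look at $A$ directly, the definition would translate to a few lines of NumPy (the helper name `schatten_norm` is my own; this is a sketch of the definition, not part of the algorithm described below):

```python
import numpy as np

def schatten_norm(A: np.ndarray, p: float) -> float:
    """Schatten p-norm: the l_p norm of the vector of singular values of A."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(sigma ** p) ** (1.0 / p))

A = np.random.default_rng(1).standard_normal((4, 6))
assert np.isclose(schatten_norm(A, 2), np.linalg.norm(A, "fro"))  # p = 2 is Frobenius
print(schatten_norm(A, 64), np.linalg.norm(A, 2))  # large p approaches the spectral norm
```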
One of the interesting facts about Schatten $p$-norms is that they are invariant under rotation! That is, if I applied an orthogonal transformation $R$ to $A$, $\|RA\|_p$ wouldn’t change. Let’s see why:

$$\|RA\|_p = \|R(U \Sigma V^T)\|_p = \|(RU)\Sigma V^T\|_p = \|A\|_p,$$

where $(\cdot)^T$ is the transpose of its argument, and $U \Sigma V^T$ is the Singular Value Decomposition of $A$. Indeed, because $R$ and $U$ are orthonormal matrices, $RU$ too will be orthonormal, rendering $(RU)\Sigma V^T$ the SVD of another matrix whose singular values match those of $A$’s. You can use a similar argument for transformations of $A$ from the right (i.e., $AR$ where $R R^T = R^T R = I$).
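A quick numerical sanity check of this invariance (the orthogonal factor below comes from a QR decomposition of a random matrix; any orthogonal $R$ would do):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # the Q factor: a random orthogonal matrix

for p in (1, 2, 3, 4):
    s_A  = np.linalg.svd(A, compute_uv=False)
    s_RA = np.linalg.svd(R @ A, compute_uv=False)
    assert np.isclose(np.sum(s_A ** p), np.sum(s_RA ** p))  # ||RA||_p^p == ||A||_p^p
```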
We see why Schatten $p$-norms are useful. Computing $\|A\|_p$ is also easy, if we are allowed to apply SVD to $A$ and extract its singular values. But, let’s say, we don’t have access to $A$; it’s being kept a secret by someone—let’s call this person the “oracle”—and they won’t let us apply SVD to the matrix! (It doesn’t have to be for nefarious reasons; it could simply be because $A$ is just way too large!) But we are allowed to transform any vector $v$ with $A$ and obtain the output, $Av$.
In what seems so magical, we can, in fact, approximate $\|A\|_p$ arbitrarily accurately by repeatedly asking for $Av$ for random vectors $v$! The rest of this post explains how.
Approximating $\|A\|_p^p$ for symmetric $A$
Suppose for now that $A \in \mathbb{R}^{n \times n}$ and that it is symmetric; we will come back to the more general case in a bit. We want to obtain, with probability at least $1 - \delta$, a $(1 \pm \epsilon)$-approximation of $\|A\|_p^p$ for some valid choices of $\epsilon$ and $\delta$. In other words, we want to find a quantity $\alpha$ such that:

$$\mathbb{P}\Big[\, \big|\alpha - \|A\|_p^p\big| > \epsilon \|A\|_p^p \,\Big] \le \delta.$$

Also, we’ll just focus on $\|A\|_p^p$, rather than $\|A\|_p$, because it makes the math less cluttered.
Now to the main steps of the algorithm proposed by (Li et al., 2014). Repeat the following process $s$ times (for some constant $s$ to be defined later), indexed by $i$:
- Initialize $v_0^{(i)} \sim \mathcal{N}(0, I)$, drawn from an isotropic Gaussian distribution (whose covariance is the identity). This amounts to a vector whose entries are independent random variables sampled from the standard normal distribution.
- Let $v_j^{(i)} = A v_{j-1}^{(i)}$ by asking the oracle for the computation, and repeat that process $p$ times (that is, for $j = 1, 2, \dots, p$).
- Take note of $z_i = \langle v_0^{(i)}, v_p^{(i)} \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product.
In the end, output the following quantity:

$$\alpha = \frac{1}{s} \sum_{i=1}^{s} z_i \tag{1}$$

as the desired approximation of $\|A\|_p^p$.
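Here is a minimal sketch of these steps in NumPy, under my reading of the algorithm (the function name `schatten_pp_estimate` is mine, the oracle is modeled as a black-box `matvec` closure, and $p$ is assumed to be an even integer for reasons discussed below):

```python
import numpy as np

def schatten_pp_estimate(matvec, n, p, s, rng=None):
    """Estimate ||A||_p^p for a symmetric n x n matrix A, given only v -> A v.

    matvec: the oracle, a function returning A v for a vector v.
    p:      an even integer (so that sum_k lambda_k^p equals ||A||_p^p).
    s:      number of Gaussian probes.
    """
    rng = rng or np.random.default_rng()
    z = np.empty(s)
    for i in range(s):
        v0 = rng.standard_normal(n)   # isotropic Gaussian probe v_0^(i)
        v = v0
        for _ in range(p):            # p oracle calls: v_j = A v_{j-1}
            v = matvec(v)
        z[i] = v0 @ v                 # z_i = <v_0, v_p> = v_0^T A^p v_0
    return float(z.mean())            # Equation (1)
```

Note that the sketch never forms or factorizes $A$; it only ever calls `matvec`, which is exactly the access the oracle grants us.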
That was just the (sweet and simple) algorithm. We left out what $s$ should be and, more importantly, why this procedure even works! So let’s get to it!
Before we do though, let’s quickly consider the complexity of the algorithm. If the number of non-zero entries in $A$ is $\mathrm{nnz}(A)$, then each iteration of the algorithm simply runs in $O(p \cdot \mathrm{nnz}(A))$ time. We repeat the algorithm $s$ times, so that its complexity adds up to $O(s \, p \cdot \mathrm{nnz}(A))$.
Why does that even work?
We assumed that $A$ was symmetric. That implies that it can be decomposed into the following form:

$$A = U \Lambda U^T,$$

for some orthogonal matrix $U$ (so that $U U^T = U^T U = I$) and diagonal matrix $\Lambda$ of eigenvalues $\lambda_k$. What happens when we take the inner product inside the sum in Equation (1)? It reduces to:

$$z_i = \langle v_0^{(i)}, v_p^{(i)} \rangle = \langle v_0^{(i)}, A^p v_0^{(i)} \rangle = (v_0^{(i)})^T A^p v_0^{(i)} = (v_0^{(i)})^T U \Lambda^p U^T v_0^{(i)} = \sum_{k=1}^{r} \lambda_k^p \big( U^T v_0^{(i)} \big)_k^2, \tag{2}$$

where we used the definition of $v_j^{(i)}$ to get the first equality, the definition of inner product for the second, and the decomposition of $A$ for the third. In the last expression, $r$ is the rank of $A$ and subscripts indicate specific coordinates.
The output of the algorithm is the mean of the $z_i$’s. Considering that each sample obeys Equation (2), and that $\tilde{g}^{(i)} = U^T v_0^{(i)}$ is itself an isotropic Gaussian vector (because $U$ is orthogonal), so that $\mathbb{E}\big[(\tilde{g}^{(i)}_k)^2\big] = 1$, the output $\alpha$ is an unbiased estimate of $\sum_{k=1}^{r} \lambda_k^p$, which, for even $p$, is exactly $\|A\|_p^p$.
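Spelling that expectation out:

$$\mathbb{E}[z_i] = \sum_{k=1}^{r} \lambda_k^p \, \mathbb{E}\big[(\tilde{g}^{(i)}_k)^2\big] = \sum_{k=1}^{r} \lambda_k^p = \|A\|_p^p.$$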
Analysis of error
I promised we’d get to $s$ at some point. Now that we have verified that the result of the algorithm is, in fact, an unbiased estimate of $\|A\|_p^p$, it’s time to fill that gap.
Error analysis often begins with a derivation of, or a bound on, the variance. In this case, we should be looking at the variance of our random variable $z_i$ from Equation (2). It shouldn’t be too hard considering we’re dealing with iid Gaussian variables. So let’s do it:

$$\mathrm{Var}[z_i] = \sum_{k=1}^{r} \lambda_k^{2p} \, \mathrm{Var}\big[ (\tilde{g}^{(i)}_k)^2 \big] = 2 \sum_{k=1}^{r} \lambda_k^{2p} \le 2 \left( \sum_{k=1}^{r} \lambda_k^p \right)^2 = 2 \|A\|_p^{2p},$$

where we used the independence of the coordinates of $\tilde{g}^{(i)}$, the fact that the square of a standard normal variable has variance $2$, and, for the inequality, that $\sum_k a_k^2 \le (\sum_k a_k)^2$ whenever $a_k \ge 0$.
So we know the variance of each sample $z_i$ is at most that quantity. Great. We sample some $s$ times. Because the samples are independent, the variance of their mean scales down by a factor of $s$: $\mathrm{Var}[\alpha] \le 2\|A\|_p^{2p} / s$.
Now we can finally get to our approximation error. We need to bound the probability that $\alpha$ deviates from $\|A\|_p^p$ by more than an $\epsilon$ fraction of it. For that, we will appeal to standard results from concentration of measure. In particular, we simply apply Chebyshev’s inequality:

$$\mathbb{P}\big[\, |X - \mathbb{E}[X]| \ge t \,\big] \le \frac{\mathrm{Var}[X]}{t^2},$$

to the random variable $\alpha$, with $t = \epsilon \|A\|_p^p$. That gives us:

$$\mathbb{P}\Big[\, \big|\alpha - \|A\|_p^p\big| \ge \epsilon \|A\|_p^p \,\Big] \le \frac{2 \|A\|_p^{2p}}{s \, \epsilon^2 \|A\|_p^{2p}} = \frac{2}{s \epsilon^2}.$$
We wanted the probability above to be at most $\delta$. Setting the right-hand side to $\delta$, we get that $s = \frac{2}{\epsilon^2 \delta}$. And there we have it!
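As a concrete illustration (the numbers are my own choice): for a $10\%$ relative error with $90\%$ confidence, i.e. $\epsilon = 0.1$ and $\delta = 0.1$,

$$s = \frac{2}{\epsilon^2 \delta} = \frac{2}{(0.1)^2 \times 0.1} = 2000,$$

so two thousand probes suffice, independently of the dimensions of $A$.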
The general case
We saw what we need to ask of the oracle to obtain an approximation of $\|A\|_p^p$ when $A$ is symmetric: we initialize a Gaussian vector, and repeatedly ask the oracle to return the matrix-vector product, which we then feed back to the oracle. Pretty straightforward. But what about the more general case where $A \in \mathbb{R}^{m \times n}$?
It turns out, if we formed the following block matrix:

$$B = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix},$$

then $\|B\|_p^p = 2 \|A\|_p^p$. It’s fairly easy to show this if you rewrite $B$ in terms of the (compact) SVD of $A$, $A = U \Sigma V^T$:

$$B = \begin{bmatrix} 0 & U \Sigma V^T \\ V \Sigma U^T & 0 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} U & U \\ V & -V \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} U & U \\ V & -V \end{bmatrix}^T.$$

Once scaled by $1/\sqrt{2}$, the outer factor has orthonormal columns, so the nonzero singular values of $B$ are exactly the singular values of $A$, each appearing twice, and the claim follows.
We would have to update the algorithm with this logic to ensure it works for a general matrix $A$: run the same procedure on $B$ (a product with $B$ costs only one product with $A$ and one with $A^T$), and divide the output by two.
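A small NumPy sketch of this reduction (the names are mine; the explicit SVDs at the end are only there to verify the claim, not something the oracle would let us do):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 6, 4, 4
A = rng.standard_normal((m, n))

# The symmetric "dilation" B = [[0, A], [A^T, 0]]. A product with B only
# requires one product with A and one with A^T, so the oracle model survives.
def B_matvec(w):
    x, y = w[:m], w[m:]
    return np.concatenate([A @ y, A.T @ x])

B = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])
w = rng.standard_normal(m + n)
assert np.allclose(B_matvec(w), B @ w)

s_A = np.linalg.svd(A, compute_uv=False)
s_B = np.linalg.svd(B, compute_uv=False)
assert np.isclose(np.sum(s_B ** p), 2 * np.sum(s_A ** p))  # ||B||_p^p = 2 ||A||_p^p
```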
Closing Remarks
I hope you found the elegance and simplicity of this randomized algorithm as compelling as I did, and still do, ten years since its publication. This rich literature contains numerous other magical-feeling algorithms for norm approximation and other operations on vectors and matrices, and I encourage you to study it. There are a number of excellent monographs on the subject, such as (Woodruff, 2014), (Murray et al., 2023), (Martinsson & Tropp, 2021), and references therein. I hope to highlight some of the more representative algorithms in future posts.
- Li, Y., Nguyen, H. L., & Woodruff, D. P. (2014). On sketching matrix norms and the top singular vector. Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 1562–1581.
- Woodruff, D. P. (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2), 1–157. https://doi.org/10.1561/0400000060
- Murray, R., Demmel, J., Mahoney, M. W., Erichson, N. B., Melnichenko, M., Malik, O. A., Grigori, L., Luszczek, P., Dereziński, M., Lopes, M. E., Liang, T., Luo, H., & Dongarra, J. (2023). Randomized Numerical Linear Algebra: A Perspective on the Field With an Eye to Software. https://arxiv.org/abs/2302.11474
- Martinsson, P.-G., & Tropp, J. (2021). Randomized Numerical Linear Algebra: Foundations & Algorithms. https://arxiv.org/abs/2002.01387