Schatten Norm Approximation

Randomized Numerical Linear Algebra is a gorgeous field that embraces chaos and whose results often seem wonderfully magical. My very first post on this subject explains a beautiful algorithm to approximate Schatten norms of a matrix.

In Numerical Linear Algebra, we are often interested in understanding the effect a linear map $A \in \mathbb{R}^{m \times n}$ has on a (real) vector space. A map can rotate, shear, or dilate the space; flip our orientation; expand the space or compress it. Whatever $A$ does, to say anything intelligent about its behavior, we need to be able to quantify its effect too!

That’s where matrix norms come into the picture. You may be familiar with induced norms such as $\lVert A \rVert_2$ (maximum singular value), $\lVert A \rVert_1$ (maximum absolute column sum), and $\lVert A \rVert_\infty$ (maximum absolute row sum), which capture a lot of information. $\lVert A \rVert_2$, for example, tells you the largest amount of “stretching” $A$ is capable of as it transforms your space. To see this, it might help to go back to the definition of these norms: $\lVert A \rVert_p = \sup_{\lVert u \rVert_p = 1} \lVert Au \rVert_p$.

In this post, we’ll work with a different kind of matrix norm: the Schatten $p$-norms, defined as follows:

$$\lVert A \rVert_p^p = \sum_{1 \leq i \leq \rho} \sigma^p_i(A),$$

where $\rho$ is the rank of $A$ and $\sigma_i(\cdot)$ is its $i$-th singular value. When $p=2$, for example, this is the familiar Frobenius norm; $p=\infty$ gives the spectral norm, which coincides with the induced $2$-norm above. From here on, when I write $\lVert A \rVert_p$, I’m referring to the Schatten $p$-norm of $A$, and not the induced norm.
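To make the definition concrete, here is a small reference implementation (my own sketch, not code from any of the referenced works) that computes the Schatten $p$-norm from the singular values and checks the two special cases above against NumPy’s built-in Frobenius and spectral norms.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm of A, computed directly from its singular values."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
    if np.isinf(p):
        return s[0]                          # p = infinity: the spectral norm
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))

print(schatten_norm(A, 2), np.linalg.norm(A, "fro"))    # matches the Frobenius norm
print(schatten_norm(A, np.inf), np.linalg.norm(A, 2))   # matches the spectral norm
```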

One of the interesting facts about Schatten $p$-norms is that they are invariant under rotation! That is, if I applied an orthogonal transformation $R \in \mathbb{R}^{m \times m}$ to $A$, $\lVert A \rVert_p$ wouldn’t change. Let’s see why:

$$\lVert RA \rVert_p^p = \lVert R U \Sigma V^\ast \rVert_p^p = \lVert \underbrace{Q \Sigma V^\ast}_{B} \rVert_p^p = \sum_{1 \leq i \leq \rho} \sigma^p_i = \lVert A \rVert_p^p,$$

where $\cdot^\ast$ denotes the transpose of its argument, and $A=U \Sigma V^\ast$ is the Singular Value Decomposition of $A$. Indeed, because $R$ and $U$ are orthonormal matrices, $Q=RU$ too will be orthonormal, rendering $Q \Sigma V^\ast$ the SVD of another matrix $B$ whose singular values match those of $A$. You can use a similar argument for transformations of $A$ from the right (i.e., $AR$ where $R \in \mathbb{R}^{n \times n}$).
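If you’d like to see this numerically, the quick sanity check below (mine, not from the paper) draws a random orthogonal $R$ as the Q-factor of a Gaussian matrix and confirms that the Schatten norms of $A$ and $RA$ agree.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm via the singular values (same helper as above)."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 25))
R, _ = np.linalg.qr(rng.standard_normal((40, 40)))   # a random orthogonal matrix

for p in (1, 2, 3, 4):
    print(p, schatten_norm(A, p), schatten_norm(R @ A, p))   # equal up to round-off
```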

We see why Schatten $p$-norms are useful. Computing $\lVert A \rVert_p$ is also easy, if we are allowed to apply SVD to $A$ and extract its singular values. But, let’s say, we don’t have access to $A$; it’s being kept a secret by someone (let’s call this person the “oracle”) who won’t let us apply SVD to the matrix! (It doesn’t have to be for nefarious reasons; it could simply be that $A$ is way too large!) But we are allowed to transform any vector $u$ with $A$ and obtain the output, $Au$.

In what seems like magic, we can, in fact, approximate $\lVert A \rVert_p$ arbitrarily accurately by repeatedly asking for $Au$ for random vectors $u$! The rest of this post explains how.

Approximating $\lVert A \rVert_p$ for symmetric $A$

Suppose for now that $A \in \mathbb{R}^{d \times d}$ and that it is symmetric; we will come back to the more general case in a bit. We want to obtain, with probability at least $1 - \delta$, a $(1+\epsilon)$-approximation of $\lVert A \rVert_p$ for some valid choices of $\delta$ and $\epsilon$. In other words, we want to find a quantity $X$ such that:

$$\mathbb{P}\Big[ (1 - \epsilon) \lVert A \rVert_p^p \leq X \leq (1+ \epsilon) \lVert A \rVert_p^p \Big] \geq 1 - \delta.$$

Also, we’ll just focus on $\lVert A \rVert_p^p$, rather than $\lVert A \rVert_p$, because it makes the math less cluttered.

Now to the main steps of the algorithm proposed by (Li et al., 2014). Repeat the following process $T = C / \epsilon^2$ times (for some constant $C$ to be defined later), indexed by $t$:

  • Initialize $u := u^{(0)} \sim \mathcal{N}(\mathbf{0}, I_{d \times d})$, drawn from an isotropic Gaussian distribution (whose covariance is the identity). This amounts to a vector whose $d$ entries are independent random variables sampled from the standard normal distribution.
  • Let $u^{(i+1)} \leftarrow Au^{(i)}$ by asking the oracle for the computation, and repeat that process $\lceil p/2 \rceil$ times.
  • Take note of $X^{(t)} \leftarrow \langle u^{(\lfloor p/2 \rfloor)}, u^{(\lceil p/2 \rceil)} \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product.

In the end, output the following quantity:

$$\frac{1}{T} \sum_{1 \leq t \leq T} X^{(t)}, \tag{1}$$

as the desired approximation of $\lVert A \rVert_p^p$.
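Here is a minimal sketch of the procedure in Python (my own translation of the steps above; the function names are mine). The estimator touches $A$ only through a matvec callback that plays the role of the oracle, and the demo uses a symmetric positive semi-definite test matrix so that the singular values coincide with the eigenvalues.

```python
import math
import numpy as np

def estimate_schatten_pp(matvec, d, p, T, seed=0):
    """Estimate ||A||_p^p for a symmetric A accessed only through matvec(u) = A @ u."""
    rng = np.random.default_rng(seed)
    lo, hi = p // 2, math.ceil(p / 2)
    samples = np.empty(T)
    for t in range(T):
        u = rng.standard_normal(d)                # u^(0) ~ N(0, I)
        iterates = [u]
        for _ in range(hi):                       # ask the oracle ceil(p/2) times
            iterates.append(matvec(iterates[-1]))
        samples[t] = iterates[lo] @ iterates[hi]  # X^(t) = <u^(floor p/2), u^(ceil p/2)>
    return samples.mean()

# Demo on a symmetric PSD matrix (so its singular values equal its eigenvalues).
rng = np.random.default_rng(42)
B = rng.standard_normal((100, 100))
A = B @ B.T / 100
p = 3

exact = (np.linalg.svd(A, compute_uv=False) ** p).sum()        # ||A||_p^p via SVD
approx = estimate_schatten_pp(lambda u: A @ u, d=100, p=p, T=20_000)
print(exact, approx)                                           # close, up to sampling error
```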

That was just the (sweet and simple) algorithm. We left out what $C$ should be and, more importantly, why this procedure even works! So let’s get to it!

Before we do though, let’s quickly consider the complexity of the algorithm. If the number of non-zero entries in $A$ is $\mathop{nnz}(A)$, then each iteration of the algorithm simply runs in $\mathcal{O}(p \cdot \mathop{nnz}(A))$ time. We repeat the algorithm $T$ times, so that its complexity adds up to $\mathcal{O}(p \cdot \mathop{nnz}(A) \cdot \epsilon^{-2})$.

Why does that even work?

We assumed that $A$ was symmetric. That implies that it can be decomposed into the following form: $U \Sigma U^\ast$ for some orthogonal matrix $U$ (so that $UU^\ast=I$) and diagonal matrix $\Sigma$. What happens when we take the inner product inside the sum in Equation (1)? It reduces to:

$$\begin{aligned} X := \langle u^{(\lfloor p/2 \rfloor)}, u^{(\lceil p/2 \rceil)} \rangle &= \langle A^{\lfloor p /2 \rfloor} u, A^{\lceil p /2 \rceil} u \rangle \\ &= u^\ast A^p u \\ &= u^\ast \Sigma^p u \\ &= \sum_{1 \leq k \leq \rho} \sigma_k^p u_k^2, \tag{2} \end{aligned}$$

where we used the definition of $u^{(i)}$ to get the first equality, the definition of the inner product (together with the symmetry of $A$) for the second, and the decomposition of $A$ for the third. To be precise about that third step: $u^\ast A^p u = (U^\ast u)^\ast \Sigma^p (U^\ast u)$, and because the Gaussian distribution is rotation-invariant, $U^\ast u$ is itself a standard Gaussian vector, which we continue to call $u$ with a slight abuse of notation. In the last expression, $\rho$ is the rank of $A$ and subscripts indicate specific coordinates.
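Here is a quick numerical check of Equation (2) (my own snippet; I use a symmetric PSD matrix so that the diagonal of $\Sigma$ really holds the singular values). It compares the inner product of the two iterates, the quadratic form $u^\ast A^p u$, and the weighted sum of squared coordinates in the eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((30, 30))
A = B @ B.T                    # symmetric PSD: eigenvalues coincide with singular values
u = rng.standard_normal(30)
p = 5

u_lo = np.linalg.matrix_power(A, p // 2) @ u         # u^(floor(p/2))
u_hi = np.linalg.matrix_power(A, (p + 1) // 2) @ u   # u^(ceil(p/2))

sigma, U = np.linalg.eigh(A)   # spectral decomposition A = U diag(sigma) U^T
w = U.T @ u                    # the rotated vector U^T u, still standard Gaussian

print(u_lo @ u_hi)                           # <u^(floor p/2), u^(ceil p/2)>
print(u @ np.linalg.matrix_power(A, p) @ u)  # u^T A^p u
print((sigma ** p * w ** 2).sum())           # sum_k sigma_k^p * (U^T u)_k^2
```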

The output of the algorithm is the mean of the $X^{(t)}$’s. Each sample $X^{(t)}$ is distributed as in Equation (2), and since $\mathbb{E}[u_k^2] = \mathop{Var}[u_k] = 1$, we have $\mathbb{E}[X] = \sum_k \sigma_k^p$. The output is therefore an unbiased estimate of exactly $\lVert A \rVert_p^p$.

Analysis of error

I promised we’d get to $C$ at some point. Now that we have verified that the result of the algorithm is, in fact, an unbiased estimate of $\lVert A \rVert_p^p$, it’s time to fill that gap.

Error analysis often begins with a derivation of or a bound on the variance. In this case, we should be looking at our random variable $X$ from Equation (2). It shouldn’t be too hard considering we’re dealing with iid Gaussian variables. So let’s do it:

$$\begin{aligned} \mathop{Var}[X] &\leq \mathbb{E}[X^2] = \mathbb{E}_u[(u^\ast \Sigma^p u)^2] \\ &= \sum_{j, k} \sigma^p_j \sigma^p_k \; \mathbb{E}[u_j^2 u_k^2] \\ &= \sum_{j} \sigma^{2p}_j \; \underbrace{\mathbb{E}[u_j^4]}_3 + \sum_{j \neq k} \sigma^p_j \sigma^p_k \; \underbrace{\mathbb{E}[u_j^2] \mathbb{E}[u_k^2]}_1 \\ &= 3\sum_{j} \sigma^{2p}_j + \sum_{j \neq k} \sigma^p_j \sigma^p_k \\ &\leq 4 \lVert A \rVert_p^{2p}. \end{aligned}$$
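The bound is easy to probe empirically. The snippet below (my own sanity check) fixes a set of singular values, draws many Gaussian vectors, and compares the sample mean and variance of $X = u^\ast \Sigma^p u$ against $\lVert A \rVert_p^p$ and $4 \lVert A \rVert_p^{2p}$.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = rng.uniform(0.5, 2.0, size=20)           # singular values of a hypothetical A
p = 3

U = rng.standard_normal((200_000, sigma.size))   # 200k iid standard Gaussian vectors u
X = (U ** 2 * sigma ** p).sum(axis=1)            # X = u^T Sigma^p u for each draw

print(X.mean(), (sigma ** p).sum())              # sample mean vs. ||A||_p^p (unbiased)
print(X.var(), 4 * (sigma ** p).sum() ** 2)      # sample variance vs. the 4 ||A||_p^{2p} bound
```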

So we know the variance of each sample of $X$ is at most $4 \lVert A \rVert_p^{2p}$. Great. We sample $X$ some $T$ times. Because the $T$ samples are independent, the variance of their mean scales down by a factor of $T$.

Now we can finally get to our approximation error. We need to bound the probability that our estimate deviates from $\lVert A \rVert_p^p$ by more than an $\epsilon$ fraction of it. For that, we will appeal to standard results from concentration of measure. In particular, we simply apply Chebyshev’s inequality,

$$\mathbb{P}\Big[ \big\lvert X - \mathbb{E}[X] \big\rvert \geq \epsilon \mathbb{E}[X] \Big] \leq \frac{\mathop{Var}[X]}{\epsilon^2 \mathbb{E}^2[X]},$$

to the random variable $Z = \sum_t X^{(t)} / T$. That gives us:

$$\mathbb{P}\Big[ \big\lvert Z - \lVert A \rVert_p^p \big\rvert \geq \epsilon \lVert A \rVert_p^p \Big] \leq \frac{4 \lVert A \rVert_p^{2p}}{T \epsilon^2 \lVert A \rVert_p^{2p}} = \frac{4}{T \epsilon^2} = \frac{4}{C}.$$

We wanted the probability above to be at most $\delta$. Setting the right-hand side to $\delta$, we get that $C = 4/\delta$. And there we have it!
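To put concrete numbers on this (the values of $\epsilon$ and $\delta$ below are just illustrative choices of mine):

```python
import math

# T = C / eps^2 repetitions, with C = 4 / delta.
eps, delta, p = 0.1, 0.05, 4
T = math.ceil(4 / (delta * eps ** 2))
print(T)                      # 8000 repetitions
print(T * math.ceil(p / 2))   # 16000 matrix-vector products with the oracle in total
```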

The general case

We saw what we need to ask of the oracle to obtain an approximation of $\lVert A \rVert_p$ when $A$ is symmetric: We initialize a Gaussian vector, and repeatedly ask the oracle to return the matrix-vector product, which we then feed back to the oracle. Pretty straightforward. But what about the more general case where $A \in \mathbb{R}^{m \times n}$?

It turns out that if we form the following block matrix:

$$A^\prime = \begin{bmatrix} 0 & A^\ast\\ A & 0 \end{bmatrix},$$

then $\lVert A^\prime \rVert_p = 2^{1/p} \lVert A \rVert_p$. It’s fairly easy to show this if you rewrite $A^\prime$ in terms of the SVD of $A=U\Sigma V^\ast$:

$$A^\prime = \begin{bmatrix} 0 & V \Sigma U^\ast \\ U \Sigma V^\ast & 0 \end{bmatrix} = \begin{bmatrix} V & 0 \\ 0 & U \end{bmatrix} \begin{bmatrix} 0 & \Sigma U^\ast \\ \Sigma V^\ast & 0 \end{bmatrix} = \begin{bmatrix} V & 0 \\ 0 & U \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & \Sigma \end{bmatrix} \begin{bmatrix} 0 & U^\ast \\ V^\ast & 0 \end{bmatrix}.$$

The outer factors are orthonormal, so this is an SVD of $A^\prime$: its singular values are those of $A$, each appearing twice, and hence $\lVert A^\prime \rVert_p^p = 2 \lVert A \rVert_p^p$.

We would have to update the algorithm with this logic, running it on the symmetric matrix $A^\prime$ instead of $A$ and correcting for the factor of two at the end, to make it work for a general matrix $A$.
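Concretely, one product with $A^\prime$ costs one product with $A$ and one with $A^\ast$, and we never need to form $A^\prime$ explicitly. Below is a sketch of that reduction (my own code, shown for an even $p$ for simplicity): it estimates $\lVert A^\prime \rVert_p^p$ with the same estimator as before and halves the result.

```python
import math
import numpy as np

def matvec_dilation(A, z):
    """Product with A' = [[0, A^T], [A, 0]] without forming it: [x; y] -> [A^T y; A x]."""
    n = A.shape[1]
    x, y = z[:n], z[n:]
    return np.concatenate([A.T @ y, A @ x])

def estimate_schatten_pp_general(A, p, T, seed=0):
    """Estimate ||A||_p^p for a rectangular A via its symmetric dilation A' (even p)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    lo, hi = p // 2, math.ceil(p / 2)
    total = 0.0
    for _ in range(T):
        u = rng.standard_normal(m + n)
        iterates = [u]
        for _ in range(hi):
            iterates.append(matvec_dilation(A, iterates[-1]))
        total += iterates[lo] @ iterates[hi]
    return total / T / 2.0                 # ||A'||_p^p = 2 ||A||_p^p

rng = np.random.default_rng(5)
A = rng.standard_normal((80, 50))
p = 4

exact = (np.linalg.svd(A, compute_uv=False) ** p).sum()
print(exact, estimate_schatten_pp_general(A, p, T=20_000))
```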

Closing Remarks

I hope you found the elegance and simplicity of this randomized algorithm as compelling as I did, and still do, ten years after its publication. This rich literature contains numerous other magical-feeling algorithms for norm approximation and other operations on vectors and matrices, and I encourage you to explore it. There are a number of excellent monographs on the subject, such as (Woodruff, 2014), (Murray et al., 2023), and (Martinsson & Tropp, 2021), and references therein. I hope to highlight some of the more representative algorithms in future posts.

  1. Li, Y., Nguyen, H. L., & Woodruff, D. P. (2014). On sketching matrix norms and the top singular vector. Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 1562–1581.
  2. Woodruff, D. P. (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2), 1–157. https://doi.org/10.1561/0400000060
  3. Murray, R., Demmel, J., Mahoney, M. W., Erichson, N. B., Melnichenko, M., Malik, O. A., Grigori, L., Luszczek, P., Dereziński, M., Lopes, M. E., Liang, T., Luo, H., & Dongarra, J. (2023). Randomized Numerical Linear Algebra: A Perspective on the Field With an Eye to Software. https://arxiv.org/abs/2302.11474
  4. Martinsson, P.-G., & Tropp, J. (2021). Randomized Numerical Linear Algebra: Foundations & Algorithms. https://arxiv.org/abs/2002.01387