Oblivious Subspace Embedding

In an earlier post, we saw how the Johnson-Lindenstrauss Transform can reduce the dimensionality of a set of $m$ points while preserving their pairwise distances. That's neat, but what if we need to preserve the pairwise distances between all (infinitely many) points of an entire subspace? This post introduces the concept of subspace embedding, which answers that question.

Say we have a matrix $A \in \mathbb{R}^{n \times d}$ where $n \gg d$. If a linear map $S \in \mathbb{R}^{k \times n}$, where $k \ll n$, has the property that, for all $x \in \mathbb{R}^d$ and some $\epsilon \in (0, 1)$, the following holds:

$$
(1 - \epsilon) \lVert Ax \rVert_2 \leq \lVert SAx \rVert_2 \leq (1+\epsilon) \lVert Ax \rVert_2,
$$

then we call $S$ an $\ell_2$-subspace embedding of the column space of $A$, or simply an $\ell_2$-subspace embedding of $A$. Essentially, $S$ preserves the norm of every vector in the column space of $A$ up to a $(1 \pm \epsilon)$ multiplicative factor. To simplify notation, I'll write the above expression as $\lVert SAx \rVert_2 = (1 \pm \epsilon) \lVert Ax \rVert_2$.
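To make the definition concrete, here is a minimal numerical sketch (my own illustration, with arbitrary sizes and a scaled Gaussian $S$) that spot-checks the condition on a batch of random vectors $x$:

```python
import numpy as np

# Arbitrary sizes for illustration; k is chosen generously so the check passes.
rng = np.random.default_rng(0)
n, d, k, eps = 10_000, 20, 400, 0.5

A = rng.standard_normal((n, d))
S = rng.standard_normal((k, n)) / np.sqrt(k)   # scaled so E[||S y||^2] = ||y||^2

X = rng.standard_normal((d, 1_000))            # random test vectors x (as columns)
ratios = np.linalg.norm(S @ (A @ X), axis=0) / np.linalg.norm(A @ X, axis=0)
print(ratios.min(), ratios.max())              # expected to lie within [1 - eps, 1 + eps]
```

Random test vectors only spot-check the condition; the operator-norm characterization discussed below certifies every direction of the subspace at once.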

We can, without loss of generality, assume that $A$ has orthonormal columns. That is because, if that's not the case, we can redefine $A := U$ where $U$ comes from the SVD of $A = U \Sigma V^\ast$. Indeed, the sets $\{ y \;|\; y = Ax,\; x \in \mathbb{R}^d \}$ and $\{ y \;|\; y = Uz,\; z \in \mathbb{R}^\rho \}$, where $\rho = \mathop{rank}(A)$, are the same.
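A quick numerical way to convince yourself of this (my own check, with an arbitrary rank-deficient example): the orthogonal projectors onto the column spaces of $A$ and of $U$ coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 8))   # 500 x 8, rank 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
rho = int(np.sum(s > 1e-10))        # numerical rank
U = U[:, :rho]                      # orthonormal basis for the column space of A

P_A = A @ np.linalg.pinv(A)         # orthogonal projector onto col(A)
P_U = U @ U.T                       # orthogonal projector onto col(U)
print(rho, np.allclose(P_A, P_U))   # 5 True
```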

What that means is that we can simplify the condition to $\lVert SAx \rVert_2 = (1 \pm \epsilon) \lVert x \rVert_2$. We can also require that the above hold only for unit vectors $x$, due to the linearity of the maps involved. That is in turn equivalent, up to a constant factor in $\epsilon$, to $\lVert A^\ast S^\ast S A - I \rVert_2 \leq \epsilon$, where $\lVert \cdot \rVert_2$ is the operator norm of its argument. To see one direction:

$$
\begin{aligned}
\lVert SAx \rVert_2 = (1 \pm \epsilon) \lVert x \rVert_2
&\Rightarrow \big\lvert x^\ast \big( A^\ast S^\ast S A - I \big) x \big\rvert \leq 3\epsilon \lVert x \rVert_2^2 \\
&\Rightarrow \sup_{\lVert x \rVert_2 = 1} \big\lvert x^\ast \big( A^\ast S^\ast S A - I \big) x \big\rvert \leq 3\epsilon \\
&\Rightarrow \sigma_1\big(A^\ast S^\ast S A - I\big) \leq 3\epsilon,
\end{aligned}
$$

where the first implication follows from squaring $(1 \pm \epsilon) \lVert x \rVert_2$ and using $\epsilon < 1$, and the last one from the fact that $A^\ast S^\ast S A - I$ is Hermitian, so its largest singular value $\sigma_1$, which is its operator norm, equals the supremum of $\lvert x^\ast (A^\ast S^\ast S A - I) x \rvert$ over unit vectors $x$.
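This operator-norm characterization is also handy in code: with $U$ an orthonormal basis for the column space of $A$, the worst-case distortion of squared norms over the entire subspace can be read off the singular values of $SU$. A small sketch of mine, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 5_000, 10, 400

A = rng.standard_normal((n, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)     # orthonormal basis for col(A)
S = rng.standard_normal((k, n)) / np.sqrt(k)

M = U.T @ S.T @ S @ U - np.eye(d)
op_norm = np.linalg.norm(M, 2)                      # ||U* S* S U - I||_2
sing = np.linalg.svd(S @ U, compute_uv=False)
print(op_norm, np.max(np.abs(sing**2 - 1)))         # the two quantities coincide
```

In other words, checking that every singular value of $SU$ lies in $[1 - \epsilon, 1 + \epsilon]$ certifies the embedding for the whole subspace at once.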

What properties should $S$ have? Well, first and foremost, we want $k \ll n$ so that $SA$ has as few rows as possible. That is, we want to embed the column space of $A$ into a space of as low a dimension as possible while still getting a $(1 \pm \epsilon)$-approximation of the Euclidean norm. Additionally, we want to be able to compute $SA$ rather quickly. That often means we prefer a sparser $S$.

Definition

Suppose that we have a distribution $\Pi$ over the space of all $k \times n$ linear maps, where $k$ is a function of $n, d, \epsilon, \delta$, with the following property: for any fixed matrix $A \in \mathbb{R}^{n \times d}$, a map $S \sim \Pi$ is, with probability at least $1 - \delta$, a $(1 \pm \epsilon)$ $\ell_2$-subspace embedding of $A$. Then we call $\Pi$ an $(\epsilon, \delta)$ oblivious $\ell_2$-subspace embedding.
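The "oblivious" part refers to the fact that $\Pi$ is fixed without looking at $A$. Here is a small experiment of my own (arbitrary constants) that estimates the failure probability $\delta$ empirically for a Gaussian $\Pi$, using the singular values of $SU$ as the embedding test:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps, trials = 2_000, 10, 300, 0.4, 200

A = rng.standard_normal((n, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)      # orthonormal basis for col(A)

failures = 0
for _ in range(trials):
    S = rng.standard_normal((k, n)) / np.sqrt(k)     # drawn without looking at A
    sing = np.linalg.svd(S @ U, compute_uv=False)
    if sing.max() > 1 + eps or sing.min() < 1 - eps:
        failures += 1
print(failures / trials)    # empirical estimate of the failure probability delta
```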

As a side note, in addition to oblivious embeddings, we can design sampling-based sketches built on Leverage Score Sampling. I'll discuss these in a later post.

Construction

Theorem: Let $S = \frac{1}{\sqrt{k}} R \in \mathbb{R}^{k \times n}$ where the entries of $R$ are iid standard normal random variables. If $k = \Theta\big(\epsilon^{-2} (d + \log(1/\delta)) \big)$, then $S$ is a $(1 \pm \epsilon)$ $\ell_2$-subspace embedding for any matrix $A \in \mathbb{R}^{n \times d}$ with probability at least $1 - \delta$.
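Before the proof, a quick empirical sanity check of the statement. The constant hidden inside the $\Theta(\cdot)$ is not pinned down by the theorem, so the factor $c = 8$ below is a guess purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps, delta, c = 10_000, 15, 0.25, 0.01, 8

k = int(np.ceil(c * eps**-2 * (d + np.log(1 / delta))))
S = rng.standard_normal((k, n)) / np.sqrt(k)         # S = R / sqrt(k)

A = rng.standard_normal((n, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)
sing = np.linalg.svd(S @ U, compute_uv=False)
print(k, sing.min(), sing.max())    # singular values expected inside [1 - eps, 1 + eps]
```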

Proof: Notice that we are trying to find an $\ell_2$-subspace embedding for the column space of $A$. We can also focus only on unit vectors in the column space, due to linearity. The idea behind the proof is to first lay a net over the unit sphere of the column space of $A$ with the property that a Johnson-Lindenstrauss Transform for the net is an $\ell_2$-subspace embedding for the entire space. Let's get to it.

Consider the set $\mathcal{S} = \{ y \;|\; y = Ax,\; \lVert y \rVert_2 = 1 \}$ and a $1/2$-net $\mathcal{N}$ of it. We can write every $y \in \mathcal{S}$ as the series $y = \sum_{i=0}^{\infty} w_i$ where $w_i$ is a scaled member of $\mathcal{N}$ with the property that $\lVert w_i \rVert_2 \leq 2^{-i}$. To see that that's always possible, take $w_0 := v_0 \in \mathcal{N}$ to be a net point within distance $1/2$ of $y$, and write $y = w_0 + (y - w_0)$; then $\lVert y - w_0 \rVert_2 \leq 1/2$ by the definition of a $1/2$-net. Now write $y = w_0 + w_1 + (y - w_0 - w_1)$ where $w_1 = \lVert y - w_0 \rVert_2 \, v_1$ for the net point $v_1 \in \mathcal{N}$ closest to $\frac{y - w_0}{\lVert y - w_0 \rVert_2}$. Then:

$$
\lVert y - w_0 - w_1 \rVert_2 = \lVert y - w_0 \rVert_2 \cdot \left\lVert \frac{y - w_0}{\lVert y - w_0 \rVert_2} - v_1 \right\rVert_2 \leq \frac{\lVert y - w_0 \rVert_2}{2} \leq \frac{1}{4},
$$

as desired. We can continue to expand the residual as above and reason by induction.
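Here's a toy version of this greedy decomposition, my own construction for illustration: the unit circle in the plane stands in for the unit sphere of the column space, and 64 equally spaced points form a (much finer than necessary) $1/2$-net.

```python
import numpy as np

rng = np.random.default_rng(5)
angles = np.linspace(0, 2 * np.pi, 64, endpoint=False)
net = np.column_stack([np.cos(angles), np.sin(angles)])   # a 1/2-net of the unit circle

theta = rng.uniform(0, 2 * np.pi)
y = np.array([np.cos(theta), np.sin(theta)])              # the unit vector to decompose

residual, ws = y.copy(), []
for i in range(10):
    norm = np.linalg.norm(residual)
    if norm == 0:
        break
    v = net[np.argmin(np.linalg.norm(net - residual / norm, axis=1))]  # nearest net point
    w = norm * v                                          # the scaled net member w_i
    ws.append(w)
    residual = residual - w
    assert np.linalg.norm(w) <= 2.0**-i + 1e-12           # ||w_i||_2 <= 2^{-i}

print(np.linalg.norm(y - np.sum(ws, axis=0)))             # partial sums converge to y
```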

Now let $S$ be a JLT for $\mathcal{N}$, so that with probability at least $1 - \delta$ we have $\langle Sv, Sv^\prime \rangle = \langle v, v^\prime \rangle \pm \epsilon$ for all pairs $v, v^\prime \in \mathcal{N}$. If we applied the same map to any $y \in \mathcal{S}$, then we'd get, with probability at least $1 - \delta$:

$$
\begin{aligned}
\langle Sy, Sy \rangle &= \Big\langle \sum_i S w_i, \sum_j S w_j \Big\rangle \\
&= \sum_i \langle Sw_i, Sw_i \rangle + \sum_{i \neq j} \langle Sw_i, Sw_j \rangle \\
&= \sum_i \lVert w_i \rVert_2^2 + \sum_{i \neq j} \langle w_i, w_j \rangle \pm \epsilon \sum_{i, j} \lVert w_i \rVert_2 \lVert w_j \rVert_2 \\
&= \Big\langle \sum_i w_i, \sum_j w_j \Big\rangle \pm \mathcal{O}(\epsilon) \\
&= \langle y, y \rangle \pm \mathcal{O}(\epsilon).
\end{aligned}
$$

So, we have established that a JLT for the $1/2$-net over the unit sphere in the column space of $A$ is an $\ell_2$-subspace embedding for the column space of $A$ (the error term above is at most $\epsilon \sum_{i,j} 2^{-i} 2^{-j} = 4\epsilon$). We know that $\lvert \mathcal{N} \rvert \leq 5^d$, so $k = \Omega(\epsilon^{-2} \log (5^d/\delta)) = \Omega(\epsilon^{-2} (d + \log (1/\delta)))$ rows suffice, as desired.
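To get a feel for the numbers: with $d = 20$, $\epsilon = 0.1$, and $\delta = 0.01$, the bound says roughly $k \approx \epsilon^{-2}(d + \log(1/\delta)) = 100 \cdot (20 + 4.6) \approx 2{,}460$ rows (times the hidden constant), independent of $n$, which is exactly what makes the sketch useful when $n \gg d$.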