In an earlier post, we saw how the Johnson-Lindenstrauss Transform can reduce the dimensionality of a set of $m$ points while preserving their pairwise distances. That's neat, but what if we need to preserve the pairwise distances between all (infinitely many) points in an entire subspace? This post introduces the concept of subspace embedding, which answers that question.
Say we have a matrix $A \in \mathbb{R}^{n \times d}$ where $n \gg d$. If a linear map $S \in \mathbb{R}^{k \times n}$, where $k \ll n$, has the property that, for all $x \in \mathbb{R}^d$ and some $\epsilon \in (0, 1)$, the following holds:

$$(1 - \epsilon) \|Ax\|_2 \le \|SAx\|_2 \le (1 + \epsilon) \|Ax\|_2,$$

then we call $S$ an $\ell_2$-subspace embedding of the column space of $A$, or simply an $\ell_2$-subspace embedding of $A$. Essentially, $S$ preserves the norm of every vector in the column space of $A$ up to a $(1 \pm \epsilon)$ factor. To simplify notation, I'll write the above expression as $\|SAx\|_2 = (1 \pm \epsilon) \|Ax\|_2$.
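To make the condition concrete, here is a minimal numerical sketch (assuming numpy; the scaled Gaussian $S$ is a placeholder whose choice is only justified in the Construction section below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 1000, 10, 300, 0.5

A = rng.standard_normal((n, d))               # tall matrix, n >> d
S = rng.standard_normal((k, n)) / np.sqrt(k)  # candidate sketch, k << n

# Spot-check the condition on a few directions x (the definition
# requires it for ALL x, which this of course does not prove).
for _ in range(5):
    x = rng.standard_normal(d)
    ratio = np.linalg.norm(S @ A @ x) / np.linalg.norm(A @ x)
    assert 1 - eps <= ratio <= 1 + eps
    print(f"||SAx||_2 / ||Ax||_2 = {ratio:.3f}")
```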
We can, without loss of generality, assume that $A$ has orthonormal columns. That is because, if that's not the case, we can redefine $A := U$, where $U \in \mathbb{R}^{n \times \rho}$ comes from the (thin) SVD $A = U \Sigma V^*$ and $\rho = \operatorname{rank}(A)$. Indeed, the sets $\{y \mid y = Ax,\ x \in \mathbb{R}^d\}$ and $\{y \mid y = Uz,\ z \in \mathbb{R}^{\rho}\}$ are the same.
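In code, this reduction is a single thin SVD (a sketch, assuming numpy; the rank-deficient $A$ below is constructed only to make $\rho < d$ visible):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
A = rng.standard_normal((n, 5)) @ rng.standard_normal((5, d))  # rank 5 < d

# Thin SVD; keep only the first rho = rank(A) columns of U.
U, sigma, Vh = np.linalg.svd(A, full_matrices=False)
rho = np.linalg.matrix_rank(A)
U = U[:, :rho]

assert np.allclose(U.T @ U, np.eye(rho))  # orthonormal columns
assert np.allclose(U @ (U.T @ A), A)      # same column space as A
```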
What that means is that we can simplify the condition to $\|SAx\|_2 = (1 \pm \epsilon) \|x\|_2$, since now $\|Ax\|_2 = \|x\|_2$. Due to the linearity of the maps involved, we can also require that the above hold only for unit vectors $x$. That is in turn equivalent (up to a constant factor in $\epsilon$) to $\|I - A^* S^* S A\|_2 \le \epsilon$, where $\|\cdot\|_2$ is the operator norm of its argument. To see this, note that

$$\sup_{\|x\|_2 = 1} \left| \|SAx\|_2^2 - 1 \right| = \sup_{\|x\|_2 = 1} \left| x^* (A^* S^* S A - I) x \right| = \|A^* S^* S A - I\|_2,$$

where we used the SVD $A^* S^* S A - I = U \Sigma V^*$ and the fact that $U$ and $V$ have orthonormal columns: for a symmetric matrix, the supremum of $|x^* M x|$ over unit vectors equals its largest singular value.
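This operator-norm form packs the worst case over all unit vectors $x$ into a single number, which makes it convenient to check numerically (a sketch, assuming numpy and an $A$ with orthonormal columns as above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 1000, 10, 300

A, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal columns
S = rng.standard_normal((k, n)) / np.sqrt(k)

SA = S @ A
# sup over ALL unit x of | ||SAx||^2 - 1 |, as one operator norm.
distortion = np.linalg.norm(np.eye(d) - SA.T @ SA, ord=2)
print(f"||I - (SA)^T (SA)||_2 = {distortion:.3f}")
```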
What properties should $S$ have? Well, first and foremost, we want $k \ll n$ so that $SA$ has as few rows as possible. That is, we want the smallest sketch dimension $k$ that still offers a $(1 \pm \epsilon)$-approximation of the Euclidean norm on the column space of $A$. Additionally, we want to be able to compute $SA$ rather quickly, which often means we prefer a sparser $S$.
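To illustrate the sparsity point, here is a sketch (assuming scipy) of an $S$ with a single random $\pm 1$ per column, in the style of the CountSketch / sparse-embedding construction of Clarkson and Woodruff, which I won't analyze here:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(3)
n, d, k = 100_000, 20, 2_000

A = rng.standard_normal((n, d))

# One +/-1 entry per column of S: nnz(S) = n, so computing S @ A
# takes O(n d) time instead of the O(k n d) of a dense S.
rows = rng.integers(0, k, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
S = sparse.csr_matrix((signs, (rows, np.arange(n))), shape=(k, n))

SA = S @ A  # cheap sketch of a 100k-row matrix
```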
Definition
Suppose that we have a distribution $\Pi$ over the space of all $k \times n$ linear maps, where $k$ is a function of $n, d, \epsilon, \delta$, with the following property: for any fixed matrix $A \in \mathbb{R}^{n \times d}$, with probability at least $1 - \delta$ over the draw of $S \sim \Pi$, $S$ is a $(1 \pm \epsilon)$ $\ell_2$-subspace embedding of $A$. Then we call $\Pi$ an $(\epsilon, \delta)$-oblivious $\ell_2$-subspace embedding.
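Note the order of quantifiers: $A$ is fixed first, and $S$ is drawn without looking at it, hence "oblivious". A quick Monte Carlo sketch of the failure probability $\delta$ (assuming numpy, reusing the operator-norm criterion from above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k, eps, trials = 500, 10, 300, 0.5, 200

# Fix A first; a draw of S must work for this A with probability 1 - delta.
A, _ = np.linalg.qr(rng.standard_normal((n, d)))

failures = 0
for _ in range(trials):
    S = rng.standard_normal((k, n)) / np.sqrt(k)  # S ~ Pi, oblivious to A
    SA = S @ A
    failures += np.linalg.norm(np.eye(d) - SA.T @ SA, ord=2) > eps

print(f"empirical failure rate ~ {failures / trials:.3f}")
```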
As a side note, in addition to oblivious embeddings, we can design sampling-based sketches via leverage score sampling. I'll discuss these in a later post.
Construction
Theorem: Let $S = \frac{1}{\sqrt{k}} R \in \mathbb{R}^{k \times n}$, where the entries of $R$ are i.i.d. standard normal random variables. If $k = \Theta(\epsilon^{-2}(d + \log(1/\delta)))$, then $S$ is a $(1 \pm \epsilon)$ $\ell_2$-subspace embedding for any fixed matrix $A \in \mathbb{R}^{n \times d}$ with probability at least $1 - \delta$.
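Here is a sketch of this construction (assuming numpy; the constant `c = 8.0` is an arbitrary stand-in for the unspecified constant hidden in the $\Theta(\cdot)$):

```python
import numpy as np

def gaussian_embedding(n, d, eps, delta, c=8.0, rng=None):
    """Draw S = R / sqrt(k), R iid N(0,1), k = c * eps^-2 * (d + log(1/delta))."""
    rng = rng or np.random.default_rng()
    k = int(np.ceil(c * eps**-2 * (d + np.log(1.0 / delta))))
    return rng.standard_normal((k, n)) / np.sqrt(k)

rng = np.random.default_rng(5)
n, d, eps, delta = 5000, 10, 0.5, 0.01
A, _ = np.linalg.qr(rng.standard_normal((n, d)))  # WLOG orthonormal columns

S = gaussian_embedding(n, d, eps, delta, rng=rng)
print(S.shape[0])  # k depends on d, eps, delta -- not on n
print(np.linalg.norm(np.eye(d) - (S @ A).T @ (S @ A), ord=2))  # <= eps w.h.p.
```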
Proof: Notice that we are trying to find an $\ell_2$-subspace embedding for the column space of $A$, and, due to linearity, we can focus only on unit vectors in that column space. The idea behind the proof is to first lay a finite net over the unit sphere of the column space of $A$ with the property that a Johnson-Lindenstrauss Transform for the net is an $\ell_2$-subspace embedding for the entire space. Let's get to it.
Consider the set $\mathcal{S} = \{y \mid y = Ax,\ \|y\|_2 = 1\}$ and a $1/2$-net $N$ of it. We can write every $y \in \mathcal{S}$ as the series $y = \sum_{i=0}^{\infty} w_i$, where each $w_i$ is a scaled member of $N$ with the property that $\|w_i\|_2 \le 2^{-i}$. To see that that's always possible, take $w_0 := v_0 \in N$ to be a net point closest to $y$ and write $y = w_0 + (y - w_0)$. Clearly $\|y - w_0\|_2 \le 1/2$ because $N$ is a $1/2$-net. Now write $y = w_0 + w_1 + (y - w_0 - w_1)$, where $w_1 = \|y - w_0\|_2\, v_1$ and $v_1 \in N$ is a net point closest to the unit vector $(y - w_0)/\|y - w_0\|_2$. Then:

$$\|y - w_0 - w_1\|_2 = \|y - w_0\|_2 \left\| \frac{y - w_0}{\|y - w_0\|_2} - v_1 \right\|_2 \le \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4},$$

and $\|w_1\|_2 = \|y - w_0\|_2 \le 2^{-1}$, as desired. We can continue to expand the residual as above and reason by induction: after $t$ steps, $\|y - \sum_{i=0}^{t} w_i\|_2 \le 2^{-(t+1)}$ and $\|w_t\|_2 \le 2^{-t}$.
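This greedy decomposition is easy to see in a toy case. Below is a sketch (assuming numpy) for a 2-dimensional column space, where a $1/2$-net of the unit circle is just an angular grid; `greedy_decompose` is a hypothetical helper, not anything from the proof:

```python
import numpy as np

rng = np.random.default_rng(6)

# Angular spacing 0.5 rad gives chord distance <= 2*sin(0.25) < 1/2,
# so this grid is (comfortably) a 1/2-net of the unit circle.
angles = np.arange(0, 2 * np.pi, 0.5)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def greedy_decompose(y, net, steps=8):
    """Greedily write y = sum_i w_i with ||w_i|| <= 2^-i."""
    residual, ws = y.copy(), []
    for _ in range(steps):
        r = np.linalg.norm(residual)
        if r == 0:
            break
        v = net[np.argmax(net @ (residual / r))]  # closest net point
        ws.append(r * v)
        residual = residual - r * v
    return ws

theta = rng.uniform(0, 2 * np.pi)
y = np.array([np.cos(theta), np.sin(theta)])  # unit vector in the space
ws = greedy_decompose(y, net)
for i, w in enumerate(ws):
    assert np.linalg.norm(w) <= 2.0**-i + 1e-12
print(np.linalg.norm(y - sum(ws)))  # tiny residual
```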
Now let $S$ be a JLT for $N$, so that, with probability at least $1 - \delta$, we have $\langle Sv, Sv' \rangle = \langle v, v' \rangle \pm \epsilon$ for all pairs $v, v' \in N$; since each $w_i$ is a scaled net member, the guarantee scales bilinearly to $\langle Sw_i, Sw_j \rangle = \langle w_i, w_j \rangle \pm \epsilon \|w_i\|_2 \|w_j\|_2$. If we applied the same map to any $y \in \mathcal{S}$ and expanded it as $y = \sum_i w_i$, then we'd get, with probability at least $1 - \delta$:

$$\|Sy\|_2^2 = \sum_{i,j} \langle Sw_i, Sw_j \rangle = \sum_{i,j} \left( \langle w_i, w_j \rangle \pm \epsilon \|w_i\|_2 \|w_j\|_2 \right) = \|y\|_2^2 \pm \epsilon \sum_{i,j} 2^{-i} 2^{-j} = 1 \pm 4\epsilon.$$
So, we have established that a JLT for the $1/2$-net of the unit sphere in the column space of $A$ is, after rescaling $\epsilon$ by the constant $4$, an $\ell_2$-subspace embedding for the column space of $A$. We know that $|N| \le 5^d$, so it suffices to take $k = \Theta(\epsilon^{-2} \log(5^d/\delta)) = \Theta(\epsilon^{-2}(d + \log(1/\delta)))$, as desired.