The JL Lemma offers simple and elegant machinery for reducing the dimensionality of a fixed set of high-dimensional points while approximately preserving their pairwise Euclidean distances. This post explains the concept and describes simple randomized constructions of such a transformation.
Say you have a set $V$ of $m$ points in some high-dimensional space $\mathbb{R}^n$. Let's say $n$ is really large, making any subsequent computation on the $m \times n$ data matrix (let's call it $X$) rather expensive. This is where dimensionality reduction can prove helpful.
We know how to apply a number of tricks from basic linear algebra to do just that. For example, we can obtain a low-rank approximation of $X$ by the simple act of projecting it onto the subspace spanned by its top $k$ singular vectors. That gives a matrix $\tilde{X}$ with rank at most $k$ such that $\|X - \tilde{X}\|_F$ is small.
In some applications, it's not enough to approximately preserve the Frobenius norm of $X$. For example, we may be interested, instead, in approximately preserving the pairwise Euclidean distances between the $m$ points, a property that is important in Nearest Neighbor Search. That is, if $\tilde{x}_i \in \mathbb{R}^k$ is the image of $x_i \in V$ under our dimensionality-reducing transformation (where presumably $k \ll n$), then we want the following to hold for all $1 \le i < j \le m$:

$$(1 - \epsilon)\,\|x_i - x_j\|_2^2 \;\le\; \|\tilde{x}_i - \tilde{x}_j\|_2^2 \;\le\; (1 + \epsilon)\,\|x_i - x_j\|_2^2 \tag{1}$$

for some $\epsilon \in (0, 1)$. Occasionally I'll write the above inequalities more compactly as:

$$\|\tilde{x}_i - \tilde{x}_j\|_2^2 = (1 \pm \epsilon)\,\|x_i - x_j\|_2^2. \tag{2}$$
It's fairly easy to see that a bound on the Frobenius norm does not necessarily translate to this $(1 \pm \epsilon)$-approximation of pairwise Euclidean distances! We might in fact be able to lower the error in the Frobenius norm by moving some points farther from each other and others closer to each other. In other words, approximately preserving a global notion of distance is not equivalent to approximately preserving the local distances required by Equation (2).
That's where the seminal result due to Johnson and Lindenstrauss comes into the picture. It has led to a series of works that offer simple, elegant, linear, and randomized machinery to reduce dimensionality while approximately preserving local distances. These probabilistic algorithms play an important role in randomized numerical linear algebra, for example in subspace embeddings. So it's worth reviewing the basics. Let's dive into it.
The Johnson-Lindenstrauss Transform
Distributional Johnson-Lindenstrauss (JL) Lemma. For every positive integer $n$ and $\epsilon, \delta \in (0, 1)$, there is a distribution $\Pi$ over linear maps in $\mathbb{R}^{k \times n}$ with $k = \Theta(\epsilon^{-2} \log(1/\delta))$, such that, for any $u \in \mathbb{R}^n$:

$$\mathbb{P}_{S \sim \Pi}\left[\,\|Su\|_2^2 = (1 \pm \epsilon)\,\|u\|_2^2\,\right] \ge 1 - \delta. \tag{3}$$
I'll refer to Equation (3) as the JL Property and say that a random matrix $S$ has the JL property if Equation (3) holds for $S$'s distribution.
How does this lemma relate to Equation (2) and our search for a transformation that preserves pairwise Euclidean distances? By setting $u = (x_i - x_j)/\|x_i - x_j\|_2$ for all pairs $x_i, x_j \in V$, and choosing the failure probability in the lemma to be at most $\delta / \binom{m}{2}$, we can apply the union bound and obtain that $\|Sx_i - Sx_j\|_2^2 = (1 \pm \epsilon)\,\|x_i - x_j\|_2^2$ simultaneously for all pairs with probability at least $1 - \delta$. In fact, that procedure gives the more familiar version of the Johnson-Lindenstrauss lemma, stated below:
Johnson-Lindenstrauss Lemma. Given a set $V$ of $m$ points in $\mathbb{R}^n$ and $\epsilon \in (0, 1)$, there exists a mapping $f: \mathbb{R}^n \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log m)$, such that for all pairs $x_i, x_j \in V$, $\|f(x_i) - f(x_j)\|_2^2 = (1 \pm \epsilon)\,\|x_i - x_j\|_2^2$.
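To spell out the union bound from the previous paragraph: applying the distributional lemma with failure probability $\delta/\binom{m}{2}$ to each normalized difference vector $u = (x_i - x_j)/\|x_i - x_j\|_2$ gives

$$\mathbb{P}\left[\,\exists\, i < j:\; \left|\,\|Sx_i - Sx_j\|_2^2 - \|x_i - x_j\|_2^2\,\right| > \epsilon\,\|x_i - x_j\|_2^2\,\right] \;\le\; \binom{m}{2} \cdot \frac{\delta}{\binom{m}{2}} \;=\; \delta,$$

at the cost of $k = \Theta\left(\epsilon^{-2}\log\tfrac{\binom{m}{2}}{\delta}\right) = \Theta(\epsilon^{-2}\log(m/\delta))$; taking $\delta$ to be any fixed constant recovers the $k = O(\epsilon^{-2}\log m)$ stated above.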
Before we continue, I should mention a neat consequence of the distributional JL lemma: inner products, too, are approximately preserved! Concretely, if points $u$ and $v$ lie on the unit sphere $\mathcal{S}^{n-1}$, then:

$$\mathbb{P}\left[\,\left|\langle Su, Sv \rangle - \langle u, v \rangle\right| \le \epsilon\,\right] \ge 1 - \delta. \tag{4}$$
That's because we can write:

$$4\left(\langle Su, Sv \rangle - \langle u, v \rangle\right) = \|Su + Sv\|_2^2 - \|Su - Sv\|_2^2 - 4\langle u, v \rangle,$$

so that, by applying Equation (3) to both $u + v$ and $u - v$ (adjusting $\delta$ by a factor of 2 for the union bound), with probability at least $1 - \delta$ we have that:

$$4\left(\langle Su, Sv \rangle - \langle u, v \rangle\right) \le (1 + \epsilon)\|u + v\|_2^2 - (1 - \epsilon)\|u - v\|_2^2 - 4\langle u, v \rangle = \epsilon\left(\|u + v\|_2^2 + \|u - v\|_2^2\right) = 2\epsilon\left(\|u\|_2^2 + \|v\|_2^2\right) = 4\epsilon,$$

with the matching lower bound of $-4\epsilon$ following symmetrically. Dividing by 4 gives Equation (4).
More generally, for arbitrary $x, y \in \mathbb{R}^n$, the following statement holds:

$$\mathbb{P}\left[\,\left|\langle Sx, Sy \rangle - \langle x, y \rangle\right| \le \epsilon\,\|x\|_2\,\|y\|_2\,\right] \ge 1 - \delta,$$

which follows by applying Equation (4) to $u = x/\|x\|_2$ and $v = y/\|y\|_2$.
Example Constructions
Let’s review a few popular constructions of the random matrix S.
Gaussians
Let $S = \frac{1}{\sqrt{k}} R \in \mathbb{R}^{k \times n}$ where the entries of $R$ are iid standard normal random variables. Then $S$ has the JL property.
Proof: First consider the case of $u \in \mathcal{S}^{n-1}$. Notice that $Su$ is a $k$-dimensional Gaussian random vector whose entries are iid $\mathcal{N}(0, 1/k)$ (using $\|u\|_2 = 1$). That implies that $\|Su\|_2^2$ is distributed as $X/k$ where $X$ is a $\chi^2$ random variable with $k$ degrees of freedom. We can then apply known tail bounds for the $\chi^2$ distribution with $k$ degrees of freedom, such as:
$$\mathbb{P}\left[\,\left|\frac{X}{k} - 1\right| \ge \epsilon\,\right] \le 2\exp\left(-\frac{k\epsilon^2}{8}\right), \tag{5}$$
so that for our specific case, we have that:
$$\mathbb{P}\left[\,\left|\|Su\|_2^2 - 1\right| \ge \epsilon\,\right] \le 2\exp\left(-\frac{k\epsilon^2}{8}\right).$$
Setting the right-hand side to $\delta$, we get that:

$$2\exp\left(-\frac{k\epsilon^2}{8}\right) = \delta \;\Longrightarrow\; k = \Omega\left(\epsilon^{-2}\log\frac{1}{\delta}\right).$$
Extending this to arbitrary $x \in \mathbb{R}^n$ is simply a matter of setting $u = x/\|x\|_2$, which gives the JL property.
That's so elegant, isn't it? All we need to do is sample a $k \times n$ matrix whose entries are standard normal random variables, scale it by $1/\sqrt{k}$, and apply the transformation to our points! That's it! We get a whole new set of $k$-dimensional points where the worst distortion of pairwise distances is bounded with high probability.
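To make this concrete, here is a minimal NumPy sketch of the Gaussian construction together with an empirical check of the distortion. The function names and the constant used to pick $k$ are my own choices for illustration, not prescribed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_jl_matrix(k, n):
    """S = R / sqrt(k) with R having iid standard normal entries (the Gaussian JL map)."""
    return rng.standard_normal((k, n)) / np.sqrt(k)

# Empirical check: project m points from R^n down to R^k and look at the distortion
# of squared pairwise distances. The constant 8 in k is only illustrative.
m, n, eps = 50, 10_000, 0.25
k = int(np.ceil(8 * np.log(m) / eps**2))
X = rng.standard_normal((m, n))        # m x n data matrix, one point per row
Y = X @ gaussian_jl_matrix(k, n).T     # m x k projected points

def sq_dists(A):
    # squared Euclidean distances between all pairs of rows
    sq = np.sum(A**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

iu = np.triu_indices(m, k=1)           # indices of distinct pairs i < j
ratios = sq_dists(Y)[iu] / sq_dists(X)[iu]
print(f"k = {k}: distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
```

The printed ratios should mostly fall within $1 \pm \epsilon$, in line with Equation (2).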
Sparse Transformations
It turns out that, while the result is simple to prove, a dense Gaussian matrix can be a bit expensive to apply in practice. There are other constructions that give simpler matrices $S$ that are faster to use. One example is to sample a matrix whose entries are Rademacher (uniformly random $\pm 1$) random variables:
Let $S = \frac{1}{\sqrt{k}} R \in \mathbb{R}^{k \times n}$ where the entries of $R$ are iid Rademacher random variables. Then $S$ has the JL property.
Proof: As in the previous proof, it is enough to consider points on the unit sphere. As before, we need to argue about the concentration of $\|Su\|_2^2$ around its mean, which turns out to be $\|u\|_2^2 = 1$:

$$\mathbb{E}\left[\|Su\|_2^2\right] = \frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left[\Big(\sum_{j=1}^{n} r_{ij}\,u_j\Big)^2\right] = \frac{1}{k}\sum_{i=1}^{k}\sum_{j=1}^{n} u_j^2 = \|u\|_2^2 = 1,$$

where the $r_{ij}$'s are Rademacher random variables and the cross terms vanish because $\mathbb{E}[r_{ij} r_{il}] = 0$ for $j \ne l$. The final step of bounding the tail probability (as in Equation (5)) is a bit involved, so I'll just note that (Achlioptas, 2001) (Lemma 5) proves a bound of the following form:
$$\mathbb{P}\left[\,\left|\|Su\|_2^2 - 1\right| \ge \epsilon\,\right] \le 2\exp\left(-\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)\right).$$
Again setting the right-hand side to $\delta$ gives the desired result: $k = \Omega(\epsilon^{-2}\log(1/\delta))$.
Finally, we can even make $S$ sparse by drawing the entries of $R$ from the following distribution:

$$R_{ij} = \begin{cases} +\sqrt{3} & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -\sqrt{3} & \text{with probability } 1/6. \end{cases}$$
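Here is a short NumPy sketch of this sparse construction; the function name and the quick sparsity check are my own scaffolding.

```python
import numpy as np

def sparse_jl_matrix(k, n, seed=None):
    """S = R / sqrt(k), where each entry of R is +sqrt(3) or -sqrt(3) with
    probability 1/6 each, and 0 with probability 2/3 (Achlioptas, 2001)."""
    rng = np.random.default_rng(seed)
    R = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)], size=(k, n), p=[1/6, 2/3, 1/6])
    return R / np.sqrt(k)

S = sparse_jl_matrix(k=256, n=10_000, seed=1)
print(f"fraction of zero entries: {np.mean(S == 0):.2f}")  # roughly 2/3
```

Since about two thirds of the entries are zero, storing $S$ and computing $Sx$ are both roughly three times cheaper than with the dense constructions above.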
Fast JL Transform (FJLT)
There is an even faster-to-apply family of JL transformations that's simply called the Fast JL Transform, or FJLT for short. Let's see how it works!
Let $S = \frac{1}{\sqrt{k}} P H D$ where: $D$ is an $n \times n$ diagonal matrix whose diagonal entries are iid Rademacher random variables; $H$ is an $n \times n$ matrix with the property that $H^\top H = nI$ and $\max_{i,j} |H_{ij}| \le 1$; and $P$ is a $k \times n$ matrix where each row is 0 everywhere except for a 1 in one uniformly random column. Then $S$ has the JL property with $k = \Omega(\epsilon^{-2}\log(n/\delta)\log(1/\delta))$.
In general, a matrix H with that property is called an unnormalized bounded orthogonal system. A concrete example is the Hadamard matrix.
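As a rough sketch of what applying this transform might look like with a Walsh-Hadamard matrix (which is why the transform is "fast": $Hx$ can be computed in $O(n \log n)$ time), here is a NumPy implementation for a single vector. It assumes $n$ is a power of two, and the helper names are mine.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: computes H x in O(n log n) time, where H is
    the unnormalized n x n Hadamard matrix (entries +-1, so H^T H = n I)."""
    v = x.copy()
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v

def fjlt(u, k, seed=None):
    """Apply S u = (1/sqrt(k)) P H D u to a single vector u (len(u) a power of two)."""
    rng = np.random.default_rng(seed)
    n = len(u)
    d = rng.choice([-1.0, 1.0], size=n)   # diagonal of D: random signs
    v = fwht(d * u)                        # v = H D u
    rows = rng.integers(0, n, size=k)      # P: each row picks one uniformly random column
    return v[rows] / np.sqrt(k)

u = np.random.default_rng(0).standard_normal(2**13)
u /= np.linalg.norm(u)
print(np.sum(fjlt(u, k=2048, seed=1) ** 2))  # should be close to ||u||_2^2 = 1
```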
Proof: As before, we do this analysis for a vector $u \in \mathbb{R}^n$ with $\|u\|_2 = 1$.
Part I: First, consider $Y = HD$. We can immediately notice a few things about the product $v = Yu$. First, its $\ell_2$ norm is simply:

$$\|v\|_2^2 = \|HDu\|_2^2 = u^\top D H^\top H D u = u^\top D (nI) D u = n\,\|u\|_2^2 = n,$$

using $D^2 = I$.
Second, its $i$-th entry is $v_i = \sum_j H_{ij} D_{jj} u_j$. By following the same method used to prove that a Rademacher random variable is sub-Gaussian, we can show that each term $H_{ij} D_{jj} u_j$ is a sub-Gaussian random variable with variance proxy $H_{ij}^2 u_j^2$. As a result, $v_i$, being the sum of $n$ independent sub-Gaussian random variables, is itself sub-Gaussian with variance proxy $\sigma^2 = \sum_j H_{ij}^2 u_j^2 \le \|u\|_2^2 = 1$, using $|H_{ij}| \le 1$.
Since the $v_i$'s are mean-zero sub-Gaussian, we know the following Chernoff-style bound holds:

$$\mathbb{P}\left[\,|v_i| \ge t\,\right] \le 2\exp\left(-\frac{t^2}{2\sigma^2}\right) \le 2e^{-t^2/2},$$
so that if $t = \sqrt{2\log(4n/\delta)}$, then $\mathbb{P}[|v_i| \ge t] \le \delta/(2n)$. By an application of the union bound over the $n$ coordinates, we can deduce the following, which is known as the flattening lemma:

$$\mathbb{P}\left[\,\|v\|_\infty \ge t\,\right] \le \frac{\delta}{2}.$$
From here on, let's condition on the event that the flattening lemma holds, i.e., that $\|v\|_\infty < t$, which happens with probability at least $1 - \delta/2$.
Part II: Now, let's consider $\frac{1}{\sqrt{k}} P v$. We have to show that the deviation of the $\ell_2$ norm of this product from the norm of $u$ is bounded. In other words, we want to assert that:

$$\mathbb{P}\left[\,\left|\left\|\tfrac{1}{\sqrt{k}} P v\right\|_2^2 - 1\right| \ge \epsilon\,\right] \le \delta. \tag{P1}$$
Let $X = \left\|\tfrac{1}{\sqrt{k}} P v\right\|_2^2$. This can be expressed as the sum $X = \sum_{i=1}^{k} X_i$, where $X_i = v_{J_i}^2 / k$ and $J_i$ is a uniformly random index in $[n]$. Since $\mathbb{E}[v_{J_i}^2] = \frac{1}{n}\sum_{j} v_j^2 = \frac{\|v\|_2^2}{n} = 1$, we get $\mathbb{E}[X_i] = 1/k$ and therefore $\mathbb{E}[X] = \sum_{i=1}^{k} \mathbb{E}[X_i] = 1$.
Conditioned on the flattening lemma, $v_j^2 \le 2\log(4n/\delta)$ for every $j$, so that $X_i \le 2\log(4n/\delta)/k = \tau$, making $X_i$ a bounded random variable, hence sub-exponential. Their sum, $X$, is therefore sub-exponential as well. Moreover, the variance of each term satisfies $\operatorname{Var}[X_i] \le \mathbb{E}[X_i^2] = \frac{1}{k^2}\cdot\frac{1}{n}\sum_j v_j^4 \le \frac{\|v\|_\infty^2}{k^2} \le \frac{2\log(4n/\delta)}{k^2}$, so the variance of $X$ is bounded by $2\log(4n/\delta)/k$. We can now apply Bernstein's inequality as follows:

$$\mathbb{P}\left[\,|X - 1| \ge \epsilon\,\right] \le 2\exp\left(-\frac{\epsilon^2/2}{\operatorname{Var}[X] + \tau\epsilon/3}\right) \le 2\exp\left(-\frac{\epsilon^2\,k}{(4 + 4\epsilon/3)\log(4n/\delta)}\right).$$
When $k = \Omega\left(\epsilon^{-2}\log(n/\delta)\log(1/\delta)\right)$, the above probability is at most $\delta/2$.
Part III: We have thus far conditioned on the event $\mathcal{E}$ that $\|v\|_\infty \le \sqrt{2\log(4n/\delta)}$, which holds with probability at least $1 - \delta/2$, and showed that, under this conditioning, $\left|\|Su\|_2^2 - 1\right| \ge \epsilon$ happens with probability at most $\delta/2$. We must now remove our condition:

$$\mathbb{P}\left[\,\left|\|Su\|_2^2 - 1\right| \ge \epsilon\,\right] \le \mathbb{P}\left[\,\left|\|Su\|_2^2 - 1\right| \ge \epsilon \;\middle|\; \mathcal{E}\,\right] + \mathbb{P}\left[\,\bar{\mathcal{E}}\,\right] \le \frac{\delta}{2} + \frac{\delta}{2} = \delta,$$

which establishes inequality (P1) and completes the proof.
Achlioptas, D. (2001). Database-friendly random projections. Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 274–281.