Johnson-Lindenstrauss Transform

The JL Lemma offers a simple and elegant machinery to reduce the dimensionality of a fixed set of high-dimensional points while approximately preserving their pairwise Euclidean distances. This post explains the concept and describes simple randomized constructions of such a transformation.

Say you have a set $\mathcal{V}$ of $m$ points in some high-dimensional space $\mathbb{R}^n$. Let’s say $n$ is really large, making any subsequent computation on the $m \times n$ data matrix (let’s call it $X$) rather expensive. This is where dimensionality reduction can prove helpful.

We know how to apply a number of tricks from basic linear algebra to do just that. For example, we can obtain a low-rank approximation of $X$ by the simple act of projecting it onto the subspace spanned by its top $k$ singular vectors. That gives a matrix $\tilde{X}$ with rank at most $k$ such that $\lVert X - \tilde{X} \rVert_F$ is small.

In some applications, it’s not enough to approximately preserve the Frobenius norm of $X$. For example, we may be interested, instead, in approximately preserving the pairwise Euclidean distances between the $m$ points, a property that is important in Nearest Neighbor Search. That is, if $\tilde{x}_i \in \mathbb{R}^k$ is the image of $x_i \in \mathcal{V}$ under our dimensionality-reducing transformation (where presumably $k \ll n$), then we want the following to hold for all $1 \leq i < j \leq m$:

$$ (1-\epsilon) \lVert x_i - x_j \rVert_2^2 \leq \lVert \tilde{x}_i - \tilde{x}_j \rVert_2^2 \leq (1+\epsilon) \lVert x_i - x_j \rVert_2^2, \tag{1} $$

for some $\epsilon \in (0, 1)$. Occasionally I’ll write the above inequalities more compactly as:

$$ \lVert \tilde{x}_i - \tilde{x}_j \rVert_2^2 = (1 \pm \epsilon) \lVert x_i - x_j \rVert_2^2. \tag{2} $$

It’s fairly easy to see that a bound on the Frobenius norm does not necessarily translate to this $(1 \pm \epsilon)$-approximation of pairwise Euclidean distances! We might in fact be able to lower the error in the Frobenius norm by moving some points farther from each other and others closer to each other. In other words, approximately preserving a global notion of distance is not equivalent to approximately preserving the local distances required by Equation (2).

That’s where the seminal result due to Johnson and Lindenstrauss comes into the picture. It’s led to a series of works that offer simple, elegant, linear, and randomized machinery to reduce dimensionality while approximately preserving local distances. These probabilistic algorithms play an important role in randomized numerical linear algebra, for example in subspace embeddings. So it’s worth reviewing the basics. Let’s dive into it.

The Johnson-Lindenstrauss Transform

Distributional Johnson-Lindenstrauss (JL) Lemma. For every positive integer $n$ and every $\epsilon, \delta \in (0, 1)$, there is a distribution $\Pi$ over linear maps in $\mathbb{R}^{k \times n}$ with $k = \Theta(\epsilon^{-2} \log \frac{1}{\delta})$, such that, for any $u \in \mathbb{R}^n$:

$$ \mathbb{P}_{S \sim \Pi}\big[ \lVert Su \rVert_2^2 = (1 \pm \epsilon) \lVert u \rVert_2^2 \big] \geq 1 - \delta. \tag{3} $$

I’ll refer to Equation (3) as the JL property and say that a random matrix $S$ has the JL property if Equation (3) holds for its distribution.

How does this lemma relate to Equation (2) and our search for a transformation that preserves pairwise Euclidean distances? By setting $u = (x_i - x_j) / \lVert x_i - x_j \rVert_2$ for each pair $x_i, x_j \in \mathcal{V}$, replacing $\delta$ with $\delta/{m \choose 2}$, and applying the union bound over all ${m \choose 2}$ pairs, we obtain that $\lVert Sx_i - Sx_j \rVert_2^2 = (1 \pm \epsilon) \lVert x_i - x_j \rVert_2^2$ holds for all pairs simultaneously with probability at least $1 - \delta$. In fact, that procedure gives the more familiar version of the Johnson-Lindenstrauss lemma, stated below:

Johnson-Lindenstrauss Lemma. Given a set $\mathcal{V}$ of $m$ points in $\mathbb{R}^n$ and $\epsilon, \delta \in (0, 1)$, for any $k = \Omega(\epsilon^{-2} \log \frac{m}{\delta})$ there exists a mapping $f: \mathbb{R}^n \rightarrow \mathbb{R}^k$ such that for all pairs $x_i, x_j \in \mathcal{V}$, $\lVert f(x_i) - f(x_j) \rVert_2^2 = (1 \pm \epsilon) \lVert x_i - x_j \rVert_2^2$; indeed, the random map above satisfies this for all pairs simultaneously with probability at least $1 - \delta$.
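For a sense of scale, plugging the union-bound recipe into the constant from the $\chi^2$ tail bound derived later in this post, a set of $m = 10^6$ points with $\epsilon = 0.1$ and $\delta = 0.01$ needs only about

$$ k \geq 8\epsilon^{-2} \ln\Big(\frac{2\binom{m}{2}}{\delta}\Big) \approx 800 \ln(10^{14}) \approx 2.6 \times 10^4 $$

dimensions, independently of the original dimension $n$.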

Before we continue, I should mention a neat consequence of the distributional JL lemma. In particular, inner products too are approximately preserved! Concretely, if $u$ and $v$ are points on the unit sphere $\mathbb{S}^{n-1}$, then:

$$ \mathbb{P} \Big[ \big\lvert \langle Su, Sv \rangle - \langle u, v \rangle \big\rvert \leq \epsilon \Big] \geq 1 - \delta. \tag{4} $$

That’s because we can write:

$$ 4\underbrace{\big( \langle Su, Sv \rangle - \langle u, v \rangle \big)}_{\alpha} = \lVert Su + Sv \rVert_2^2 - \lVert Su - Sv \rVert_2^2 - 4\langle u, v \rangle, $$

so that, applying the JL property to both $u + v$ and $u - v$ and taking a union bound (which only affects the constant in $k$), with probability at least $1 - \delta$ we have that:

$$ \begin{aligned} (1 - \epsilon) \lVert u + v \rVert_2^2 - (1 + \epsilon) \lVert u - v \rVert_2^2 - 4\langle u, v \rangle \leq 4\alpha &\leq (1 + \epsilon) \lVert u + v \rVert_2^2 - (1 - \epsilon) \lVert u - v \rVert_2^2 - 4\langle u, v \rangle \\ \Rightarrow \quad -\epsilon \big(2 \lVert u \rVert_2^2 + 2 \lVert v \rVert_2^2\big) \leq 4\alpha &\leq \epsilon \big(2 \lVert u \rVert_2^2 + 2 \lVert v \rVert_2^2\big) \\ \Rightarrow \quad -\epsilon \leq \alpha &\leq \epsilon, \end{aligned} $$

where the middle step uses $\lVert u + v \rVert_2^2 - \lVert u - v \rVert_2^2 = 4\langle u, v \rangle$ and $\lVert u + v \rVert_2^2 + \lVert u - v \rVert_2^2 = 2\lVert u \rVert_2^2 + 2\lVert v \rVert_2^2$, and the last step uses $\lVert u \rVert_2 = \lVert v \rVert_2 = 1$.

More generally, for arbitrary $x, y \in \mathbb{R}^n$, the following statement holds:

$$ \mathbb{P} \Big[ \big\lvert \langle Sx, Sy \rangle - \langle x, y \rangle \big\rvert \leq \epsilon \lVert x \rVert_2 \lVert y \rVert_2 \Big] \geq 1 - \delta, $$

which follows by defining $u = x/\lVert x \rVert_2$ and $v = y/\lVert y \rVert_2$ in Equation (4).
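As a quick numerical sanity check of Equation (4), here is a small sketch in Python/NumPy (the Gaussian map used as $S$ here is just one valid choice; that construction is described in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5_000, 2_000

# Two unit vectors and a Gaussian JL map S = R / sqrt(k).
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
S = rng.standard_normal((k, n)) / np.sqrt(k)

# |<Su, Sv> - <u, v>| should be small (on the order of 1/sqrt(k)).
print(abs((S @ u) @ (S @ v) - u @ v))
```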

Example Constructions

Let’s review a few popular constructions of the random matrix $S$.

Gaussians

Let $S = \frac{1}{\sqrt{k}} R \in \mathbb{R}^{k \times n}$ where the entries of $R$ are iid standard normal random variables. Then $S$ has the JL property.


Proof: First consider the case of $u \in \mathbb{S}^{n-1}$. Notice that $Su$ is a $k$-dimensional Gaussian random vector whose entries are independent and distributed as $\mathcal{N}(0, 1/k)$. That implies that $\lVert Su \rVert_2^2$ is distributed as $X/k$ where $X$ is a $\chi^2$ random variable with $k$ degrees of freedom. We can then apply known tail bounds for a $\chi^2$ random variable with $k$ degrees of freedom, such as:

$$ \mathbb{P} \Big[ \Big\lvert \frac{X}{k} - 1 \Big\rvert \geq \epsilon \Big] \leq 2\exp\Big(-\frac{k\epsilon^2}{8}\Big), \tag{5} $$

so that for our specific case, we have that:

$$ \mathbb{P} \Big[ \big\lvert \lVert Su \rVert_2^2 - 1 \big\rvert \geq \epsilon \Big] \leq 2\exp\Big(-\frac{k\epsilon^2}{8}\Big). $$

Setting the right-hand side to $\delta$, we get that:

$$ 2\exp\Big(-\frac{k\epsilon^2}{8}\Big) = \delta \;\Rightarrow\; k = 8\epsilon^{-2}\ln\frac{2}{\delta} = \Omega\Big(\epsilon^{-2}\log\frac{1}{\delta}\Big). $$

Extending this to an arbitrary $x \in \mathbb{R}^n$ is simply a matter of setting $u = x/\lVert x \rVert_2$ and using the linearity of $S$, which gives the JL property.

That’s so elegant, isn’t it? All we need to do is to sample a $k \times n$ matrix whose entries are standard normal random variables, then scale it by $1/\sqrt{k}$, and apply the transformation to our points! That’s it! We get a whole new set of $k$-dimensional points where the worst distortion of pairwise distances is bounded.
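To make the recipe concrete, here is a minimal sketch in Python/NumPy. The function names (`jl_dimension`, `embed_points`, `pairwise_sq_dists`) are mine rather than from any library, and the constant in `jl_dimension` comes from the $\chi^2$ tail bound above combined with the union bound over the $\binom{m}{2}$ pairs:

```python
import numpy as np

def jl_dimension(m, eps, delta=0.01):
    """Target dimension k >= 8 * eps^-2 * ln(2 * C(m, 2) / delta): the chi-squared
    tail bound above combined with the union bound over all pairs of points."""
    return int(np.ceil(8 * np.log(m * (m - 1) / delta) / eps ** 2))

def embed_points(X, eps, delta=0.01, seed=0):
    """X is m x n; returns the m x k image of its rows under S = R / sqrt(k)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = jl_dimension(m, eps, delta)
    S = rng.standard_normal((k, n)) / np.sqrt(k)
    return X @ S.T

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    G = A @ A.T
    d = np.diag(G)
    return d[:, None] + d[None, :] - 2.0 * G

# Worst pairwise distortion on random data.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10_000))
X_tilde = embed_points(X, eps=0.2)
iu = np.triu_indices(X.shape[0], k=1)
ratios = pairwise_sq_dists(X_tilde)[iu] / pairwise_sq_dists(X)[iu]
print(f"distortion range: [{ratios.min():.3f}, {ratios.max():.3f}]")  # within [0.8, 1.2] w.h.p.
```

Swapping in the sign or sparse entries of the next section only changes the line that samples $S$.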

Sparse Transformations

It turns out that, while the result is simple to prove, the Gaussian construction can be a bit expensive to apply in practice. There are other constructions that give simpler matrices $S$ that are faster to use. One example is a matrix whose entries are Rademacher random variables, i.e., uniform over $\{-1, +1\}$:

Let $S = \frac{1}{\sqrt{k}} R \in \mathbb{R}^{k \times n}$ where the entries of $R$ are iid Rademacher random variables. Then $S$ has the JL property.


Proof: As in the previous proof, it is enough to consider points on the unit sphere. As before, we need to argue about the concentration of $\lVert Su \rVert_2^2$ around its mean, which turns out to be $\lVert u \rVert_2^2 = 1$:

$$ \mathbb{E}\Big[\lVert Su \rVert_2^2\Big] = \mathbb{E}\Big[ \sum_{i=1}^k \Big( \sum_{j=1}^n \frac{1}{\sqrt{k}} r_{ij} u_j \Big)^2 \Big] = \frac{1}{k} \sum_{i=1}^k \Big( \lVert u \rVert_2^2 + \sum_{j \neq l} \mathbb{E}[r_{ij} r_{il}] u_j u_l \Big) = \lVert u \rVert_2^2 = 1, $$

where the $r_{ij}$’s are iid Rademacher random variables, so the cross terms vanish in expectation. The final step of bounding the tail probability (as in Equation (5)) is a bit involved, so I’ll just note that (Achlioptas, 2001) (Lemma 5) proves a bound of the following form:

$$ \mathbb{P}\Big[ \big\lvert \lVert Su \rVert_2^2 - 1 \big\rvert \geq \epsilon \Big] \leq 2\exp\Big(-\frac{k}{2} \big(\epsilon^2/2 - \epsilon^3/3\big) \Big). $$

Again setting the right-hand side to $\delta$ gives the desired result: $k = \Omega(\epsilon^{-2} \log \frac{1}{\delta})$.

Finally, we can even make $S$ sparse by drawing the entries of $R$ from the following distribution:

$$ \begin{cases} \sqrt{3} & \text{with probability } \frac{1}{6}, \\ 0 & \text{with probability } \frac{2}{3}, \\ -\sqrt{3} & \text{with probability } \frac{1}{6}. \end{cases} $$
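Here is a sketch of both constructions in this section in Python/NumPy (function names are mine; the sparse matrix is built densely first for brevity, which a real implementation would avoid):

```python
import numpy as np
from scipy import sparse

def sign_jl(n, k, seed=0):
    """Dense Rademacher construction: S = R / sqrt(k) with R_ij uniform over {-1, +1}."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

def sparse_jl(n, k, seed=0):
    """Entries sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6; about two
    thirds of the entries are zero, so S is stored as a sparse matrix."""
    rng = np.random.default_rng(seed)
    R = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)], size=(k, n), p=[1/6, 2/3, 1/6])
    return sparse.csr_matrix(R / np.sqrt(k))

# Squared norms should be preserved up to a small relative error.
n, k = 4_096, 800
x = np.random.default_rng(1).standard_normal(n)
for S in (sign_jl(n, k), sparse_jl(n, k)):
    print(np.linalg.norm(S @ x) ** 2 / np.linalg.norm(x) ** 2)  # close to 1
```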

Fast JL Transform (FJLT)

There is an even faster-to-apply family of JL transformations that’s simply called the Fast JL Transform, or FJLT for short. Let’s see how it works!

Let $S = \frac{1}{\sqrt{k}} PHD$ where: $D$ is an $n \times n$ diagonal matrix whose diagonal entries are Rademacher random variables; $H$ is an $n \times n$ matrix with the property that $H^\top H = nI$ and $\max_{i,j} \lvert H_{ij} \rvert \leq 1$; and $P$ is a $k \times n$ matrix in which each row is $0$ everywhere except for a single $1$ in one uniformly random column. Then $S$ has the JL property with $k = \Omega(\epsilon^{-2} \log(1/\delta) \log(n/\delta))$.

In general, a matrix $H$ with that property is called an unnormalized bounded orthogonal system. A concrete example is the Hadamard matrix, for which the product $Hx$ can be computed in $O(n \log n)$ time with an FFT-like recursion; this is precisely what makes the transform fast.
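Here is a rough sketch of how $S = \frac{1}{\sqrt{k}} PHD$ can be applied without ever forming $H$ explicitly, assuming $n$ is a power of two so that the Sylvester (Walsh-Hadamard) recursion applies; the helper names `fwht` and `fjlt_apply` are mine:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: returns H x in O(n log n) time, where H is
    the unnormalized n x n Sylvester-Hadamard matrix and n is a power of two."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fjlt_apply(u, k, seed=0):
    """Apply S = (1/sqrt(k)) P H D to u: random sign flips (D), the Hadamard
    transform (H), then k coordinates sampled uniformly with replacement (P)."""
    rng = np.random.default_rng(seed)
    n = len(u)
    d = rng.choice([-1.0, 1.0], size=n)   # diagonal of D
    v = fwht(d * u)                       # v = H D u
    rows = rng.integers(0, n, size=k)     # the nonzero column of each row of P
    return v[rows] / np.sqrt(k)

n, k = 2 ** 13, 1_024
u = np.random.default_rng(1).standard_normal(n)
u /= np.linalg.norm(u)
print(np.linalg.norm(fjlt_apply(u, k)) ** 2)  # should be close to 1
```

Applying $D$ costs $O(n)$, the Walsh-Hadamard transform costs $O(n \log n)$, and $P$ costs $O(k)$, so the whole map is applied in $O(n \log n)$ time instead of the $O(nk)$ of a dense matrix-vector product.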

Proof: As before, we do this analysis for a vector $u \in \mathbb{R}^n$ with $\lVert u \rVert_2 = 1$.

Part I: First, consider $Y = HD$. We can immediately notice a few things about the product $v = Yu$. First, that its $\ell_2$ norm is simply:

$$ \lVert v \rVert_2^2 = \lVert Yu \rVert_2^2 = u^\top D \underbrace{H^\top H}_{nI} D u = n u^\top D^2 u = n \lVert u \rVert_2^2 = n. $$

Second, its $i$-th entry is $v_i = \sum_{j} H_{ij} D_{jj} u_j$. By following the same method used to prove that a Rademacher random variable is sub-Gaussian, we can show that each term $H_{ij} D_{jj} u_j$ is sub-Gaussian with variance proxy $H_{ij}^2 u_j^2$. As a result, $v_i$, being the sum of $n$ independent sub-Gaussian random variables, is itself sub-Gaussian with variance proxy $\sigma^2 = \sum_j H_{ij}^2 u_j^2 \leq \lVert u \rVert_2^2 = 1$.

Since the $v_i$’s are sub-Gaussian with mean $0$, we know the following Chernoff bound holds:

$$ \mathbb{P}[\lvert v_i \rvert \geq t] \leq 2 \exp \Big( -\frac{t^2}{2\sigma^2} \Big) \leq 2 e^{-t^2/2}, $$

so that if $t = \sqrt{2\log (4n / \delta)}$, then $\mathbb{P}[\lvert v_i \rvert \geq t] \leq \delta/(2n)$. By applying the union bound over the $n$ coordinates of $v$, we can deduce the following, which is known as the flattening lemma:

$$ \mathbb{P}[\lVert v \rVert_\infty \geq t] \leq \frac{\delta}{2}. $$

From here on, let’s condition on the event that the flattening lemma holds, i.e., $\lVert v \rVert_\infty < t$, which happens with probability at least $1 - \delta/2$.

Part II: Now, let’s consider $\frac{1}{\sqrt{k}} Pv$. We have to show that the deviation of the squared $\ell_2$ norm of this product from $\lVert u \rVert_2^2 = 1$ is bounded. In other words, we want to assert that:

$$ \mathbb{P}\Bigg[\Big\lvert \big\lVert \frac{1}{\sqrt{k}} Pv \big\rVert_2^2 - 1 \Big\rvert \geq \epsilon \Bigg] \leq \delta. \tag{P1} $$

Let $X = \lVert \frac{1}{\sqrt{k}} Pv \rVert_2^2$. This can be expressed as the sum of random variables $X_1, \dots, X_k$, where $X_i = v_{j_i}^2/k$ and $j_i \in [n]$ is the uniformly random column selected in the $i$-th row of $P$. As a result, $\mathbb{E}[X_i] = \frac{1}{nk}\sum_{j=1}^n v_j^2 = \lVert v \rVert_2^2/(nk) = 1/k$, and therefore $\mathbb{E}[X] = \sum_{i=1}^k \mathbb{E}[X_i] = 1$.

Let us now bound $\mathop{Var}[X_i]$. Notice that:

$$ \begin{aligned} \mathop{Var}[X_i] \leq \mathbb{E}[X_i^2] &= \frac{1}{k^2} \mathbb{E}\big[v_{j_i}^4\big] = \frac{1}{k^2} \cdot \frac{1}{n} \sum_{j=1}^n v_j^4 \\ &\leq \frac{1}{k^2} \cdot \frac{1}{n} \Big( \max_j v_j^2 \Big) \sum_{j=1}^n v_j^2 = \frac{1}{k^2} \cdot \frac{\lVert v \rVert_\infty^2 \lVert v \rVert_2^2}{n} \\ &\leq \frac{1}{k^2} \cdot \frac{2 \log(\frac{4n}{\delta}) \cdot n}{n} = \frac{2 \log(\frac{4n}{\delta})}{k^2}, \end{aligned} $$

where the last inequality uses the flattening lemma ($\lVert v \rVert_\infty^2 \leq 2\log(4n/\delta)$) and the fact that $\lVert v \rVert_2^2 = n$.

The variance of $X$ is therefore bounded by $2\log(4n/\delta)/k$.

Conditioned on the flattening lemma, $X_i \leq 2 \log (4n / \delta)/k = \tau$, making $X_i$ a bounded random variable, hence sub-exponential. Their sum, $X$, is therefore sub-exponential as well. We can now apply Bernstein’s inequality as follows:

$$ \begin{aligned} \mathbb{P}\Big[ \big\lvert X - 1 \big\rvert \geq \epsilon \Big] &\leq 2\exp\Big(-\frac{\epsilon^2}{2(\mathop{Var}[X] + \tau \epsilon)}\Big) \\ &\leq 2 \exp \Big( - \frac{\epsilon^2}{2 \big( 2\log(4n/\delta)/k + 2\epsilon \log(4n/\delta)/k \big) } \Big) \\ &= 2 \exp\Big(-\frac{k\epsilon^2}{4\log(4n/\delta) (1 + \epsilon)} \Big). \end{aligned} $$

When $k = \Omega(\epsilon^{-2} \log(1/\delta) \log(n/\delta))$, the above probability is at most $\delta/2$.

Part III: We have thus far conditioned on the event $\mathcal{E}$ that $\lVert v \rVert_\infty \leq \sqrt{2 \log (4n/\delta)}$, which holds with probability at least $1 - \delta/2$, and showed that, under this conditioning, the deviation probability in (P1) is at most $\delta/2$. We must now remove the conditioning:

$$ \mathbb{P}[\mathcal{E}] \cdot \mathbb{P}\Bigg[ \Big\lvert \big\lVert \frac{1}{\sqrt{k}} Pv \big\rVert_2^2 - 1 \Big\rvert \leq \epsilon \;\Big\vert\; \mathcal{E} \Bigg] \geq \Big(1 - \frac{\delta}{2}\Big)^2 \geq 1 - \delta. $$

  1. Achlioptas, D. (2001). Database-friendly random projections. Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 274–281.