Concentration of Sub-Gaussian Random Variables

This post reviews basic tail bounds (Markov, Chebyshev, Chernoff, Hoeffding) and explains the concept of sub-Gaussian random variables.

Basic Tail Bounds

One of the most basic inequalities governing the tail behavior of a distribution is Markov's inequality. It says that, if $X$ is a non-negative random variable, then its tail probability has the following upper bound:

$$ \mathbb{P}[X \geq t] \leq \frac{\mathbb{E}[X]}{t}. \tag{1} $$

Clearly, this only makes sense if $t > 0$ and $\mathbb{E}[X]$ is finite.

Markov's inequality is tight, in the sense that it cannot be improved in general. To see this, consider a random variable $X$ taking values in $\{0, \alpha\}$ with probabilities $p$ and $1-p$. A straightforward calculation shows that, at $t = \alpha$, the inequality in (1) becomes an equality.

More generally, notice that:

$$
\begin{aligned}
\mathbb{P}[X \geq t] = \frac{\mathbb{E}[X]}{t}
&\Rightarrow \int_{t}^\infty f(x)\;dx = \int_{0}^\infty \frac{1}{t} x f(x)\; dx \\
&\Rightarrow \int_{0}^\infty 1_{x \geq t}\, f(x)\;dx = \int_{0}^\infty \frac{1}{t} x f(x)\; dx \\
&\Rightarrow \int_0^\infty \big(t\, 1_{x \geq t} - x\big) f(x) \; dx = 0 \\
&\Rightarrow x = t\, 1_{x \geq t}.
\end{aligned}
$$

As such, any random variable with the property that $X = t\, 1_{X \geq t}$ almost surely leads to equality in Markov's bound.
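As a quick numerical sanity check, here is a minimal sketch (assuming NumPy; the values of $p$, $\alpha$, and the sample size are arbitrary illustrative choices) that samples the two-point variable from the tightness example above and compares its empirical tail probability at $t = \alpha$ against the Markov bound:

```python
import numpy as np

# Two-point variable from the tightness example: X = 0 w.p. p, X = alpha w.p. 1 - p.
# At t = alpha, Markov's bound P[X >= t] <= E[X] / t should hold with equality.
rng = np.random.default_rng(0)
p, alpha, n = 0.7, 3.0, 1_000_000

x = rng.choice([0.0, alpha], size=n, p=[p, 1 - p])

tail = np.mean(x >= alpha)     # empirical P[X >= alpha]
bound = x.mean() / alpha       # empirical E[X] / alpha

print(f"P[X >= alpha] ~ {tail:.4f},  Markov bound ~ {bound:.4f}")  # both ~ 1 - p = 0.3
```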

From Markov to Chebyshev to Chernoff

A fairly trivial extension of Markov's inequality is Chebyshev's inequality. It is arrived at by a simple application of Equation (1) to the non-negative random variable $X := (Y - \mathbb{E}[Y])^2$ for an arbitrary random variable $Y$:

$$ \mathbb{P}\big[ \lvert Y - \mathbb{E}[Y] \rvert \geq t \big] \leq \frac{\mathrm{Var}[Y]}{t^2}. \tag{2} $$

Similar to Markov's inequality, Chebyshev's inequality is tight in the sense that it cannot be improved in general. It is easy to construct a random variable that results in equality once we notice that we must have $X = (Y - \mathbb{E}[Y])^2 \in \{0, \alpha\}$. For example, the following variable meets (2) with equality at $t = \sqrt{\alpha}$:

$$ Y = \begin{cases} -\sqrt{\alpha} & \text{with probability } p/2 \\ 0 & \text{with probability } 1-p \\ \sqrt{\alpha} & \text{with probability } p/2. \end{cases} $$
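A minimal numerical sketch of this (again assuming NumPy, with arbitrary illustrative values for $p$, $\alpha$, and the sample size) confirms that Chebyshev's bound is met with near-equality at $t = \sqrt{\alpha}$:

```python
import numpy as np

# Three-point variable: -sqrt(alpha), 0, +sqrt(alpha) with probabilities p/2, 1-p, p/2.
# At t = sqrt(alpha): P[|Y - E[Y]| >= t] = p and Var[Y] / t^2 = (alpha * p) / alpha = p.
rng = np.random.default_rng(1)
p, alpha, n = 0.4, 2.0, 1_000_000
t = np.sqrt(alpha)

y = rng.choice([-t, 0.0, t], size=n, p=[p / 2, 1 - p, p / 2])

tail = np.mean(np.abs(y) >= t)   # E[Y] = 0 exactly, so compare |Y| against t directly
bound = y.var() / t**2           # empirical Var[Y] / t^2

print(f"P[|Y - E[Y]| >= t] ~ {tail:.4f},  Chebyshev bound ~ {bound:.4f}")  # both ~ p = 0.4
```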

One can just as easily extend Chebyshev’s inequality to higher moments, so that:

$$ \mathbb{P}\big[ \lvert Y - \mathbb{E}[Y] \rvert \geq t \big] \leq \frac{\mathbb{E}\big[ \lvert Y - \mathbb{E}[Y] \rvert^k \big]}{t^k}. \tag{3} $$

Even more interesting is the fact that we can extend this formulation to functions other than polynomials, as long as they are non-negative and non-decreasing. For example, consider the exponential function with $\lambda > 0$:

$$ \mathbb{P}[ X - \mathbb{E}[X] \geq t ] = \mathbb{P}\big[ e^{\lambda (X - \mathbb{E}[X])} \geq e^{\lambda t} \big] \leq \frac{\mathbb{E}\big[ e^{\lambda (X - \mathbb{E}[X])} \big]}{e^{\lambda t}}. \tag{4} $$

Optimizing our choice of $\lambda > 0$ so as to obtain the tightest bound gives us the Chernoff bound!

We can show that Equation (3) with the optimal choice of $k$ is never worse than Equation (4) with the optimal choice of $\lambda$. Concretely, suppose $X \geq 0$ and that the moment generating function of $X$ exists in a neighborhood of $0$. Then, for any given $\delta > 0$, we can see that:

$$
\begin{aligned}
\mathbb{E}[e^{\lambda X}]
&= \sum_{i} \frac{\lambda^i\, \mathbb{E}[X^i]}{i!} = \sum_{i} \frac{(\lambda \delta)^i}{i!} \frac{\mathbb{E}[X^i]}{\delta^i} \\
&\geq \sum_{i} \frac{(\lambda \delta)^i}{i!} \inf_{k=1,2,3,\ldots} \frac{\mathbb{E}[X^k]}{\delta^k} = e^{\lambda \delta} \inf_{k=1,2,3,\ldots} \frac{\mathbb{E}[X^k]}{\delta^k},
\end{aligned}
$$

so that:

$$ \inf_{k=1,2,3,\ldots} \frac{\mathbb{E}[X^k]}{\delta^k} \leq \inf_{\lambda > 0} \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda \delta}}. $$
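As a small numerical illustration of this claim (a sketch only; the Exponential(1) distribution and $\delta = 3$ are arbitrary choices, and NumPy is assumed), we can compare the two optimized bounds directly, using the known moments $\mathbb{E}[X^k] = k!$ and moment generating function $1/(1-\lambda)$ for $\lambda < 1$:

```python
import numpy as np
from math import factorial

# Optimized moment bound vs. optimized Chernoff bound for X ~ Exponential(1), delta = 3.
delta = 3.0

# Moment bound: inf_k E[X^k] / delta^k with E[X^k] = k! (search over a finite range of k).
moment_bound = min(factorial(k) / delta**k for k in range(1, 50))

# Chernoff bound: inf_{0 < lambda < 1} E[e^{lambda X}] / e^{lambda delta},
# with E[e^{lambda X}] = 1 / (1 - lambda) (grid search over lambda).
lams = np.linspace(1e-3, 1 - 1e-3, 10_000)
chernoff_bound = np.min(np.exp(-lams * delta) / (1 - lams))

print(f"moment bound   ~ {moment_bound:.4f}")   # ~ 0.222
print(f"Chernoff bound ~ {chernoff_bound:.4f}") # ~ 0.406, never smaller than the moment bound
```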

Application to Gaussian variables

Consider $X \sim \mathcal{N}(\mu, \sigma^2)$. Let's derive the Chernoff bound for $X$ by optimizing our choice of $\lambda$. First, we compute the right-hand side of Equation (4):

$$
\begin{aligned}
e^{-\lambda t} \, \mathbb{E}\big[ e^{\lambda (X - \mu)} \big]
&= e^{-\lambda t} \int \frac{1}{\sqrt{2\pi \sigma^2}} \exp\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big) \exp\big(\lambda (x - \mu)\big)\, dx \\
&\stackrel{y := x-\mu}{=} e^{-\lambda t} \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{y^2 - 2\sigma^2\lambda y + \sigma^4\lambda^2}{2\sigma^2}\Big) \exp\Big(\frac{\sigma^2\lambda^2}{2}\Big)\, dy \\
&= e^{-\lambda t} \exp\Big(\frac{\sigma^2\lambda^2}{2}\Big) \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(y - \sigma^2\lambda)^2}{2\sigma^2}\Big)\, dy \\
&= e^{-\lambda t} \, e^{\sigma^2\lambda^2/2}.
\end{aligned}
$$

Differentiating with respect to $\lambda$ and setting the derivative to $0$:

$$ \frac{\partial}{\partial \lambda}\, e^{-\lambda t} e^{\sigma^2\lambda^2/2} = (\sigma^2 \lambda - t)\, e^{-\lambda t} e^{\sigma^2\lambda^2/2} = 0, $$

gives us the optimal choice $\lambda = t/\sigma^2$. Plugging this value back into Equation (4) gives the Chernoff bound for our Gaussian random variable:

$$ \mathbb{P}[ X - \mu \geq t ] \leq \frac{e^{t^2/2\sigma^2}}{e^{t^2/\sigma^2}} = e^{-t^2/2\sigma^2}. \tag{5} $$
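The following sketch (assuming NumPy; $\mu$, $\sigma$, $t$, and the sample size are arbitrary illustrative values) checks Equation (5) by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of the Gaussian Chernoff bound (5): P[X - mu >= t] <= exp(-t^2 / (2 sigma^2)).
rng = np.random.default_rng(2)
mu, sigma, t, n = 1.0, 2.0, 3.0, 1_000_000

x = rng.normal(mu, sigma, size=n)

empirical = np.mean(x - mu >= t)
bound = np.exp(-t**2 / (2 * sigma**2))

print(f"P[X - mu >= t] ~ {empirical:.4f},  Chernoff bound = {bound:.4f}")
# roughly 0.067 vs 0.325: the bound holds, though it is not tight for a fixed t.
```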

Sub-Gaussian random variables

It is easy to see that, if the following holds for a random variable $X$ with mean $\mu$:

$$ \mathbb{E}\big[ e^{\lambda (X - \mu)} \big] \leq e^{\sigma^2 \lambda^2/2} \quad \forall\, \lambda \in \mathbb{R}, \tag{6} $$

for some constant $\sigma$, then inequality (5) holds for $X$. We call such random variables sub-Gaussian with parameter $\sigma$.

Example (1): Rademacher random variables (taking values $\pm 1$ with equal probability) are sub-Gaussian with parameter $\sigma = 1$. That is because:

$$
\begin{aligned}
\mathbb{E}[e^{\lambda X}]
&= \frac{1}{2}\big( e^{\lambda} + e^{-\lambda} \big) \\
&= \frac{1}{2} \sum_{k=0}^\infty \Big( \frac{\lambda^k}{k!} + \frac{(-\lambda)^k}{k!} \Big) \\
&= \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \\
&\leq \sum_{k=0}^\infty \frac{\lambda^{2k}}{2^k k!} = e^{\lambda^2/2}.
\end{aligned}
$$
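Numerically, the final inequality above amounts to $\cosh(\lambda) \leq e^{\lambda^2/2}$, which a short sketch (assuming NumPy, over an arbitrary grid of $\lambda$ values) can confirm:

```python
import numpy as np

# E[e^{lambda X}] = cosh(lambda) for a Rademacher X; check it never exceeds exp(lambda^2 / 2).
lams = np.linspace(-10, 10, 2001)
assert np.all(np.cosh(lams) <= np.exp(lams**2 / 2))
print("cosh(lambda) <= exp(lambda^2 / 2) holds on the grid")
```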

Example (2): Any zero-mean random variable $X$ supported on the interval $[a, b]$ is sub-Gaussian with parameter at most $\sigma = b - a$.

Proving this uses a symmetrization argument. We first introduce an independent copy of $X$, call it $X^\prime$, and let $\gamma$ be an independent Rademacher random variable. By the convexity of the exponential and Jensen's inequality (applied to the expectation over $X^\prime$), we can write:

$$ \mathbb{E}[e^{\lambda X}] = \mathbb{E}\big[ e^{\lambda (X - \mathbb{E}_{X^\prime}[X^\prime])} \big] \leq \mathbb{E}_{X, X^\prime} \big[e^{\lambda(X - X^\prime)}\big]. $$

Using the fact that the distribution of $X - X^\prime$ is the same as the distribution of $\gamma (X - X^\prime)$, and applying the result from Example (1) conditionally on $X, X^\prime$, we can write:

$$ \mathbb{E}[e^{\lambda X}] \leq \mathbb{E}_{X, X^\prime, \gamma} \big[ e^{\lambda \gamma (X - X^\prime)} \big] \leq \mathbb{E}_{X, X^\prime} \big[ e^{\lambda^2 (X - X^\prime)^2 /2} \big] \leq e^{\lambda^2 (b - a)^2/2}. $$
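As a numerical sanity check of Example (2), here is a sketch assuming NumPy; the uniform distribution, the interval $[-1, 2]$, the grid of $\lambda$ values, and the sample size are all arbitrary illustrative choices:

```python
import numpy as np

# Check E[e^{lambda X}] <= exp(lambda^2 (b - a)^2 / 2) for a centered bounded variable.
rng = np.random.default_rng(3)
a, b, n = -1.0, 2.0, 1_000_000

x = rng.uniform(a, b, size=n)
x = x - x.mean()                 # center the sample, matching the zero-mean assumption

lams = np.linspace(-3, 3, 61)
mgf = np.array([np.mean(np.exp(l * x)) for l in lams])   # empirical MGF at each lambda
bound = np.exp(lams**2 * (b - a)**2 / 2)

print("bound holds on the grid:", bool(np.all(mgf <= bound)))
```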

Example (2) essentially gives us Hoeffding's inequality. Concretely, because the sum of two independent sub-Gaussian random variables with parameters $\sigma_1$ and $\sigma_2$ is itself sub-Gaussian with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$, we can show that if the $X_i$'s are independent sub-Gaussian random variables with parameters $\sigma_i$ and means $\mu_i$, then we have that:

$$ \mathbb{P}\Big[\sum_i (X_i - \mu_i) \geq t\Big] \leq \exp\Big(-\frac{t^2}{2\sum_i \sigma_i^2}\Big). $$

For $n$ independent random variables bounded in $[a, b]$, Example (2) then gives the familiar form:

$$ \mathbb{P}\Big[\sum_i (X_i - \mu_i) \geq t\Big] \leq \exp\Big(-\frac{t^2}{2 n (b-a)^2}\Big). $$
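A Monte Carlo sketch of this bound (assuming NumPy; uniform $X_i$ on $[0, 1]$, $n = 100$, $t = 5$, and the number of trials are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check of the Hoeffding-type bound above for sums of independent bounded variables.
rng = np.random.default_rng(4)
a, b, n, t, trials = 0.0, 1.0, 100, 5.0, 200_000

x = rng.uniform(a, b, size=(trials, n))
sums = (x - (a + b) / 2).sum(axis=1)        # sum_i (X_i - mu_i), with mu_i = (a + b) / 2

empirical = np.mean(sums >= t)
bound = np.exp(-t**2 / (2 * n * (b - a)**2))

print(f"P[sum >= t] ~ {empirical:.4f},  bound ~ {bound:.4f}")  # ~0.04 vs ~0.88
```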

The result above can be further tightened by showing that a bounded random variable $X \in [a, b]$ is in fact sub-Gaussian with parameter $(b - a)/2$.

Proving that last statement is a bit more involved. We start by defining the following function (the cumulant generating function of $X$):

$$ \psi(\lambda) = \log \mathbb{E}[e^{\lambda X}]. $$

The idea here is to expand $\psi$ around $0$ using Taylor's theorem with the Lagrange form of the remainder. Since $\psi(0) = 0$, for every $\lambda$ there is some $\xi$ between $0$ and $\lambda$ such that:

$$ \psi(\lambda) = \lambda\, \psi^\prime(0) + \frac{\lambda^2}{2} \psi^{\prime\prime}(\xi). $$

Notice how the right-hand side has the form we hope to arrive at in Equation (6), namely $\exp(\psi) \leq \exp(\lambda \mu + \lambda^2 \sigma^2/2)$. So if we show that $\psi^\prime(0) = \mu$ and that $\psi^{\prime\prime}(\lambda) \leq (b-a)^2/4$ for every $\lambda$, then we have proven our claim with $\sigma = (b-a)/2$.

To that end, consider the derivative of $\psi$ with respect to $\lambda$, which is simply:

$$ \psi^\prime(\lambda) = \frac{\mathbb{E}[X e^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]}. $$

It's clear that $\psi^\prime(0) = \mu$, so that takes care of the first part. The second derivative is:

$$ \psi^{\prime\prime}(\lambda) = \frac{\mathbb{E}[X^2 e^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]} - \Big( \frac{\mathbb{E}[X e^{\lambda X}]}{\mathbb{E}[e^{\lambda X}]} \Big)^2. $$

Now, we have to show that, for every $\lambda$, this quantity is no greater than $(b-a)^2/4$. Showing this is a bit tricky and involves a bit of measure theory. Let's rearrange the terms in the last expression so it's visually clearer what's going on:

$$ \psi^{\prime\prime}(\lambda) = \mathbb{E}_X\bigg[ X^2\, \frac{e^{\lambda X}}{\mathbb{E}[e^{\lambda X}]} \bigg] - \mathbb{E}_X\bigg[ X\, \frac{e^{\lambda X}}{\mathbb{E}[e^{\lambda X}]} \bigg]^2. $$

This looks almost like the expression for the variance of a random variable, but with an extra weighting term $e^{\lambda X} / \mathbb{E}[e^{\lambda X}]$. Here, we can appeal to the Radon-Nikodym theorem: since this weight is non-negative and integrates to one against the distribution of $X$, we can define a new (tilted) distribution $Q_\lambda$ whose Radon-Nikodym derivative with respect to the distribution of $X$ is exactly $e^{\lambda X} / \mathbb{E}[e^{\lambda X}]$. What that accomplishes is that $\psi^{\prime\prime}$ can now be expressed as the variance of a random variable drawn from $Q_\lambda$:

$$ \psi^{\prime\prime}(\lambda) = \mathbb{E}_{X_\lambda \sim Q_\lambda}[X_\lambda^2] - \mathbb{E}_{X_\lambda \sim Q_\lambda}[X_\lambda]^2. $$

So, using the fact that the mean is the minimizer of the mean squared error, and knowing that $X_\lambda \in [a, b]$, we can write:

$$ \psi^{\prime\prime}(\lambda) = \mathbb{E}\big[(X_\lambda - \mathbb{E}[X_\lambda])^2\big] \leq \mathbb{E}\Big[\Big(X_\lambda - \frac{a+b}{2}\Big)^2\Big] \leq \Big(b - \frac{a+b}{2}\Big)^2 = \frac{(b - a)^2}{4}. $$

Putting all the pieces together, we have shown that $\psi(\lambda) \leq \lambda \mu + \lambda^2 (b-a)^2/8$, which means that a random variable supported on the interval $[a, b]$ is sub-Gaussian with parameter $(b - a)/2$.
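As a final numerical check of the key step, the sketch below (assuming NumPy; the three-point distribution on $[-1, 2]$ and the grid of $\lambda$ values are arbitrary illustrative choices) evaluates $\psi^{\prime\prime}(\lambda)$ as the variance under the tilted distribution $Q_\lambda$ and verifies that it never exceeds $(b-a)^2/4$:

```python
import numpy as np

# Check psi''(lambda) <= (b - a)^2 / 4 for a discrete bounded variable.
vals = np.array([-1.0, 0.5, 2.0])        # support, contained in [a, b] = [-1, 2]
probs = np.array([0.3, 0.5, 0.2])
a, b = vals.min(), vals.max()

def psi_dd(lam):
    w = probs * np.exp(lam * vals)
    w = w / w.sum()                      # the tilted distribution Q_lambda
    m1 = np.sum(w * vals)                # E_{Q_lambda}[X_lambda]
    m2 = np.sum(w * vals**2)             # E_{Q_lambda}[X_lambda^2]
    return m2 - m1**2                    # Var_{Q_lambda}[X_lambda] = psi''(lambda)

lams = np.linspace(-5.0, 5.0, 101)
assert all(psi_dd(l) <= (b - a)**2 / 4 for l in lams)
print("psi''(lambda) <= (b - a)^2 / 4 holds on the grid")
```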