One of the most basic inequalities governing the tails of a distribution is Markov’s inequality. It says that if X is a non-negative random variable, then its tail probability has the following upper bound:
$$\mathbb{P}[X \geq t] \leq \frac{\mathbb{E}[X]}{t}. \tag{1}$$
Clearly, this only makes sense if t>0 and E[X] is finite.
The Markov inequality is tight, in the sense that it cannot be improved in general. To see this, consider the random variable X taking the value 0 with probability p and α with probability 1 − p. Then E[X] = α(1 − p) and P[X ≥ α] = 1 − p = E[X]/α, so the inequality in (1) holds with equality at t = α.
More generally, any random variable with the property that $X = t\,\mathbb{1}\{X \geq t\}$ almost surely leads to equality in Markov’s bound.
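To see this numerically, here is a quick sketch in Python (the values of α and p, and the use of NumPy, are arbitrary choices for illustration) that compares the empirical tail probability of the two-point variable above with the Markov bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# X takes the value 0 with probability p and alpha with probability 1 - p
alpha, p = 3.0, 0.2
X = np.where(rng.random(1_000_000) < p, 0.0, alpha)

t = alpha                  # Markov is tight exactly at the point mass alpha
tail = (X >= t).mean()     # empirical P[X >= t]
markov = X.mean() / t      # the Markov bound E[X] / t

print(tail, markov)        # both are close to 1 - p = 0.8
```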
From Markov to Chebyshev to Chernoff
A fairly straightforward extension of Markov’s inequality is Chebyshev’s inequality. It is arrived at by applying Equation (1), with threshold t², to the non-negative random variable $X := (Y - \mathbb{E}[Y])^{2}$ for an arbitrary random variable Y:
$$\mathbb{P}\left[\left|Y - \mathbb{E}[Y]\right| \geq t\right] \leq \frac{\operatorname{Var}[Y]}{t^{2}}. \tag{2}$$
Similar to the Markov inequality, we can show that Chebyshev’s inequality cannot be improved in general. It is easy to construct a random variable that results in equality once we notice that we need $X = (Y - \mathbb{E}[Y])^{2}$ to take only two values, 0 and some positive constant. For example:
$$Y = \begin{cases} -\alpha & \text{with probability } p/2, \\ 0 & \text{with probability } 1 - p, \\ \alpha & \text{with probability } p/2. \end{cases}$$

Here E[Y] = 0, Var[Y] = α²p, and P[|Y − E[Y]| ≥ α] = p = Var[Y]/α², so Equation (2) holds with equality at t = α.
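Here is the analogous numerical check, again a small Python sketch with arbitrary values of α and p:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, p = 2.0, 0.3
# Y = -alpha or +alpha each with probability p/2, and 0 with probability 1 - p
Y = rng.choice([-alpha, 0.0, alpha], size=1_000_000, p=[p / 2, 1 - p, p / 2])

t = alpha                          # Chebyshev is tight exactly at t = alpha
tail = (np.abs(Y) >= t).mean()     # empirical P[|Y - E[Y]| >= t], using E[Y] = 0 by symmetry
chebyshev = Y.var() / t**2         # Var[Y] / t^2, approximately alpha^2 p / alpha^2 = p

print(tail, chebyshev)             # both are close to p = 0.3
```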
One can just as easily extend Chebyshev’s inequality to higher moments, so that:
$$\mathbb{P}\left[\left|Y - \mathbb{E}[Y]\right| \geq t\right] \leq \frac{\mathbb{E}\left[\left|Y - \mathbb{E}[Y]\right|^{k}\right]}{t^{k}}. \tag{3}$$
Even more interesting is the fact that we can extend this formulation to functions other than polynomials! For example, consider the exponential function: for any λ > 0,

$$\mathbb{P}\left[Y - \mathbb{E}[Y] \geq t\right] = \mathbb{P}\left[e^{\lambda(Y - \mathbb{E}[Y])} \geq e^{\lambda t}\right] \leq \frac{\mathbb{E}\left[e^{\lambda(Y - \mathbb{E}[Y])}\right]}{e^{\lambda t}}. \tag{4}$$
Optimizing our choice of λ so as to obtain the tightest result gives us the Chernoff bound!
We can show that Equation (3) with the optimal choice of k is never worse than Equation (4) with the optimal choice of λ. Concretely, suppose X ≥ 0 and that the moment generating function of X exists in a neighborhood of 0. Then for any given δ > 0, we can see that:

$$\inf_{k=0,1,2,\dots} \frac{\mathbb{E}\left[X^{k}\right]}{\delta^{k}} \;\leq\; \inf_{\lambda > 0} \frac{\mathbb{E}\left[e^{\lambda X}\right]}{e^{\lambda \delta}}.$$
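To see this inequality in action, here is a small numerical sketch in Python. The choice of X ∼ Exp(1) and of δ is arbitrary; for this distribution we know E[X^k] = k! and E[e^{λX}] = 1/(1 − λ) for λ < 1, so both sides can be evaluated directly:

```python
import math
import numpy as np

delta = 5.0

# Moment bound (3): min over k of E[X^k] / delta^k, with E[X^k] = k! for Exp(1)
moment_bound = min(math.factorial(k) / delta**k for k in range(30))

# Chernoff bound (4): min over lambda in (0, 1) of E[e^{lambda X}] / e^{lambda delta},
# with E[e^{lambda X}] = 1 / (1 - lambda) for Exp(1)
lambdas = np.linspace(1e-3, 1 - 1e-3, 10_000)
chernoff_bound = np.min(np.exp(-lambdas * delta) / (1 - lambdas))

true_tail = math.exp(-delta)  # exact P[X >= delta] for Exp(1)

print(true_tail, moment_bound, chernoff_bound)
# true tail <= optimal moment bound <= optimal Chernoff bound
```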
Example (2): Any zero-mean random variable X supported on the interval [a, b] is sub-Gaussian with parameter at most σ = b − a.
Proving this requires the symmetrization argument. We first introduce an independent copy of X, call it X′, and let γ be a Rademacher random variable. By the convexity of the exponential and Jensen’s inequality, we can write:
$$\mathbb{E}\left[e^{\lambda X}\right] = \mathbb{E}_{X}\left[e^{\lambda\left(X - \mathbb{E}_{X'}[X']\right)}\right] \leq \mathbb{E}_{X, X'}\left[e^{\lambda(X - X')}\right].$$
Using the fact that the distribution of X − X′ is the same as the distribution of γ(X − X′), and applying the result from Example (1) conditionally on (X, X′), we can write:

$$\mathbb{E}_{X, X'}\left[e^{\lambda(X - X')}\right] = \mathbb{E}_{X, X'}\left[\mathbb{E}_{\gamma}\left[e^{\lambda \gamma (X - X')}\right]\right] \leq \mathbb{E}_{X, X'}\left[e^{\lambda^{2}(X - X')^{2}/2}\right] \leq e^{\lambda^{2}(b-a)^{2}/2},$$

where the last step uses the fact that |X − X′| ≤ b − a. This is exactly the sub-Gaussian condition with parameter σ = b − a.
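As a quick sanity check on Example (2), here is a short Python sketch that estimates E[e^{λX}] for a zero-mean bounded variable (a uniform on [−1, 1], an arbitrary test case) and compares it against the claimed envelope e^{λ²(b−a)²/2}:

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = -1.0, 1.0
X = rng.uniform(a, b, size=1_000_000)   # zero-mean, bounded test variable

lambdas = np.linspace(-5, 5, 101)
mgf = np.array([np.mean(np.exp(lam * X)) for lam in lambdas])  # E[e^{lambda X}]
bound = np.exp(lambdas**2 * (b - a) ** 2 / 2)                  # e^{lambda^2 (b - a)^2 / 2}

print(np.all(mgf <= bound))   # True: the MGF sits below the sub-Gaussian envelope
```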
Example (2) essentially gives us Hoeffding’s inequality. Concretely, because the sum of two independent sub-Gaussian random variables with parameters $\sigma_1$ and $\sigma_2$ is itself a sub-Gaussian random variable with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$, we can show that if the $X_i$’s are independent sub-Gaussian random variables with parameters $\sigma_i$ and means $\mu_i$, then we have that:
$$\mathbb{P}\left[\sum_i (X_i - \mu_i) \geq t\right] \leq \exp\left(-\frac{t^{2}}{2\sum_i \sigma_i^{2}}\right).$$
For bounded random variables $X_i \in [a, b]$, plugging in the parameter $\sigma_i = b - a$ from Example (2) gives the familiar form:
$$\mathbb{P}\left[\sum_i (X_i - \mu_i) \geq t\right] \leq \exp\left(-\frac{t^{2}}{2n(b-a)^{2}}\right).$$
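Here is a small simulation sketch in Python (the choice of uniform variables, n, and t is arbitrary) comparing the empirical tail probability of a centered sum with this bound:

```python
import numpy as np

rng = np.random.default_rng(0)

n, t = 50, 5.0
a, b = 0.0, 1.0
trials = 200_000

# Each row holds n i.i.d. Uniform[a, b] samples; mu_i = (a + b) / 2 for every i
X = rng.uniform(a, b, size=(trials, n))
sums = (X - (a + b) / 2).sum(axis=1)        # sum_i (X_i - mu_i)

empirical = (sums >= t).mean()              # empirical P[sum >= t]
bound = np.exp(-t**2 / (2 * n * (b - a) ** 2))

print(empirical, bound)                     # the empirical tail sits below the bound
```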
The result above can be further tightened by showing that a bounded random variable $X \in [a, b]$ is in fact sub-Gaussian with parameter $(b - a)/2$, which improves the bound to $\exp\left(-\frac{2t^{2}}{n(b-a)^{2}}\right)$.
Proving that last statement is a bit involved. We’d have to define the following function:
$$\psi(\lambda) = \log \mathbb{E}\left[e^{\lambda X}\right].$$
The idea here is to expand ψ around 0:
$$\psi(\lambda) = \psi(0) + \lambda\,\psi'(0) + \frac{\lambda^{2}}{2}\,\psi''(\xi) \quad \text{for some } \xi \text{ between } 0 \text{ and } \lambda,$$

where ψ(0) = log E[1] = 0. Notice how the right-hand side has the form we hope to get to in Equation (6), namely $e^{\psi(\lambda)} \leq e^{\lambda\mu + \lambda^{2}\sigma^{2}/2}$. So if we show that $\psi'(0) = \mu$ and that $\psi''(\lambda) \leq (b-a)^{2}/4$ for every λ, then we have proven our claim with $\sigma = (b-a)/2$.
To that end, consider the derivative of ψ with respect to λ, which is simply:
$$\psi'(\lambda) = \frac{\mathbb{E}\left[X e^{\lambda X}\right]}{\mathbb{E}\left[e^{\lambda X}\right]}.$$
It’s clear that ψ′(0)=μ, so that takes care of the first part. Its second derivative then is:
$$\psi''(\lambda) = \frac{\mathbb{E}\left[X^{2} e^{\lambda X}\right]}{\mathbb{E}\left[e^{\lambda X}\right]} - \left(\frac{\mathbb{E}\left[X e^{\lambda X}\right]}{\mathbb{E}\left[e^{\lambda X}\right]}\right)^{2}.$$
Now, we have to show that this is no greater than (b − a)²/4 for every value of λ. Showing this is a bit tricky and involves a bit of measure theory. Let’s rearrange the terms in the last expression so it’s visually clearer what’s going on:
$$\psi''(\lambda) = \mathbb{E}_{X}\left[X^{2}\,\frac{e^{\lambda X}}{\mathbb{E}\left[e^{\lambda X}\right]}\right] - \left(\mathbb{E}_{X}\left[X\,\frac{e^{\lambda X}}{\mathbb{E}\left[e^{\lambda X}\right]}\right]\right)^{2}.$$
This looks almost like the expression for the variance of a random variable, but with an extra factor $e^{\lambda X}/\mathbb{E}[e^{\lambda X}]$ inside each expectation. This is where the measure theory comes in: since that factor is non-negative and integrates to one, we can define a new distribution $Q_\lambda$ (the exponentially tilted distribution) whose Radon-Nikodym derivative with respect to the distribution of X is exactly $e^{\lambda X}/\mathbb{E}[e^{\lambda X}]$; by construction, $Q_\lambda$ is absolutely continuous with respect to the law of X. What this buys us is that ψ′′ can now be expressed as the variance of a random variable drawn from $Q_\lambda$:
$$\psi''(\lambda) = \mathbb{E}_{X_\lambda \sim Q_\lambda}\left[X_\lambda^{2}\right] - \left(\mathbb{E}_{X_\lambda \sim Q_\lambda}\left[X_\lambda\right]\right)^{2} = \operatorname{Var}_{Q_\lambda}\left[X_\lambda\right].$$
So, using the fact that the mean is the minimizer of the mean squared error, and knowing that $X_\lambda \in [a, b]$, we can write:

$$\psi''(\lambda) = \operatorname{Var}_{Q_\lambda}\left[X_\lambda\right] \leq \mathbb{E}_{X_\lambda \sim Q_\lambda}\left[\left(X_\lambda - \frac{a+b}{2}\right)^{2}\right] \leq \left(\frac{b-a}{2}\right)^{2} = \frac{(b-a)^{2}}{4}.$$
Putting all the pieces together, we have shown that $\psi(\lambda) \leq \lambda\mu + \lambda^{2}(b-a)^{2}/8$, which means that a random variable supported on the interval [a, b] is sub-Gaussian with parameter (b − a)/2.
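As a final sanity check, here is a short Python sketch (with an arbitrary discrete distribution on [0, 3]) that computes ψ′′(λ) as the variance of the tilted distribution Q_λ and confirms it stays below (b − a)²/4:

```python
import numpy as np

# An arbitrary discrete X supported on [a, b] = [0, 3]
values = np.array([0.0, 0.5, 2.0, 3.0])
probs = np.array([0.1, 0.4, 0.3, 0.2])
a, b = values.min(), values.max()

def psi_second(lam):
    """psi''(lambda) = Var under Q_lambda, where dQ_lambda/dP = e^{lam X} / E[e^{lam X}]."""
    weights = probs * np.exp(lam * values)
    q = weights / weights.sum()                  # the tilted distribution Q_lambda
    mean_q = np.dot(q, values)
    return np.dot(q, (values - mean_q) ** 2)     # variance under Q_lambda

lambdas = np.linspace(-10, 10, 201)
print(max(psi_second(lam) for lam in lambdas), (b - a) ** 2 / 4)
# the maximum tilted variance stays below (b - a)^2 / 4 = 2.25
```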