Sub-exponentiality is such an important property, as we saw in earlier posts. We learnt, for example, that sums of independent sub-exponential random variables are themselves sub-exponential, allowing us to extend concentration results to sums. In this post, I'll review another important result that lets us apply the same kind of bounds (such as Hoeffding's inequality) to functions of independent random variables, under certain conditions.
We have so far reviewed concentration results for sub-Gaussian and sub-exponential random variables. Much of what we studied concerned sums of independent random variables: For example, we showed that the sum of independent sub-Gaussian random variables is itself a sub-Gaussian random variable, and used those results to derive inequalities.
In this post, we'll take a look at the more general notion of stochastic processes and, in particular, a special type of stochastic process called a martingale. This machinery helps us derive concentration bounds for certain functions of independent random variables - not just their sums. First, though, let's review some of the basics.
Martingales
Consider a sequence of $\sigma$-algebras $\{\mathcal{F}_n\}$ such that $\mathcal{F}_i \subseteq \mathcal{F}_{i+1}$; such a sequence is called a filtration. A sequence of random variables $\{X_n\}_{n \geq 0}$ is a martingale with respect to the filtration $\{\mathcal{F}_n\}$ if:
$X_n$ is adapted to the filtration (i.e., $X_n$ is $\mathcal{F}_n$-measurable, so that $\mathbb{E}[X_n \mid \mathcal{F}_n] = X_n$);
$\mathbb{E}[\lvert X_n \rvert] < \infty$; and,
$\mathbb{E}[X_n \mid \mathcal{F}_{n-1}] = X_{n-1}$ almost surely.
It's easy to verify that the partial sums of independent zero-mean random variables form a martingale. That is, given a sequence of independent random variables $\{X_i\}_{i=1}^n$ with mean zero, $S_k = \sum_{i=1}^k X_i$ for $k \in [n]$ is a martingale with respect to the filtration $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$ generated by the $X_i$'s. So is the product of independent random variables with mean $1$: $P_k = \prod_{i=1}^k X_i$.
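To see why, assuming each $X_i$ is integrable, we can check the third property directly (the other two are immediate):
$$\mathbb{E}[S_k \mid \mathcal{F}_{k-1}] = \mathbb{E}[S_{k-1} + X_k \mid \mathcal{F}_{k-1}] = S_{k-1} + \mathbb{E}[X_k] = S_{k-1},$$
where we used the facts that $S_{k-1}$ is $\mathcal{F}_{k-1}$-measurable, that $X_k$ is independent of $\mathcal{F}_{k-1}$, and that $\mathbb{E}[X_k] = 0$. The product case follows the same pattern, with multiplication in place of addition.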
The Doob martingale is another example, which is constructed as follows: say we have a function $f : \mathbb{R}^n \to \mathbb{R}$ and a sequence of $n$ random variables $X = \{X_i\}_{i=1}^n$. If we define $Y_k = \mathbb{E}[f(X) \mid \{X_i\}_{i=1}^k]$ (or, more precisely, conditioning on the $\sigma$-algebra generated by the first $k$ random variables), so that $Y_0 = \mathbb{E}[f(X)]$ and $Y_n = f(X)$, then the sequence $\{Y_k\}_{k=0}^n$ is a martingale, provided $\mathbb{E}[\lvert f(X) \rvert]$ is finite.
Proof: We have to show that the sequence $\{Y_k\}$ has the properties of a martingale. First, notice that:
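$$\mathbb{E}[\lvert Y_k \rvert] = \mathbb{E}\big[\lvert \mathbb{E}[f(X) \mid \{X_i\}_{i=1}^k] \rvert\big] \leq \mathbb{E}\big[\mathbb{E}[\lvert f(X) \rvert \mid \{X_i\}_{i=1}^k]\big] = \mathbb{E}[\lvert f(X) \rvert] < \infty,$$
by Jensen's inequality and the tower property, so the integrability requirement holds, and $Y_k$ is measurable with respect to $\sigma(X_1, \ldots, X_k)$ by construction. For the martingale property, the tower property gives
$$\mathbb{E}\big[Y_k \mid \{X_i\}_{i=1}^{k-1}\big] = \mathbb{E}\Big[\mathbb{E}\big[f(X) \mid \{X_i\}_{i=1}^k\big] \,\Big|\, \{X_i\}_{i=1}^{k-1}\Big] = \mathbb{E}\big[f(X) \mid \{X_i\}_{i=1}^{k-1}\big] = Y_{k-1},$$
which is what we wanted.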
A related notion is the martingale difference sequence: if a sequence $\{\Delta_i\}_{i \geq 1}$ adapted to a filtration $\{\mathcal{F}_i\}_{i \geq 1}$ has the property that $\mathbb{E}[\lvert \Delta_k \rvert]$ is finite and that $\mathbb{E}[\Delta_k \mid \mathcal{F}_{k-1}] = 0$ for all $k$, then it is a martingale difference sequence. For example, when $\{X_i\}_{i \geq 0}$ is a martingale with respect to some filtration $\{\mathcal{F}_i\}_{i \geq 0}$, then $\{\Delta_i\}_{i \geq 1}$ with $\Delta_i = X_i - X_{i-1}$ is a martingale difference sequence.
Proof: The first property is trivial to show. As for the second property:
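$$\mathbb{E}[\Delta_k \mid \mathcal{F}_{k-1}] = \mathbb{E}[X_k \mid \mathcal{F}_{k-1}] - \mathbb{E}[X_{k-1} \mid \mathcal{F}_{k-1}] = X_{k-1} - X_{k-1} = 0,$$
where we used the martingale property of $\{X_i\}$ and the fact that $X_{k-1}$ is $\mathcal{F}_{k-1}$-measurable.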
Tail Bounds for Sub-Exponential Martingale Difference Sequences
Let's take a martingale difference sequence $\{\Delta_i\}_{i \geq 1}$ with filtration $\{\mathcal{F}_i\}_{i \geq 1}$, and impose the following constraint on it:
$$\mathbb{E}\left[e^{\lambda \Delta_k} \mid \mathcal{F}_{k-1}\right] \leq e^{\lambda^2 \nu_k^2 / 2}, \quad \forall\, \lvert \lambda \rvert \leq \frac{1}{\alpha_k}.$$
In other words, we require that $\Delta_k$ is sub-exponential with parameters $(\nu_k, \alpha_k)$, conditionally on $\mathcal{F}_{k-1}$, for all $k$. Then it's not hard to see the following result:
Any partial sum $\sum_{i=1}^n \Delta_i$ of sub-exponential martingale differences with parameters $(\nu_k, \alpha_k)$ is itself a sub-exponential random variable, with parameters $\left(\sqrt{\sum_{k=1}^n \nu_k^2},\, \max_k \alpha_k\right)$.
Proof: Let’s start with the definition of a sub-exponential random variable:
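We need to show that $\mathbb{E}\big[e^{\lambda \sum_{i=1}^n \Delta_i}\big] \leq e^{\lambda^2 \nu^2 / 2}$ for all $\lvert \lambda \rvert \leq 1/\alpha$, for suitable parameters $(\nu, \alpha)$. Take any $\lvert \lambda \rvert \leq 1 / \max_k \alpha_k$; conditioning on $\mathcal{F}_{n-1}$ and peeling off one term at a time gives
$$\mathbb{E}\left[e^{\lambda \sum_{i=1}^n \Delta_i}\right] = \mathbb{E}\left[e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\, \mathbb{E}\left[e^{\lambda \Delta_n} \mid \mathcal{F}_{n-1}\right]\right] \leq e^{\lambda^2 \nu_n^2 / 2}\, \mathbb{E}\left[e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\right] \leq \cdots \leq e^{\lambda^2 \sum_{k=1}^n \nu_k^2 / 2},$$
which is the desired bound with $\nu = \sqrt{\sum_{k=1}^n \nu_k^2}$ and $\alpha = \max_k \alpha_k$.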
As a direct consequence of this result, it is again straightforward to see that a bounded martingale difference sequence is sub-exponential, so that bounds such as Hoeffding's inequality hold. The resulting bound is often called the Azuma-Hoeffding inequality:
If $\{\Delta_i\}_{i \geq 1}$ is a martingale difference sequence with filtration $\{\mathcal{F}_i\}_{i \geq 1}$ such that $\Delta_i \in [a_i, b_i]$ almost surely for all $i$, then:
$$\mathbb{P}\left[\left\lvert \sum_{i=1}^n \Delta_i \right\rvert \geq t\right] \leq 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
Proof: Because $\Delta_i \in [a_i, b_i]$ is bounded, it is (conditionally) sub-Gaussian with constant $\sigma_i = (b_i - a_i)/2$, as shown here. This tells us that:
$$\mathbb{E}\left[e^{\lambda \Delta_i} \mid \mathcal{F}_{i-1}\right] \leq e^{\lambda^2 (b_i - a_i)^2 / 8}.$$
As a result, the sum of the $\Delta_i$'s is sub-Gaussian with constant $\frac{1}{2}\sqrt{\sum_{i=1}^n (b_i - a_i)^2}$, so that we may apply Hoeffding's inequality and obtain the result above.
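As a quick numerical sanity check (not part of the proof), here is a short simulation sketch in Python; the choice of increments, the sample sizes, and the variable names below are just illustrative, not anything prescribed by the result itself. It draws independent $\mathrm{Uniform}[-1, 1]$ increments, which form a bounded martingale difference sequence with $a_i = -1$ and $b_i = 1$, and compares the empirical tail probability of their sum against the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 100_000, 15.0

# Independent Uniform[-1, 1] increments: a zero-mean, bounded
# martingale difference sequence with a_i = -1 and b_i = 1.
deltas = rng.uniform(-1.0, 1.0, size=(trials, n))
sums = deltas.sum(axis=1)

# Empirical estimate of P[|sum_i Delta_i| >= t] over many trials.
empirical = np.mean(np.abs(sums) >= t)

# The bound above: 2 * exp(-2 t^2 / sum_i (b_i - a_i)^2), with (b_i - a_i)^2 = 4.
bound = 2.0 * np.exp(-2.0 * t**2 / (4.0 * n))

print(f"empirical tail probability: {empirical:.5f}")
print(f"tail bound:                 {bound:.5f}")
```

The empirical tail probability should come out well below the bound, which is expected: the inequality only uses boundedness of the increments, not their full distribution.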
Concentration of Functions with the Bounded Difference Property
Sub-exponential martingale difference sequences are super useful! Really, the fact that bounded random variables are sub-Gaussian is a helpful result that, combined with martingale difference sequences, allows us to apply tail bounds to stochastic processes.
Martingales aren't limited to sums of independent random variables, however. So it is natural to ask whether we can extend concentration results to a more general construction such as the Doob martingale. Why is such a generalization useful? To answer that, let's have a look at one important example.
Suppose we have a function $f : \mathbb{R}^n \to \mathbb{R}$ and we are interested in understanding the concentration of $f(X)$ around its mean, $\mathbb{E}[f(X)]$. Now, what does the Doob martingale look like? Well, $Y_0 = \mathbb{E}[f(X)]$, $Y_n = f(X)$, and $Y_k = \mathbb{E}[f(X) \mid \{X_i\}_{i=1}^k]$. But we can write:
$$f(X) - \mathbb{E}[f(X)] = Y_n - Y_0 = \sum_{k=1}^n (Y_k - Y_{k-1}).$$
The right-hand side is the sum of martingale differences! So indeed we may be able to impose some conditions on the function $f(\cdot)$ and use our results from the previous section to bound the deviation of $f(X)$ from its mean. The following result does exactly that:
McDiarmid's Inequality: Suppose $f$ has the property that $\lvert f(x) - f(x^{\backslash k}) \rvert \leq \alpha_k$ for every $x$, where $x^{\backslash k}$ denotes a copy of the vector $x$ that differs from $x$ in the $k$-th coordinate only. In other words, changing the $k$-th coordinate of the input changes the output of $f$ by at most $\alpha_k$. If the random vector $X$ has independent components, then:
$$\mathbb{P}\left[\left\lvert f(X) - \mathbb{E}[f(X)] \right\rvert \geq t\right] \leq 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^n \alpha_i^2}\right).$$
Proof: We have already seen that $f(X) - \mathbb{E}[f(X)]$ can be expressed as the sum of a martingale difference sequence. If we can show that $\Delta_k = Y_k - Y_{k-1}$ belongs to an interval of length at most $\alpha_k$, then we can apply Hoeffding's inequality (in the form above) to obtain the desired result. So we'll focus on bounding the $\Delta_k$'s.
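One standard way to do this is to sandwich $\Delta_k$ between two $\mathcal{F}_{k-1}$-measurable quantities (the names $U_k$ and $L_k$ below are introduced just for this argument). Define
$$U_k := \sup_{x}\, \mathbb{E}\big[f(X) \mid X_1, \ldots, X_{k-1}, X_k = x\big] - Y_{k-1}, \qquad L_k := \inf_{x}\, \mathbb{E}\big[f(X) \mid X_1, \ldots, X_{k-1}, X_k = x\big] - Y_{k-1},$$
so that $L_k \leq \Delta_k \leq U_k$ almost surely. Moreover,
$$U_k - L_k = \sup_{x, x'} \Big( \mathbb{E}\big[f(X) \mid X_1, \ldots, X_{k-1}, X_k = x\big] - \mathbb{E}\big[f(X) \mid X_1, \ldots, X_{k-1}, X_k = x'\big] \Big) \leq \alpha_k,$$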
where we used the fact that, by independence of the components, $\mathbb{E}\big[f(X) \mid \{x_i\}_{i \leq k}\big] = \mathbb{E}_{\{X_i\}_{i \geq k+1}}\big[f(x_1, x_2, \ldots, x_k, X_{k+1}, \ldots, X_n)\big]$, together with the bounded difference property of $f$. Since $U_k - L_k \leq \alpha_k$, each $\Delta_k$ lies in an interval of length at most $\alpha_k$, which is what we needed.
Let's illustrate the result by applying it to a rather ubiquitous random variable: if $\mathcal{X} \subset \mathbb{R}^n$ is a collection of vectors, then:
$$\mathcal{R}(\mathcal{X}) := \mathbb{E}_{\xi}\left[\sup_{u \in \mathcal{X}} \langle u, \xi \rangle\right],$$
where $\xi \in \{-1, 1\}^n$ is a random Rademacher vector (i.e., a vector of i.i.d. symmetric $\pm 1$ entries), is called the Rademacher complexity of the set $\mathcal{X}$. Now consider the random variable $Z = \sup_{u \in \mathcal{X}} \langle u, \xi \rangle$ and notice that it satisfies the bounded difference property of the previous theorem: