Concentration of Martingale Difference Sequences

Sub-exponentiality is an important property, as we saw in earlier posts. We learnt, for example, that sums of independent sub-exponential random variables are themselves sub-exponential, which let us extend concentration results to sums. In this post, I’ll review another important result that allows us to apply the same machinery (such as Hoeffding’s inequality) to functions of independent random variables, under certain conditions.

We have so far reviewed concentration results for sub-Gaussian and sub-exponential random variables. Much of what we studied concerned sums of independent random variables: For example, we showed that the sum of independent sub-Gaussian random variables is itself a sub-Gaussian random variable, and used those results to derive inequalities.

In this post, we’ll take a look at the more general notion of stochastic processes and, in particular, a special stochastic process called a martingale. This machinery lets us derive concentration bounds for certain functions of independent random variables, not just their sums. First, though, let’s review some of the basics.

Martingales

Consider a sequence of $\sigma$-algebras $\{ \mathcal{F}_n \}$ such that $\mathcal{F}_i \subseteq \mathcal{F}_{i+1}$; that’s called a filtration. A sequence of random variables $\{ X_n \}_{n \geq 0}$, where each $X_n$ is $\mathcal{F}_n$-measurable, is a martingale with respect to that filtration if:

$$ \mathbb{E}\big[ \lvert X_n \rvert \big] < \infty \quad \text{and} \quad \mathbb{E}\big[ X_{n+1} \;|\; \mathcal{F}_n \big] = X_n, \quad \forall\, n \geq 0. $$

It’s easy to verify that the partial sum of independent zero-mean random variables is a martingale. That is, given a sequence of independent random variables $\{ X_i \}_{i=1}^n$ with mean zero, $S_k=\sum_{i=1}^k X_i$ for $k \in [n]$ is a martingale with respect to the $\sigma$-algebras generated by the $X_i$’s. So is the product of independent random variables with mean $1$: $P_k=\prod_{i=1}^k X_i$.
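For the partial sum, for instance, both martingale properties follow directly from independence and the zero means: for each $k$,

$$ \mathbb{E}\big[ S_{k+1} \;|\; X_1, \ldots, X_k \big] = S_k + \mathbb{E}\big[ X_{k+1} \;|\; X_1, \ldots, X_k \big] = S_k + \mathbb{E}[X_{k+1}] = S_k, $$

and $\mathbb{E}[\lvert S_k \rvert] \leq \sum_{i=1}^k \mathbb{E}[\lvert X_i \rvert]$, which is finite whenever each $X_i$ is integrable.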

The Doob martingale is another example, which is constructed as follows: Say we have a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ and a sequence of $n$ random variables $\{ X_i \}_{i=1}^n$. If we define $Y_k=\mathbb{E}\big[ f(X) \;|\; \{ X_i \}_{i=1}^{k} \big]$ (or, more precisely, conditioning on the $\sigma$-algebra generated by the first $k$ random variables), so that $Y_0=\mathbb{E}[f(X)]$ and $Y_n=f(X)$, then the sequence $\{ Y_k \}_{k=0}^n$ is a martingale provided $\mathbb{E}[\lvert f(X) \rvert]$ exists.

Proof: We have to show that the sequence $\{ Y_k \}$ has the properties of a martingale. First, notice that:

$$ \mathbb{E} \Big[ \lvert Y_k \rvert \Big] = \mathbb{E} \Big[ \big\lvert \mathbb{E}\big[ f(X) \;|\; \{ X_i \}_{i=1}^k \big] \big\rvert \Big] \leq \mathbb{E} \Big[ \mathbb{E}\big[ \lvert f(X) \rvert \;|\; \{ X_i \}_{i=1}^k \big] \Big] = \mathbb{E}\Big[ \lvert f(X) \rvert \Big], $$

which exists by assumption; the inequality is Jensen’s inequality applied to the conditional expectation. As for the second property, the tower property of conditional expectation gives:

$$ \mathbb{E}\Big[ Y_{k+1} \;|\; \{X_i\}_{i=1}^k \Big] = \mathbb{E}\Big[ \mathbb{E}\big[ f(X) \;|\; \{ X_i\}_{i=1}^{k+1} \big] \;|\; \{X_i\}_{i=1}^k \Big] = \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i=1}^{k} \Big] = Y_k. $$
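To make the Doob construction concrete, take $f(x)=\sum_{i=1}^n x_i$ with independent, integrable $X_i$’s. Then

$$ Y_k = \mathbb{E}\Big[ \sum_{i=1}^n X_i \;\Big|\; \{X_i\}_{i=1}^k \Big] = \sum_{i=1}^k X_i + \sum_{i=k+1}^n \mathbb{E}[X_i], $$

so the Doob martingale interpolates between $Y_0=\sum_{i=1}^n \mathbb{E}[X_i]$ and $Y_n=\sum_{i=1}^n X_i$ as the coordinates are revealed one at a time.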

A related notion is the martingale difference sequence: a sequence $\{ \Delta_i \}_{i\geq 1}$ adapted to a filtration $\{ \mathcal{F}_i \}_{i \geq 1}$ is a martingale difference sequence if $\mathbb{E}[\lvert \Delta_k \rvert]$ is finite and $\mathbb{E}[\Delta_k \;|\; \mathcal{F}_{k-1}]=0$ for all $k$. For example, when $\{ X_i \}_{i\geq 0}$ is a martingale with respect to some filtration $\{ \mathcal{F}_i\}_{i \geq 0}$, then $\{ \Delta_i \}_{i \geq 1}$ with $\Delta_i=X_i-X_{i-1}$ is a martingale difference sequence with respect to the same filtration.

Proof: The first property follows from the triangle inequality, since $\mathbb{E}[\lvert \Delta_k \rvert] \leq \mathbb{E}[\lvert X_k \rvert] + \mathbb{E}[\lvert X_{k-1} \rvert] < \infty$. As for the second property:

$$ \mathbb{E}\Big[\Delta_{k+1} \;|\; \mathcal{F}_k \Big] = \mathbb{E}\Big[ X_{k+1} - X_k \;|\; \mathcal{F}_k \Big] = \mathbb{E}\Big[ X_{k+1} \;|\; \mathcal{F}_k \Big] - X_k = X_k-X_k=0. $$

Tail Bounds for Sub-Exponential Martingale Difference Sequences

Let’s take a martingale difference sequence $\{ \Delta_i \}_{i \geq 1}$ with filtration $\{ \mathcal{F}_i \}_{i \geq 1}$, and impose the following constraint on it:

$$ \mathbb{E}\Big[ e^{\lambda \Delta_k} \;|\; \mathcal{F}_{k-1}\Big] \leq e^{\lambda^2 \nu_k^2/2}, \quad \forall \; \lvert \lambda \rvert \leq \frac{1}{\alpha_k}. $$

In other words, we require that, conditionally on $\mathcal{F}_{k-1}$, $\Delta_k$ is sub-exponential with parameters $(\nu_k, \alpha_k)$ for all $k$. Then it’s not hard to see the following result:

Any partial sum of sub-exponential martingale differences, $\sum_{i=1}^n \Delta_i$, with parameters $(\nu_k, \alpha_k)$ for all $k$, is itself a sub-exponential random variable.

Proof: Let’s start with the moment generating function of the sum. Since $\sum_{i=1}^{n-1} \Delta_i$ is $\mathcal{F}_{n-1}$-measurable, the tower property gives, for all $\lvert \lambda \rvert \leq 1/\alpha_n$:

$$ \mathbb{E}\Big[ e^{\lambda \sum_{i=1}^n \Delta_i} \Big] = \mathbb{E}\Big[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\, \mathbb{E}\big[ e^{\lambda \Delta_n} \;|\; \mathcal{F}_{n-1} \big] \Big] \leq \mathbb{E}\Big[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i} \Big]\, e^{\lambda^2 \nu_n^2/2}. $$

Continuing with this process iteratively over $k=n-1, \ldots, 1$, we find that the partial sum is indeed sub-exponential with parameters $\Big( \sqrt{\sum_{i=1}^n \nu_i^2},\; \max_{i \in [n]} \alpha_i \Big)$, since every step is valid for $\lvert \lambda \rvert \leq 1/\max_{i \in [n]} \alpha_i$.
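Written out, the iterated bound is:

$$ \mathbb{E}\Big[ e^{\lambda \sum_{i=1}^n \Delta_i} \Big] \leq \exp\Big( \frac{\lambda^2 \sum_{i=1}^n \nu_i^2}{2} \Big), \quad \forall \; \lvert \lambda \rvert \leq \frac{1}{\max_{i \in [n]} \alpha_i}, $$

which is exactly the sub-exponential condition with the stated parameters.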

As a result, we can apply tail bounds for sub-exponential random variables to the random variable $\sum_{i=1}^n \Delta_i$. For example:

$$ \mathbb{P}\Big[ \Big\lvert \sum_{i=1}^n \Delta_i \Big\rvert \geq t \Big] \leq \begin{cases} 2\exp\Big( -\frac{t^2}{2\sum_{i=1}^n \nu_i^2} \Big) & 0 \leq t \leq \frac{\sum_{i=1}^n \nu_i^2}{\max_{i \in [n]} \alpha_i };\\[1ex] 2\exp\Big( -\frac{t}{2 \max_{i \in [n]} \alpha_i} \Big) & t > \frac{\sum_{i=1}^n \nu_i^2}{\max_{i \in [n]} \alpha_i }. \end{cases} $$

As a direct consequence of this result, it is again straightforward to see that, if a martingale difference sequence is bounded, it is sub-exponential (in fact, sub-Gaussian), and as such bounds such as Hoeffding’s inequality hold; this is the Azuma-Hoeffding inequality:

If $\{ \Delta_i \}_{i \geq 1}$ is a martingale difference sequence with filtration $\{ \mathcal{F}_i \}_{i \geq 1}$ with the property that $\Delta_i \in [a_i, b_i]$ almost surely for all $i$, then:

$$ \mathbb{P}\Big[ \Big\lvert \sum_{i=1}^n \Delta_i \Big\rvert \geq t \Big] \leq 2e^{-2t^2/\sum_{i=1}^n (b_i - a_i)^2}. $$

Proof: Because $\Delta_i \in [a_i, b_i]$ is bounded, it is (conditionally on $\mathcal{F}_{i-1}$) sub-Gaussian with constant $\sigma_i=(b_i - a_i)/2$, as shown here. This tells us that:

$$ \mathbb{E}\Big[ e^{\lambda \Delta_i} \;|\; \mathcal{F}_{i-1} \Big] \leq e^{\lambda^2 (b_i - a_i)^2/8}. $$

As a result, the sum of the $\Delta_i$’s is sub-Gaussian with constant $\sqrt{\sum_{i=1}^n (b_i - a_i)^2}/2$, so that we may apply Hoeffding’s inequality and obtain the result above.
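As a quick sanity check, here is a minimal simulation sketch (assuming NumPy is available; the step distribution and numbers are purely illustrative) that compares the empirical tail of a sum of bounded martingale differences, here independent Rademacher steps with $\Delta_i \in [-1, 1]$, against the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, t = 100, 200_000, 25.0

# Independent Rademacher steps form a bounded martingale difference
# sequence with Delta_i in [-1, 1], so (b_i - a_i)^2 = 4 for each i.
steps = rng.choice([-1.0, 1.0], size=(trials, n))
sums = steps.sum(axis=1)

empirical = np.mean(np.abs(sums) >= t)
bound = 2 * np.exp(-2 * t**2 / (4 * n))  # 2 exp(-2 t^2 / sum (b_i - a_i)^2)

print(f"empirical tail P[|S_n| >= {t}]: {empirical:.3e}")
print(f"Azuma-Hoeffding bound:          {bound:.3e}")
```

The printed bound should dominate the empirical frequency, typically by a comfortable margin.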

Concentration of Functions with the Bounded Differences Property

Sub-exponential martingale difference sequences are super useful! Really, though, the fact that bounded random variables are sub-Gaussian is a helpful result that, combined with martingale difference sequences, allows us to apply tail bounds to stochastic processes.

Martingales aren’t limited to sums of independent random variables, however. So it is natural to ask whether we can extend concentration results to a more general construction such as the Doob martingale. Why is such a generalization useful? To answer that, let’s have a look at one important example.

Suppose we have a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ and we are interested in understanding the concentration of $f(X)$ around its mean, $\mathbb{E}[f(X)]$. Now, what does the Doob martingale look like? Well, $Y_0=\mathbb{E}[f(X)]$, $Y_n=f(X)$, and $Y_k=\mathbb{E}[f(X) \;|\; \{ X_i \}_{i=1}^k]$. But we can write:

$$ f(X)-\mathbb{E}[f(X)] = Y_n - Y_0 = \sum_{k=1}^n (Y_k - Y_{k-1}). $$

The right-hand side is a sum of martingale differences! So indeed we may be able to impose some conditions on the function $f(\cdot)$ and use our results from the previous section to bound the deviation of $f(X)$ from its mean. The following result does exactly that:

McDiarmid’s Inequality: Suppose $f$ has the property that $\lvert f(x) - f(x^{\setminus k}) \rvert \leq \alpha_k$ for every $x$ and every $k$, where $x^{\setminus k}$ denotes a copy of the vector $x$ that differs from $x$ in the $k$-th coordinate only. In other words, changing the $k$-th coordinate of the input changes the output of $f$ by at most $\alpha_k$. If the random vector $X$ has independent components, then:

$$ \mathbb{P}\Big[ \big\lvert f(X) - \mathbb{E}[f(X)] \big\rvert \geq t \Big] \leq 2 \exp\Big(- \frac{2t^2}{\sum_{i=1}^n \alpha_i^2}\Big). $$

Proof: We have already seen that $f(X) - \mathbb{E}[f(X)]$ can be expressed as the sum of a martingale difference sequence. If we show that each $\Delta_k=Y_k - Y_{k-1}$ belongs to an interval of length at most $\alpha_k$, then we can apply Hoeffding’s inequality from the previous section to obtain the desired result. So we’ll focus on bounding the $\Delta_k$’s.

By definition, we have that:

$$ \Delta_k=\mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i=1}^k \Big] - \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i=1}^{k-1} \Big]. $$

Now define the following random variables:

$$ U_k=\sup_{x \in \mathbb{R}} \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1}, x \Big]- \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1} \Big], $$

and,

$$ L_k=\inf_{x \in \mathbb{R}} \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1}, x \Big]- \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1} \Big]. $$

Notice that $L_k \leq \Delta_k \leq U_k$. But now:

$$ \begin{aligned} U_k - L_k &= \sup_{x \in \mathbb{R}} \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1}, x \Big]- \inf_{x \in \mathbb{R}} \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1}, x \Big] \\ &\leq \sup_{x,y \in \mathbb{R}} \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1}, x \Big]- \mathbb{E}\Big[ f(X) \;|\; \{X_i\}_{i \leq k-1}, y \Big] \\ &= \sup_{x,y \in \mathbb{R}} \mathbb{E}_{\{X_i\}_{i \geq k+1}}\Big[ f(\{X_i\}_{i \leq k-1}, x, \{X_i\}_{i \geq k+1}) \Big] - \mathbb{E}_{\{X_i\}_{i \geq k+1}}\Big[ f(\{X_i\}_{i \leq k-1}, y, \{X_i\}_{i \geq k+1}) \Big] \\ &\leq \alpha_k, \end{aligned} $$

where we used the fact that, by independence of the components of $X$, $\mathbb{E}[f(X) \;|\; \{x_i\}_{i \leq k}] = \mathbb{E}_{\{X_i\}_{i \geq k+1}}[f(x_1, x_2, \ldots, x_k, X_{k+1}, \ldots, X_n)]$. Since each $\Delta_k$ thus lies in an interval of length at most $\alpha_k$, applying the bound from the previous section yields the stated inequality.
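For example, if $f(x)=\frac{1}{n}\sum_{i=1}^n x_i$ is the sample mean of independent variables taking values in $[0,1]$, then changing any single coordinate moves $f$ by at most $\alpha_k = 1/n$, and McDiarmid’s inequality recovers the familiar Hoeffding bound:

$$ \mathbb{P}\Big[ \big\lvert f(X) - \mathbb{E}[f(X)] \big\rvert \geq t \Big] \leq 2\exp\big( -2nt^2 \big). $$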

Let’s illustrate the result by applying it to a rather ubiquitous random variable: If $\mathcal{X} \subset \mathbb{R}^n$ is a collection of vectors, then:

$$ \mathcal{R}(\mathcal{X}) := \mathbb{E}_\xi \big[\sup_{u \in \mathcal{X}} \langle u, \xi \rangle \big], $$

where $\xi \in \{-1, 1\}^n$ is a random Rademacher vector (each coordinate is $\pm 1$ with probability $1/2$, independently), is called the Rademacher complexity of the set $\mathcal{X}$. Now consider the random variable $Z=\sup_{u \in \mathcal{X}} \langle u, \xi \rangle$ and notice that it has the bounded differences property of the previous theorem:

$$ \sup_{\xi, \xi^{\setminus k}} \lvert Z - Z^{\setminus k} \rvert \leq \sup_{\xi, \xi^{\setminus k}} \sup_{u \in \mathcal{X}} \lvert \langle u, \xi - \xi^{\setminus k} \rangle \rvert = \sup_{\xi, \xi^{\setminus k}} \sup_{u \in \mathcal{X}} \lvert u_k (\xi_k - \xi^{\setminus k}_k) \rvert \leq 2\sup_{u \in \mathcal{X}} \lvert u_k \rvert, $$

where the first inequality holds because a difference of suprema is bounded by the supremum of the differences.

As a result:

$$ \mathbb{P}\Big[ \lvert Z - \mathcal{R}(\mathcal{X}) \rvert \geq t \Big] \leq 2\exp\Big(-\frac{2t^2}{\alpha^2}\Big), $$

where $\alpha^2=4\sum_{k=1}^n \sup_{u \in \mathcal{X}} u_k^2$.
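To close, here is a minimal simulation sketch (assuming NumPy; the finite collection $\mathcal{X}$ below is arbitrary and purely illustrative) that estimates $\mathcal{R}(\mathcal{X})$ by Monte Carlo and compares the empirical deviation of $Z$ against the bound above.

```python
import numpy as np

rng = np.random.default_rng(2)

n, m, trials, t = 20, 5, 100_000, 3.0

# A small, arbitrary collection of m unit vectors in R^n (rows of U).
U = rng.normal(size=(m, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Z = sup_{u in X} <u, xi> for Rademacher vectors xi.
xi = rng.choice([-1.0, 1.0], size=(trials, n))
Z = (xi @ U.T).max(axis=1)

rademacher_complexity = Z.mean()           # Monte Carlo estimate of R(X)
alpha_sq = 4 * (U**2).max(axis=0).sum()    # alpha^2 = 4 * sum_k sup_u u_k^2

# Empirical tail, centering at the Monte Carlo estimate of R(X).
empirical = np.mean(np.abs(Z - rademacher_complexity) >= t)
bound = 2 * np.exp(-2 * t**2 / alpha_sq)

print(f"estimated R(X):  {rademacher_complexity:.3f}")
print(f"empirical tail:  {empirical:.3e}")
print(f"McDiarmid bound: {bound:.3e}")
```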