Capacity of Neural Networks (5): Sauer’s Lemma


In this note we prove Sauer’s lemma, which plays a key role in connecting VC-dimension and Rademacher complexity. We follow the proof in Section 8.3 of (Vershynin, 2017). The proof has two steps. First, we prove Pajor’s lemma: for any class $\mathcal{F}$ of boolean functions defined on a finite set $\Omega$ , its cardinality $\vert\mathcal{F}\vert$ is at most the number of subsets of $\Omega$ that it shatters,

$\begin{equation} \vert\{\Lambda\subset\Omega\mid \Lambda \text{ is shattered by }\mathcal{F}\}\vert\,. \end{equation}$

Then, if the VC-dimension of $\mathcal{F}$ is $D$ and $\vert\Omega\vert = n$ with $1 \le D \le n$ , every shattered subset has at most $D$ elements, so the number of shattered subsets of $\Omega$ is at most

$\begin{align} \sum_{k=0}^D \binom{n}{k} & \le \left(\frac{n}{D}\right)^D\sum_{k=0}^D\binom{n}{k}\left(\frac{D}{n}\right)^k \\ & \le \left(\frac{n}{D}\right)^D\left(1 + \frac{D}{n}\right)^n \\ & \le \left(\frac{n}{D}\right)^De^D\,. \end{align}$
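The chain of inequalities can be sanity-checked numerically. The following is a minimal sketch, not part of the original proof; the helper names `binomial_sum` and `sauer_bound` are made up for illustration, and the check assumes $1 \le D \le n$.

```python
from math import comb, e


def binomial_sum(n: int, d: int) -> int:
    """Left-hand side: sum of C(n, k) for k = 0..d."""
    return sum(comb(n, k) for k in range(d + 1))


def sauer_bound(n: int, d: int) -> float:
    """Right-hand side: (n / d)^d * e^d, assuming 1 <= d <= n."""
    return (e * n / d) ** d


if __name__ == "__main__":
    # Compare both sides for a few hypothetical (n, D) pairs.
    for n, d in [(10, 3), (50, 5), (100, 10)]:
        print(n, d, binomial_sum(n, d), sauer_bound(n, d))
```

For a fixed $D$ the bound grows polynomially in $n$, while the total number of subsets of $\Omega$ is $2^n$.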

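Now let’s prove Pajor’s lemma by induction on $\vert\Omega\vert$ . We take the convention that the empty set is shattered by every nonempty class, so it always counts as one shattered subset. If $\vert\Omega\vert = 1$ , then a nonempty $\mathcal{F}$ contains either 1 or 2 boolean functions. If there is 1 function, then only the empty set is shattered, so the number of shattered subsets is 1. If there are 2 functions, then both subsets of $\Omega$ are shattered, so the number of shattered subsets is 2. In both cases the bound holds.

Assume the conclusion holds for $\vert\Omega\vert = n$ . For an $\Omega$ with $n+1$ points, we take out one point $x_0$ and denote the set of remaining points by $\Omega_n$ . The function class $\mathcal{F}$ can then be split into two disjoint subclasses $\mathcal{F}_0$ and $\mathcal{F}_1$ according to the value at $x_0$ . Two distinct functions in the same subclass agree at $x_0$ , so they remain distinct when restricted to $\Omega_n$ . Denote the collections of subsets of $\Omega_n$ shattered by $\mathcal{F}_0$ and $\mathcal{F}_1$ by $S_0$ and $S_1$ , respectively, and the collection of subsets of $\Omega$ shattered by $\mathcal{F}$ by $S$ . By the induction hypothesis, $\vert\mathcal{F}_0\vert \le \vert S_0\vert$ and $\vert\mathcal{F}_1\vert \le \vert S_1\vert$ (if a subclass is empty, it shatters nothing and the inequality reads $0 \le 0$ ). Since $\vert\mathcal{F}\vert = \vert\mathcal{F}_0\vert + \vert\mathcal{F}_1\vert$ , it suffices to show that $\vert S_0\vert+\vert S_1\vert\le\vert S\vert$ .

Every subset in $S_0 \cup S_1$ is shattered by $\mathcal{F}_0$ or by $\mathcal{F}_1$ , hence by $\mathcal{F}$ , so $S_0 \cup S_1 \subset S$ . For each subset $A$ in $S_0 \cap S_1$ , the set $A\cup\{x_0\}$ is also shattered by $\mathcal{F}$ : given any labeling of $A\cup\{x_0\}$ , if the label of $x_0$ is 0, we can find a function in $\mathcal{F}_0$ that realizes the labeling on $A$ , and if the label of $x_0$ is 1, we can find such a function in $\mathcal{F}_1$ . The sets $A\cup\{x_0\}$ contain $x_0$ , so they are distinct from all members of $S_0 \cup S_1$ , which are subsets of $\Omega_n$ . Therefore $\vert S\vert \ge \vert S_0\cup S_1\vert + \vert S_0\cap S_1\vert = \vert S_0\vert + \vert S_1\vert$ , which completes the induction.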
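Pajor’s lemma can also be checked by brute force on small examples. The sketch below is only an illustration under stated assumptions: boolean functions on $\Omega = \{0, \dots, n-1\}$ are represented as tuples of 0/1 values, the empty set counts as shattered by any nonempty class, and the helper names `is_shattered`, `shattered_subsets`, and `check_pajor` are hypothetical.

```python
import random
from itertools import combinations, product


def is_shattered(subset, functions):
    """True if `functions`, restricted to `subset`, realize all 2^|subset| labelings.

    The empty subset is shattered by any nonempty class (one empty labeling).
    """
    patterns = {tuple(f[i] for i in subset) for f in functions}
    return len(patterns) == 2 ** len(subset)


def shattered_subsets(n, functions):
    """List all subsets of {0, ..., n-1} shattered by `functions`."""
    return [
        s
        for k in range(n + 1)
        for s in combinations(range(n), k)
        if is_shattered(s, functions)
    ]


def check_pajor(n, trials, seed):
    """Test |F| <= #shattered subsets on random nonempty classes F."""
    rng = random.Random(seed)
    all_functions = list(product((0, 1), repeat=n))
    for _ in range(trials):
        size = rng.randint(1, len(all_functions))
        sampled = rng.sample(all_functions, size)
        if len(sampled) > len(shattered_subsets(n, sampled)):
            return False
    return True


if __name__ == "__main__":
    # Three functions on two points: the class shatters the empty set
    # and both singletons, but not the full set (labeling (1, 0) is missing).
    print(shattered_subsets(2, [(0, 0), (0, 1), (1, 1)]))  # [(), (0,), (1,)]
    print(check_pajor(4, 200, 0))
```

In the small example the class has 3 functions and shatters exactly 3 subsets, so the bound holds with equality.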

Reference

Vershynin, R. (2017). High Dimensional Probability.

Yitong Sun
