Capacity of Neural Networks (1): Rademacher Complexity


Current sample complexity analysis of supervised learning depends heavily on the capacity analysis of the hypothesis class. There are many different quantities characterizing this capacity; among them, the most widely used are the VC-dimension and the Rademacher/Gaussian complexity. In this series of notes, we will first prove some general properties of the Rademacher/Gaussian complexity, then establish capacity results for neural networks, and finally discuss the relation between the VC-dimension and the Rademacher complexity.

Rademacher and Gaussian Complexity

In this note, we write $\varepsilon_1,\dots,\varepsilon_n$ for a list of independent Rademacher random variables (uniform on $\{-1,+1\}$) and $g_1,\dots,g_n$ for a list of independent standard Gaussian random variables. The so-called empirical process, a Banach-space-valued random variable, has been studied intensively to understand properties of Banach spaces. It is defined in the following way,

$$X \;=\; \sum_{i=1}^n \varepsilon_i x_i,$$

where the $x_i$'s are vectors in a Banach space $(B, \|\cdot\|)$. We are interested in the expectation of its norm, $\mathbb{E}_\varepsilon\left\|\sum_{i=1}^n \varepsilon_i x_i\right\|$. For this quantity to be well defined, that is, measurable, integrable, etc., some assumptions have to be made carefully; see (Ledoux & Talagrand, 1991) for details. We will not worry about these technical aspects here.

In our setup, the space consists of real-valued functions on the hypothesis class $\mathcal{F}$: each sample point $z_i$ gives a vector $x_i$ via $x_i(f) = f(z_i)$ for $f \in \mathcal{F}$. Together with the quantity

$$\sup_{f \in \mathcal{F}} v(f)$$

playing the role of the norm in a Banach space, this is a space analogous to a Banach space, although not quite the same, since the supremum is taken without an absolute value and is therefore not a true norm. The space $\ell^\infty(\mathcal{F})$ of bounded functions on $\mathcal{F}$ with the genuine sup-norm $\sup_{f\in\mathcal{F}}|v(f)|$ is a special case of the space we defined above (it is recovered when $\mathcal{F}$ is symmetric), and it is a Banach space. For simplicity, we will denote $\sup_{f\in\mathcal{F}} v(f)$ by $\|v\|_{\mathcal{F}}$.

The Rademacher complexity is defined by

$$\mathcal{R}_n(\mathcal{F}) \;:=\; \mathbb{E}_\varepsilon \left\| \frac{1}{n}\sum_{i=1}^n \varepsilon_i x_i \right\|_{\mathcal{F}} \;=\; \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(z_i),$$

and similarly the Gaussian complexity is defined by

$$\mathcal{G}_n(\mathcal{F}) \;:=\; \mathbb{E}_g \left\| \frac{1}{n}\sum_{i=1}^n g_i x_i \right\|_{\mathcal{F}} \;=\; \mathbb{E}_g \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n g_i f(z_i).$$

By Azuma's inequality with the bounded difference condition (i.e., McDiarmid's inequality), the sample complexity of ERM over a hypothesis class is controlled by the Rademacher/Gaussian complexity of that class. In the study of fast learning rates, under extra conditions and with the more powerful Talagrand concentration inequality, the sample complexity is likewise reduced to a control on the Rademacher/Gaussian complexity.
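
To make the definition concrete, here is a small Monte Carlo sketch (an illustration, not part of the argument) that estimates the empirical Rademacher and Gaussian complexity of a hypothetical toy class: linear predictors $f_w(z) = \langle w, z\rangle$ with $\|w\|_2 \le 1$, for which the supremum has the closed form $\sup_{\|w\|_2\le 1}\frac{1}{n}\sum_i \sigma_i\langle w, z_i\rangle = \frac{1}{n}\left\|\sum_i \sigma_i z_i\right\|_2$.

```python
import numpy as np

def empirical_complexity(Z, n_trials=2000, noise="rademacher", seed=0):
    """Monte Carlo estimate of the empirical Rademacher/Gaussian complexity
    of the toy class {z -> <w, z> : ||w||_2 <= 1} on a fixed sample Z (n x d).

    For this class the supremum over the class has a closed form:
        sup_{||w|| <= 1} (1/n) sum_i s_i <w, z_i> = (1/n) || sum_i s_i z_i ||_2.
    """
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    vals = np.empty(n_trials)
    for t in range(n_trials):
        if noise == "rademacher":
            s = rng.choice([-1.0, 1.0], size=n)   # independent Rademacher signs
        else:
            s = rng.standard_normal(n)            # independent standard Gaussians
        vals[t] = np.linalg.norm(s @ Z) / n       # closed-form supremum over the unit ball
    return vals.mean()

# A fixed toy sample: n = 200 points in dimension 5.
Z = np.random.default_rng(1).standard_normal((200, 5))
print("Rademacher complexity ~", empirical_complexity(Z, noise="rademacher"))
print("Gaussian complexity   ~", empirical_complexity(Z, noise="gaussian"))
```

For this class and Gaussian data, both estimates scale roughly like $\sqrt{d/n}$, which is the kind of decay the sample complexity bounds exploit.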

Equivalence Between $\mathcal{R}_n(\mathcal{F})$ and $\mathcal{G}_n(\mathcal{F})$

First, let's show the equivalence of these two quantities; that is,

$$\sqrt{\frac{2}{\pi}}\,\mathcal{R}_n(\mathcal{F}) \;\le\; \mathcal{G}_n(\mathcal{F}) \;\le\; \sqrt{2\log(2n)}\,\mathcal{R}_n(\mathcal{F}).$$

For the left part, we need the trick $\mathbb{E}|g_i| = \sqrt{2/\pi}$, so that $x_i = \sqrt{\pi/2}\,\mathbb{E}\big[|g_i|\big]\,x_i$. So (dropping the common factor $1/n$)

$$\mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i x_i\right\|_{\mathcal{F}}
= \sqrt{\frac{\pi}{2}}\,\mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i\,\mathbb{E}\big[|g_i|\big]\,x_i\right\|_{\mathcal{F}}
\le \sqrt{\frac{\pi}{2}}\,\mathbb{E}_{\varepsilon,g}\left\|\sum_i \varepsilon_i |g_i|\,x_i\right\|_{\mathcal{F}}
= \sqrt{\frac{\pi}{2}}\,\mathbb{E}_g\left\|\sum_i g_i x_i\right\|_{\mathcal{F}},$$

where the inequality is Jensen's inequality ($\|\cdot\|_{\mathcal{F}}$ is a supremum of linear functionals, hence convex) and the last equality holds because $\varepsilon_i|g_i|$ has the same distribution as $g_i$.

For the right part, note that $g_i$ has the same distribution as $\varepsilon_i|g_i|$, with $\varepsilon_i$ independent of $|g_i|$. So

$$\mathbb{E}_g\left\|\sum_i g_i x_i\right\|_{\mathcal{F}} \;=\; \mathbb{E}_{|g|}\,\mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i |g_i|\,x_i\right\|_{\mathcal{F}}.$$

Here we need the Kahane contraction principle,

$$\mathbb{E}_\varepsilon\left\|\sum_i \alpha_i \varepsilon_i x_i\right\|_{\mathcal{F}} \;\le\; \mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i x_i\right\|_{\mathcal{F}},$$

where $|\alpha_i| \le 1$ for all $i$. To see why this is true, we only need to note that the function

$$\Phi(\alpha_1,\dots,\alpha_n) \;=\; \mathbb{E}_\varepsilon\left\|\sum_i \alpha_i \varepsilon_i x_i\right\|_{\mathcal{F}}$$

is convex over the unit cube $[-1,1]^n$ for any fixed $x_1,\dots,x_n$. Therefore the function attains its maximum at extreme points; that is, at some $\alpha \in \{-1,+1\}^n$. And since the $\varepsilon_i$'s are symmetric and independent, the $\alpha_i\varepsilon_i$'s and the $\varepsilon_i$'s have the same joint distribution for any such $\alpha$, so the maximum equals $\Phi(1,\dots,1)$. So by the Kahane contraction principle, applied conditionally on $|g_1|,\dots,|g_n|$ with $\alpha_i = |g_i|/\max_j|g_j|$, we have

$$\mathbb{E}_g\left\|\sum_i g_i x_i\right\|_{\mathcal{F}}
\;=\; \mathbb{E}_{|g|}\,\mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i |g_i|\,x_i\right\|_{\mathcal{F}}
\;\le\; \mathbb{E}\Big[\max_i |g_i|\Big]\,\mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i x_i\right\|_{\mathcal{F}}
\;\le\; \sqrt{2\log(2n)}\;\mathbb{E}_\varepsilon\left\|\sum_i \varepsilon_i x_i\right\|_{\mathcal{F}}.$$

The last inequality holds by upper bounding the expected $\ell^\infty$ norm of the Gaussian vector $(g_1,\dots,g_n)$: $\mathbb{E}\max_{i\le n}|g_i| \le \sqrt{2\log(2n)}$.
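
As a quick numerical sanity check of the two-sided bound above, here is a sketch assuming the same hypothetical linear class as in the earlier snippet, where the supremum has a closed form:

```python
import numpy as np

# Check  sqrt(2/pi) * R_n <= G_n <= sqrt(2 log(2n)) * R_n  by Monte Carlo,
# for the toy class {z -> <w, z> : ||w||_2 <= 1}, whose supremum is
# (1/n) * || sum_i s_i z_i ||_2 in closed form.
rng = np.random.default_rng(0)
n, d, trials = 200, 5, 5000
Z = rng.standard_normal((n, d))

eps = rng.choice([-1.0, 1.0], size=(trials, n))    # Rademacher draws
g = rng.standard_normal((trials, n))               # Gaussian draws

R = np.linalg.norm(eps @ Z, axis=1).mean() / n     # estimated Rademacher complexity
G = np.linalg.norm(g @ Z, axis=1).mean() / n       # estimated Gaussian complexity

print(np.sqrt(2 / np.pi) * R, "<=", G, "<=", np.sqrt(2 * np.log(2 * n)) * R)
```

For this particular class the two estimates come out close to each other, so the $\sqrt{\log n}$ factor on the right is far from tight; it matters for classes where the two complexities genuinely differ.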

Different Variants of $\mathcal{R}_n(\mathcal{F})$

There are several different definitions of Rademacher and Gaussian complexity in the literature. A very common variant takes the definition

$$\mathcal{R}_n^{|\cdot|}(\mathcal{F}) \;:=\; \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(z_i)\right|,$$

with the absolute value inside the supremum. If $\mathcal{F}$ is symmetric, that is, $f \in \mathcal{F}$ implies $-f \in \mathcal{F}$, then $\mathcal{R}_n^{|\cdot|}(\mathcal{F}) = \mathcal{R}_n(\mathcal{F})$. Otherwise, $\mathcal{R}_n(\mathcal{F}) \le \mathcal{R}_n^{|\cdot|}(\mathcal{F})$, and $\mathcal{R}_n^{|\cdot|}(\mathcal{F}) \le 2\,\mathcal{R}_n(\mathcal{F}\cup\{0\})$. Therefore, these two definitions are essentially equivalent. Our definition, however, enjoys a better contraction property when the hypothesis class is composed with a Lipschitz loss function.
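
To see the difference between the two conventions numerically, here is a small sketch using a deliberately non-symmetric finite class, introduced purely for illustration, comparing the definition without the absolute value to the variant with it:

```python
import numpy as np

# Compare  E sup_f (1/n) sum_i eps_i f(z_i)      (definition used in this note)
# with     E sup_f |(1/n) sum_i eps_i f(z_i)|    (the common variant),
# for a small non-symmetric finite class given by its evaluation matrix
# F of shape (num_functions, n), where F[k, i] = f_k(z_i).
rng = np.random.default_rng(0)
n, trials = 50, 5000
F = np.vstack([np.ones(n), np.linspace(0.0, 1.0, n)])  # two functions, not closed under negation

eps = rng.choice([-1.0, 1.0], size=(trials, n))
corr = eps @ F.T / n                     # (1/n) sum_i eps_i f_k(z_i) for each draw and each f_k

R_plain = corr.max(axis=1).mean()        # supremum without the absolute value
R_abs = np.abs(corr).max(axis=1).mean()  # supremum of absolute values (the variant)

print(R_plain, "<=", R_abs)
```

Replacing `F` by the symmetrized class `np.vstack([F, -F])` makes the two numbers coincide, matching the discussion above.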

Another possible variant of the definition is the higher-moment version

$$\left(\mathbb{E}_\varepsilon\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(z_i)\right]^p\right)^{1/p}$$

for $p > 1$. By Jensen's inequality, this is never smaller than $\mathcal{R}_n(\mathcal{F})$, and it is not needed whenever $\mathcal{R}_n(\mathcal{F})$ gets controlled: for bounded classes, the bounded difference concentration already controls the higher moments.

Reference


  1. Ledoux, M., & Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Berlin Heidelberg.