Capacity of Neural Networks (3): Main Results


In this note, we look at the Rademacher complexity of two-layer neural networks and compare it with the corresponding result for the kernel method. This is the main result of (Bartlett & Mendelson, 2003), which we rephrase as follows.

Theorem. Suppose that the activation $\sigma : \mathbb{R} \to \mathbb{R}$ has Lipschitz constant $L$. Define the class computed by a two-layer neural network with weight norm constraints as

$$\mathcal{F} = \Bigl\{ x \mapsto \sum_{j} w_j\, \sigma(\langle u_j, x \rangle) \;:\; \|w\|_1 \le B_1,\ \|u_j\|_2 \le B_2 \Bigr\}.$$

Then for $x_1, \dots, x_n \in \mathbb{R}^d$,

$$\hat{\mathcal{R}}_n(\mathcal{F}) \le \frac{2 L\, B_1 B_2\, C}{\sqrt{n}},$$

where $C = \max_{i} \|x_i\|_2$.
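As a quick numerical sanity check (not part of the original note), here is a minimal sketch in Python. It assumes a ReLU activation (so $L = 1$), $B_1 = B_2 = 1$, and Gaussian inputs; the helper `single_unit_sup` is hypothetical, and it approximates the inner supremum by projected gradient ascent, so the Monte Carlo estimate is only a lower estimate of the true complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: ReLU activation (Lipschitz constant L = 1),
# norm bounds B1 = B2 = 1, standard Gaussian inputs.
n, d = 200, 10
B1, B2 = 1.0, 1.0
X = rng.normal(size=(n, d))
C = np.linalg.norm(X, axis=1).max()           # C = max_i ||x_i||_2

def single_unit_sup(eps, steps=300, lr=0.5):
    """Approximate sup_{||u||_2 <= B2} |(1/n) sum_i eps_i relu(<u, x_i>)|
    by projected gradient ascent over +/- the objective (a lower estimate)."""
    best = 0.0
    for sign in (1.0, -1.0):
        u = rng.normal(size=d)
        u *= B2 / np.linalg.norm(u)
        for _ in range(steps):
            active = (X @ u) > 0                  # ReLU subgradient mask
            u = u + lr * sign * (X.T @ (eps * active)) / n
            norm = np.linalg.norm(u)
            if norm > B2:                         # project back onto the 2-ball
                u *= B2 / norm
        best = max(best, abs(np.mean(eps * np.maximum(X @ u, 0.0))))
    return best

# Monte Carlo over Rademacher signs; the 1-norm-constrained top layer puts
# all its mass on the best single hidden unit, so the class supremum equals
# B1 times the single-unit supremum.
estimates = [B1 * single_unit_sup(rng.choice([-1.0, 1.0], size=n))
             for _ in range(20)]

print("Monte Carlo estimate:", np.mean(estimates))
print("Bound 2*L*B1*B2*C/sqrt(n):", 2 * B1 * B2 * C / np.sqrt(n))
```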

Compared with the original statement, we drop the assumption $\sigma(0) = 0$, which is not needed in our concise form of the contraction inequality, and we change the constraint on the bottom-layer weights from the 1-norm to the 2-norm. This yields a cleaner bound with a simpler proof.

With the 2-norm constraint, the theorem follows directly from the contraction inequality, as in the sketch below.
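One standard way to run the argument is sketched below (assuming `amsmath`; constants and the exact form of the contraction step may differ slightly from the note's own version).

```latex
% Proof sketch under the 2-norm constraint.
% Step 1: the 1-norm sup over w puts all mass on a single hidden unit.
% Step 2: the contraction inequality removes the L-Lipschitz activation
%         (the factor 2 handles the absolute value).
% Step 3: Cauchy--Schwarz and Jensen bound the remaining linear class.
\begin{align*}
\hat{\mathcal{R}}_n(\mathcal{F})
 &= \mathbb{E}_{\epsilon}\sup_{\|w\|_1\le B_1,\;\|u_j\|_2\le B_2}
    \frac{1}{n}\sum_{i=1}^{n}\epsilon_i\sum_{j} w_j\,\sigma(\langle u_j,x_i\rangle)
  = B_1\,\mathbb{E}_{\epsilon}\sup_{\|u\|_2\le B_2}
    \Bigl|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\,\sigma(\langle u,x_i\rangle)\Bigr| \\
 &\le 2L\,B_1\,\mathbb{E}_{\epsilon}\sup_{\|u\|_2\le B_2}
    \frac{1}{n}\sum_{i=1}^{n}\epsilon_i\,\langle u,x_i\rangle
  \le \frac{2L\,B_1 B_2}{n}\,\mathbb{E}_{\epsilon}\Bigl\|\sum_{i=1}^{n}\epsilon_i x_i\Bigr\|_2
  \le \frac{2L\,B_1 B_2\,C}{\sqrt{n}}.
\end{align*}
```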

This bound is actually similar to the Rademacher complexity of the kernel method, where the bound $B$ on the RKHS norm of the predictor controls the complexity in place of the product $L\, B_1 B_2$; see the sketch below.
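For comparison, here is the standard derivation for the kernel class $\mathcal{F}_B = \{x \mapsto \langle w, \phi(x)\rangle_{\mathcal{H}} : \|w\|_{\mathcal{H}} \le B\}$ (the notation $\phi$, $k$, $\mathcal{F}_B$ is introduced here for the RKHS feature map and kernel, not taken from the original note).

```latex
% Rademacher complexity of the kernel class with RKHS-norm bound B:
% Cauchy--Schwarz in the RKHS, then Jensen and k(x,x) = ||phi(x)||^2.
\begin{align*}
\hat{\mathcal{R}}_n(\mathcal{F}_B)
 &= \mathbb{E}_{\epsilon}\sup_{\|w\|_{\mathcal{H}}\le B}
    \frac{1}{n}\sum_{i=1}^{n}\epsilon_i\,\langle w,\phi(x_i)\rangle_{\mathcal{H}}
  \le \frac{B}{n}\,\mathbb{E}_{\epsilon}\Bigl\|\sum_{i=1}^{n}\epsilon_i\,\phi(x_i)\Bigr\|_{\mathcal{H}} \\
 &\le \frac{B}{n}\sqrt{\sum_{i=1}^{n} k(x_i,x_i)}
  \le \frac{B\,\max_i\sqrt{k(x_i,x_i)}}{\sqrt{n}}.
\end{align*}
```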

In the original theorem, the bottom-layer weights are instead constrained in 1-norm, $\|u_j\|_1 \le B_2$. In that case one works with the Gaussian complexity: peeling off the top layer as before leaves the quantity $\mathbb{E}\sup_{\|u\|_1 \le B_2} \sum_{i=1}^{n} g_i\,\sigma(\langle u, x_i\rangle)$, where $g_1, \dots, g_n$ are i.i.d. standard Gaussian variables. To control this quantity, Sudakov-Fernique's inequality is used, which is a variant of Slepian's inequality.

Theorem. (Sudakov-Fernique's inequality) Let $(X_t)_{t\in T}$ and $(Y_t)_{t\in T}$ be two centered Gaussian processes. Assume that for all $s, t \in T$, we have

$$\mathbb{E}(X_s - X_t)^2 \le \mathbb{E}(Y_s - Y_t)^2.$$

Then,

$$\mathbb{E}\sup_{t\in T} X_t \le \mathbb{E}\sup_{t\in T} Y_t.$$

To use this theorem in the 1-norm constraint case, we set $X_u = \sum_{i=1}^{n} g_i\,\sigma(\langle u, x_i\rangle)$ and $Y_u = L \sum_{i=1}^{n} g_i\,\langle u, x_i\rangle$, indexed by $u$ in the $\ell_1$-ball $\{\|u\|_1 \le B_2\}$. Then, since $\sigma$ is $L$-Lipschitz,

$$\mathbb{E}(X_u - X_v)^2 = \sum_{i=1}^{n} \bigl(\sigma(\langle u, x_i\rangle) - \sigma(\langle v, x_i\rangle)\bigr)^2 \le L^2 \sum_{i=1}^{n} \langle u - v, x_i\rangle^2 = \mathbb{E}(Y_u - Y_v)^2.$$

And according to Sudakov-Fernique's inequality, we have

$$\mathbb{E}\sup_{\|u\|_1\le B_2} X_u \le \mathbb{E}\sup_{\|u\|_1\le B_2} Y_u.$$
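The right-hand side is a dual-norm computation followed by a standard Gaussian maximal inequality (a sketch, writing $x_{ij}$ for the $j$-th coordinate of $x_i$; the constants may differ from the original).

```latex
% Sup of the linear Gaussian process over the l1-ball: the dual norm of the
% l1-ball is the l-infinity norm, and a maximal inequality over the 2d signed
% coordinates (each Gaussian with variance sum_i x_{ij}^2) bounds it.
\begin{align*}
\mathbb{E}\sup_{\|u\|_1\le B_2} Y_u
 &= L\,B_2\;\mathbb{E}\Bigl\|\sum_{i=1}^{n} g_i\,x_i\Bigr\|_{\infty}
 \;\le\; L\,B_2\,\sqrt{2\log(2d)}\;\max_{1\le j\le d}
         \Bigl(\sum_{i=1}^{n} x_{ij}^2\Bigr)^{1/2}.
\end{align*}
```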

The Rademacher complexity can then be obtained from the fact that it is dominated by the Gaussian complexity $\hat{\mathcal{G}}_n(\mathcal{F}) = \mathbb{E}_g \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} g_i f(x_i)$ up to a constant: $\hat{\mathcal{R}}_n(\mathcal{F}) \le \sqrt{\pi/2}\;\hat{\mathcal{G}}_n(\mathcal{F})$.
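Putting the pieces together for the 1-norm case gives the following sketch (absolute constants unoptimized; the factor 2 again comes from handling the absolute value after peeling off the top layer).

```latex
% Assembling the 1-norm case: Gaussian-to-Rademacher conversion, top-layer
% peeling, Sudakov-Fernique, and the maximal inequality above.
\begin{align*}
\hat{\mathcal{R}}_n(\mathcal{F})
 \;\le\; \sqrt{\tfrac{\pi}{2}}\;\hat{\mathcal{G}}_n(\mathcal{F})
 \;\le\; \sqrt{\tfrac{\pi}{2}}\cdot\frac{2B_1}{n}\,
          \mathbb{E}\sup_{\|u\|_1\le B_2} X_u
 \;\le\; \sqrt{2\pi}\;\frac{L\,B_1 B_2\,\sqrt{2\log(2d)}}{n}\,
          \max_{1\le j\le d}\Bigl(\sum_{i=1}^{n} x_{ij}^2\Bigr)^{1/2}.
\end{align*}
```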

Reference

  1. Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. J. Mach. Learn. Res., 3, 463–482.