Universal Approximation Property of RKHS and Random Features (3)
We have seen the universal approximation property of RKHSs generated by radial kernels, and of one-hidden-layer neural networks with sigmoidal activation functions, in previous notes. However, those results only confirm that the function classes considered can approximate continuous functions defined on a compact space, and hence essentially all functions of interest, as well as we want. They do not tell us how many samples we need, in the case of the kernel method, or how many nodes we need, in the case of neural nets. To understand this, we need a more quantitative characterization of the approximation property.
(Barron, 1993) provides us with such a characterization. It says that for any function $f$ in a class $\Gamma_C$, we can find a one-hidden-layer neural network $f_n$ with $n$ sigmoidal nodes such that their distance in $L^2(\mu)$, for any probability measure $\mu$ on the unit ball $B \subset \mathbb{R}^d$, is less than $2C/\sqrt{n}$. The class $\Gamma_C$ is defined using the first moment of the frequency distribution of $f$,
$$\Gamma_C = \left\{ f : f(x) = f(0) + \int_{\mathbb{R}^d} \bigl(e^{i\langle \omega, x\rangle} - 1\bigr)\, \tilde{f}(d\omega),\quad \int_{\mathbb{R}^d} \|\omega\|\, |\tilde{f}|(d\omega) \le C \right\},$$
where $\tilde{f}$ is a complex measure on $\mathbb{R}^d$ and $|\tilde{f}|$ denotes its total-variation (magnitude) measure. It is easy to see that when the first moment condition is satisfied, the integral is well defined, since $|e^{i\langle\omega, x\rangle} - 1| \le \|\omega\|\,\|x\|$.
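To make the first moment condition concrete, here is a minimal numerical sketch (my own illustration, not from the paper) that computes the constant $\int \|\omega\|\,|\tilde{f}|(d\omega)$ for a one-dimensional Gaussian bump, whose Fourier density has a closed form; the discretization and the choice of example are arbitrary.

```python
import numpy as np

# Illustrative check of the first-moment (Barron) constant for f(x) = exp(-x^2/2) in 1-d.
# With the convention f(x) = \int e^{i w x} f_tilde(w) dw, the Fourier density is
# f_tilde(w) = exp(-w^2/2) / sqrt(2*pi), so the first moment equals sqrt(2/pi).
w = np.linspace(-20.0, 20.0, 200001)
dw = w[1] - w[0]
f_tilde = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
C_f = np.sum(np.abs(w) * f_tilde) * dw     # Riemann-sum approximation of the first moment
print(C_f, np.sqrt(2 / np.pi))             # both approximately 0.7979
```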
The flavor of the result and its proof is very similar to that of the following theorem, which is used in (Rahimi & Recht, 2009) to prove the approximation property of the random features method.
Theorem. For i.i.d. random variables $X_1, \ldots, X_n$ taking values in the unit ball of a Hilbert space, with probability greater than $1 - \delta$, we have
$$\left\| \frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \right\| \le \frac{1}{\sqrt{n}}\left(1 + \sqrt{2\log\frac{1}{\delta}}\right).$$
This is a general high-probability result, but we can easily get an existence result by sending $\delta$ to $1$ and end up with $\frac{1}{\sqrt{n}}$ as the upper bound. One can expect that the Hilbert space in the theorem is actually the $L^2(\mu)$ space with respect to some probability measure $\mu$ over the compact space $B$ in Barron's result. To fully understand the relation between this theorem and Barron's result, we only need to figure out how to construct the $X_i$'s from the sigmoidal functions and what the constraint on their norm is, that is, the radius of the ball containing them.
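As a sanity check of the $1/\sqrt{n}$ rate, the following sketch (again just an illustration, not from Rahimi & Recht) averages i.i.d. random elements of the unit ball of $L^2([0,1])$, here random cosine features with an arbitrarily chosen frequency and phase distribution, and verifies that the distance to the mean element scales like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 513)              # discretize L^2([0,1], dx)

def empirical_average(n):
    # average of n i.i.d. features x -> cos(w x + b); each has L^2 norm <= 1
    ws = rng.normal(0.0, 5.0, size=(n, 1))
    bs = rng.uniform(0.0, 2 * np.pi, size=(n, 1))
    return np.cos(ws * grid[None, :] + bs).mean(axis=0)

def l2_norm(g):
    return np.sqrt(np.mean(g ** 2))            # approx (\int_0^1 g^2 dx)^{1/2}

# with b uniform on [0, 2*pi], E[cos(w x + b)] = 0, so the error is just the norm
for n in [10, 100, 1000, 10000]:
    errs = [l2_norm(empirical_average(n)) for _ in range(20)]
    print(n, np.mean(errs) * np.sqrt(n))       # roughly constant => O(1/sqrt(n)) error
```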
(Barron, 1993) shows that the functions in $\Gamma_C$ (after subtracting the constant $f(0)$) are in the closure of the convex hull of the set of functions
$$G_\phi = \left\{ \gamma\, \phi(\langle \omega, x\rangle + b) : |\gamma| \le 2C,\ \omega \in \mathbb{R}^d,\ b \in \mathbb{R} \right\},$$
where $\phi$ is the sigmoidal activation function and the closure is taken in $L^2(\mu)$.
The proof runs as follows. First, by the Fourier form of the function, we know that for $x \in B$,
$$f(x) - f(0) = \int \bigl(e^{i\langle\omega, x\rangle} - 1\bigr)\,\tilde{f}(d\omega) = \int \bigl(\cos(\langle\omega, x\rangle + \theta(\omega)) - \cos(\theta(\omega))\bigr)\,|\tilde{f}|(d\omega),$$
where $\theta(\omega)$ is the phase of $\tilde{f}$, and thus $\|\omega\|\,|\tilde{f}|(d\omega)/C_f$, with $C_f = \int \|\omega\|\,|\tilde{f}|(d\omega) \le C$, is a probability measure, denoted by $\Lambda$. Rewriting the integral as $f(x) - f(0) = \int \frac{C_f}{\|\omega\|}\bigl(\cos(\langle\omega, x\rangle + \theta(\omega)) - \cos(\theta(\omega))\bigr)\,\Lambda(d\omega)$, we get that $f - f(0)$ belongs to the closure of the convex combinations of the set of functions
$$G_{\cos} = \left\{ \frac{\gamma}{\|\omega\|}\bigl(\cos(\langle\omega, x\rangle + b) - \cos(b)\bigr) : |\gamma| \le C,\ \omega \ne 0,\ b \in \mathbb{R} \right\}.$$
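For intuition, here is a small numerical check (illustrative only, reusing the Gaussian example from above) of the cosine representation $f(x) - f(0) = \int \bigl(\cos(\langle\omega, x\rangle + \theta(\omega)) - \cos(\theta(\omega))\bigr)\,|\tilde{f}|(d\omega)$; for the even, real Gaussian the phase $\theta$ is identically zero.

```python
import numpy as np

# For f(x) = exp(-x^2/2) in 1-d, f_tilde(w) = exp(-w^2/2)/sqrt(2*pi) is real and positive,
# so theta(w) = 0 and f(x) - f(0) should equal \int (cos(w x) - 1) f_tilde(w) dw.
w = np.linspace(-20.0, 20.0, 200001)
dw = w[1] - w[0]
f_tilde = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
for x in [0.3, 0.7, 1.0]:
    lhs = np.exp(-x**2 / 2) - 1.0
    rhs = np.sum((np.cos(w * x) - 1.0) * f_tilde) * dw
    print(x, lhs, rhs)                         # the two values should agree closely
```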
Then, we further decompose each function in $G_{\cos}$, viewed as a function of the scalar $z = \langle\omega, x\rangle/\|\omega\| \in [-1, 1]$ for $x \in B$, into two functions,
$$g(z) = \frac{\gamma}{\|\omega\|}\bigl(\cos(\|\omega\| z + b) - \cos(b)\bigr) = g(z)\,\mathbb{1}_{\{z \ge 0\}} + g(z)\,\mathbb{1}_{\{z < 0\}} =: g_+(z) + g_-(z).$$
By the definition, $g_+(0) = g_-(0) = 0$. We can approximate $g_+$ and $g_-$ using step functions uniformly over $[0, 1]$ and $[-1, 0]$, respectively. In particular, note that $g$ is $|\gamma|$-Lipschitz, so the total variation of each of $g_+$ and $g_-$ is always bounded by $|\gamma| \le C$ over its half interval. And since $g_\pm(0) = 0$, for any combination of step functions $h = \sum_i \gamma_i\, \mathbb{1}_{\{\pm z > t_i\}}$ used to approximate $g_\pm$, we have $\sum_i |\gamma_i| \le C$. So we get that $f - f(0)$ belongs to the closure of the convex combinations of the set of functions
$$G_{\mathrm{step}} = \left\{ \gamma\, \mathbb{1}_{\{\langle\omega, x\rangle/\|\omega\| > t\}} : |\gamma| \le 2C,\ t \in [-1, 1],\ \omega \ne 0 \right\}.$$
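The following sketch (an illustration under the notation above, with arbitrary values for the constants) builds the step-function approximation on the half interval $[0, 1]$ and checks both the uniform error and the bound on the sum of absolute coefficients.

```python
import numpy as np

# Approximate g(z) = (C/|w|)(cos(|w| z + b) - cos(b)) on [0, 1] by steps 1_{z > t_i};
# g is C-Lipschitz with g(0) = 0, so the coefficient sum is at most its total variation <= C.
C, w, b = 1.0, 6.0, 0.3
g = lambda z: (C / w) * (np.cos(w * z + b) - np.cos(b))

k = 200
t = np.linspace(0.0, 1.0, k + 1)               # partition 0 = t_0 < ... < t_k = 1
coeffs = np.diff(g(t))                         # gamma_i = g(t_{i+1}) - g(t_i)

z = np.linspace(0.0, 1.0, 5001)
h = np.sum(coeffs[:, None] * (z[None, :] > t[:-1, None]), axis=0)

print(np.max(np.abs(h - g(z))))                # uniform error <= C / k
print(np.sum(np.abs(coeffs)))                  # <= total variation of g on [0, 1] <= C
```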
Now we want to close the argument by showing that the step functions in $G_{\mathrm{step}}$ can be approximated by elements of $G_\phi$. To see this we only need to consider the sequence $\phi\bigl(a_k(\langle\omega, x\rangle - t\|\omega\|)\bigr)$, where $a_k \to \infty$. Then
$$\phi\bigl(a_k(\langle\omega, x\rangle - t\|\omega\|)\bigr) \to \mathbb{1}_{\{\langle\omega, x\rangle/\|\omega\| > t\}}(x)$$
pointwise, except possibly on the hyperplane $\{\langle\omega, x\rangle = t\|\omega\|\}$, which we may assume has $\mu$-measure zero by choosing the step locations $t$ away from the at most countably many values charged by $\mu$. Then by the dominated convergence theorem (the sequence is bounded by $1$), this convergence also holds in $L^2(\mu)$. Therefore, we finally show that $f - f(0)$ belongs to the closure of the convex hull of $G_\phi$.
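To see this last step numerically, the sketch below (illustrative, with the logistic sigmoid and the uniform measure on $[-1, 1]$ standing in for $\mu$) scales the sigmoid and watches the $L^2$ distance to the step function shrink.

```python
import numpy as np

# phi(a (z - t)) -> 1_{z > t} as a -> infinity; check the L^2([-1,1], dz) distance.
phi = lambda u: 0.5 * (1.0 + np.tanh(u / 2.0))  # the logistic sigmoid, written via tanh
z = np.linspace(-1.0, 1.0, 20001)
t = 0.25
step = (z > t).astype(float)
for a in [1, 10, 100, 1000, 10000]:
    err = np.sqrt(np.mean((phi(a * (z - t)) - step) ** 2) * 2.0)  # interval length 2
    print(a, err)                               # decreases towards 0
```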
References
- Barron, A. R. (1993). Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Transactions on Information Theory, 39, 930–945.
- Rahimi, A., & Recht, B. (2009). Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 1313–1320). Curran Associates, Inc.