Universal Approximation Property of RKHS and Random Features (2)

3 minute read

Universal Approximation Property of RKHS

In this note, we discuss the universal approximation property of RKHS and compare it with the corresponding property of neural networks. The material is mainly based on (Micchelli, Xu, & Zhang, 2006) and (Cybenko, 1989).

The universal approximation property says that the hypothesis class accessible to the learning model is dense in some common function class. For a continuous kernel, the hypothesis class is a subset of continuous functions, and thus we naturally ask whether it can approximate $C(\mathcal{X})$, where $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$, under the sup norm. In functional analysis, to prove that the subspace spanned by a subset is dense in a space, we only need to show that its only annihilator is $0$, which is a consequence of the Hahn-Banach theorem.
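Concretely, combining the Hahn-Banach theorem with the Riesz representation theorem $C(\mathcal{X})^* \cong M(\mathcal{X})$, the density criterion used throughout this note reads: for a subspace $V \subseteq C(\mathcal{X})$,

$$\overline{V} = C(\mathcal{X}) \quad\Longleftrightarrow\quad \Big[\, \mu \in M(\mathcal{X}) \ \text{ and } \ \int_{\mathcal{X}} f \, d\mu = 0 \ \text{ for all } f \in V \;\Longrightarrow\; \mu = 0 \,\Big].$$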

The dual space of $C(\mathcal{X})$ consists of all the complex-valued Radon measures on $\mathcal{X}$, denoted by $M(\mathcal{X})$. For an RKHS generated by a continuous kernel $k$, assume that $W$ is its feature space and $\Phi : \mathcal{X} \to W$ its feature map. We can define a map $U$ from $M(\mathcal{X})$ to $W$ by sending $\mu$ to the corresponding element of $W$ such that

$$U\mu := \int_{\mathcal{X}} \Phi(x)\, d\mu(x).$$

Note that $\int_{\mathcal{X}} k(\cdot, x)\, d\mu(x)$ is a continuous function in the RKHS of $k$, so the map $U$ is well-defined. Since $\Phi$ is continuous on the compact set $\mathcal{X}$, the switch between integral and inner product is justified, and it follows from this definition immediately that

$$\langle U\mu, \Phi(y)\rangle_W = \int_{\mathcal{X}} k(y, x)\, d\mu(x), \qquad y \in \mathcal{X}.$$

For each kernel, its RKHS is unique, but its feature spaces are not. That lends us more tools to study the annihilator of the RKHS.
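To spell out why the annihilator question reduces to the kernel of $U$ (a short derivation, ignoring complex conjugations for readability): every function in the hypothesis class has the form $f(x) = \langle w, \Phi(x)\rangle_W$ for some $w \in W$, so

$$\int_{\mathcal{X}} f\, d\mu = \int_{\mathcal{X}} \langle w, \Phi(x)\rangle_W\, d\mu(x) = \langle w, U\mu\rangle_W ,$$

and therefore $\mu$ annihilates the whole hypothesis class if and only if $U\mu = 0$. Universality of $k$ is thus equivalent to injectivity of $U$.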

For example, when $k(x, y) = \kappa(x - y)$ is a translation-invariant kernel, we can apply Bochner's theorem to construct its feature space as $L^2(\mathbb{R}^d, \Lambda)$, where $\Lambda$ is a finite Borel measure (the spectral measure of $\kappa$), and the feature map is $\Phi(x) = e^{-i\langle x, \cdot\rangle}$. So the question is converted to: if

$$\int_{\mathcal{X}} e^{-i\langle x, \omega\rangle}\, d\mu(x) = 0 \quad \text{for } \Lambda\text{-almost every } \omega, \tag{*}$$

does $\mu$ have to be $0$? Obviously, this question is related to the support of $\Lambda$. If the support of $\Lambda$ is all of $\mathbb{R}^d$, then (*) implies that $\mu$ has to be the zero measure, since the Fourier transform of a finite measure is continuous. This is the case for the Gaussian kernel. However, if the support of $\Lambda$ is just a single point, (*) does not imply that $\mu = 0$. This question is related to the notion of sets of uniqueness in harmonic analysis.
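As a quick numerical illustration of this feature-space picture (my own sketch, not from the references above): for the Gaussian kernel $e^{-\gamma\|x - y\|^2}$ the spectral measure $\Lambda$ is the Gaussian $N(0, 2\gamma I_d)$, and sampling frequencies from it gives the usual random Fourier features. The choices of $\gamma$, the number of features $m$, and the test points below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5000        # input dimension, number of random features (illustrative)
gamma = 0.5           # Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)

# Bochner: the spectral measure of this kernel is N(0, 2*gamma*I_d).
omega = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
b = rng.uniform(0.0, 2 * np.pi, size=m)

def phi(x):
    """Real-valued random Fourier features: phi(x) @ phi(y) approximates k(x, y)."""
    return np.sqrt(2.0 / m) * np.cos(omega @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print("exact kernel :", np.exp(-gamma * np.sum((x - y) ** 2)))
print("RFF estimate :", phi(x) @ phi(y))
```

The two printed numbers agree up to Monte Carlo error of order $1/\sqrt{m}$, which is exactly the sense in which $L^2(\Lambda)$ serves as a feature space for the kernel.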

For a radial basis kernel of the form $k(x, y) = \phi(\|x - y\|^2)$, it can always be written in the form

$$k(x, y) = \int_{[0, \infty)} e^{-t\|x - y\|^2}\, d\nu(t),$$

where $\nu$ is a finite Borel measure on $[0, \infty)$. Using the multinomial Taylor expansion of $e^{2t\langle x, y\rangle}$, we can construct the feature map

$$\Phi(x) = \left( e^{-t\|x\|^2}\, \sqrt{\frac{(2t)^{|\alpha|}}{\alpha!}}\; x^{\alpha} \right)_{t \in \operatorname{supp}\nu,\ \alpha \in \mathbb{N}^d},$$

with inner product defined by

$$\langle \Phi(x), \Phi(y) \rangle = \int_{[0, \infty)} \sum_{\alpha \in \mathbb{N}^d} e^{-t(\|x\|^2 + \|y\|^2)}\, \frac{(2t)^{|\alpha|}}{\alpha!}\; x^{\alpha} y^{\alpha}\, d\nu(t) = k(x, y).$$

For such a feature map, whenever the support of $\nu$ contains any positive number $t_0$, the map $U$ is injective. Indeed, the monomials $x^{\alpha}$ over all multi-indices $\alpha$ span all the polynomials over $\mathcal{X}$, which are dense in $C(\mathcal{X})$; since the factor $e^{-t_0\|x\|^2}$ is strictly positive, $U\mu = 0$ forces $\int_{\mathcal{X}} x^{\alpha} e^{-t_0\|x\|^2}\, d\mu(x) = 0$ for all $\alpha$, and hence $\mu$ must be $0$.
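As a concrete instance, taking $\nu = \delta_{\gamma}$ for a single $\gamma > 0$ recovers the Gaussian kernel, and the feature map becomes an explicit countable family indexed by multi-indices:

$$e^{-\gamma\|x - y\|^2} = e^{-\gamma\|x\|^2}\, e^{-\gamma\|y\|^2} \sum_{\alpha \in \mathbb{N}^d} \frac{(2\gamma)^{|\alpha|}}{\alpha!}\; x^{\alpha} y^{\alpha}, \qquad \Phi_{\alpha}(x) = e^{-\gamma\|x\|^2}\, \sqrt{\frac{(2\gamma)^{|\alpha|}}{\alpha!}}\; x^{\alpha},$$

which makes visible that the Gaussian feature components are weighted monomials, exactly the objects used in the injectivity argument above.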

Universal Approximation Property of Neural Networks

Now we consider the universal approximation property of neural networks. A one-hidden-layer neural network has the form

$$f(x) = \sum_{j=1}^{N_1} w_{2,1,j}\; \sigma\!\left( \sum_{k=1}^{d} w_{1,j,k}\, x_k + b_{1,j} \right).$$

The first subscript of the coefficients indicates the layer, the second indicates the node, and the third the coordinate. The parameters within layer $l$ form a matrix of shape $N_l \times N_{l-1}$, where $N_l$ represents the number of nodes in layer $l$. We denote the linear span of the set $\{\sigma(w^\top x + b) : w \in \mathbb{R}^d,\ b \in \mathbb{R}\}$ by $\mathcal{N}_\sigma$. To show that it is dense in $C(\mathcal{X})$, we also consider its annihilator. If a signed/complex measure $\mu$ on $\mathcal{X}$ satisfies

$$\int_{\mathcal{X}} \sigma(w^\top x + b)\, d\mu(x) = 0$$

for all $(w, b) \in \mathbb{R}^d \times \mathbb{R}$, we want to show that it must be the zero measure. Here we can see that $x \mapsto \sigma(w^\top x + b)$ actually plays the role of the canonical feature map in the kernel method. To show that $\mu$ has to be $0$, we note that for any $\lambda > 0$,

$$\int_{\mathcal{X}} \sigma\big(\lambda(w^\top x + b)\big)\, d\mu(x) = 0.$$

If $\sigma$ is bounded and sigmoidal, that is, $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$, then by the dominated convergence theorem, we can show that

$$0 = \lim_{\lambda \to \infty} \int_{\mathcal{X}} \sigma\big(\lambda(w^\top x + b)\big)\, d\mu(x) = \mu\big(\{x : w^\top x + b > 0\}\big) + \sigma(0)\, \mu\big(\{x : w^\top x + b = 0\}\big).$$

Therefore $\mu$ vanishes on every open half-space and every hyperplane, for all $(w, b)$ (see the shift argument below). This implies that $\mu$ must be $0$.
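The shift argument, following Cybenko (1989): for any $\varphi \in \mathbb{R}$, dominated convergence gives

$$0 = \lim_{\lambda \to \infty} \int_{\mathcal{X}} \sigma\big(\lambda(w^\top x + b) + \varphi\big)\, d\mu(x) = \mu\big(\{x : w^\top x + b > 0\}\big) + \sigma(\varphi)\, \mu\big(\{x : w^\top x + b = 0\}\big),$$

and letting $\varphi \to -\infty$ and then $\varphi \to +\infty$ separates the two terms, so $\mu$ assigns measure $0$ to every open half-space and every hyperplane. Cybenko then concludes that such a $\mu$ must be the zero measure via a Fourier-transform argument.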

Actually, if we assign a probability distribution $\pi$ over the parameter space $\mathbb{R}^d \times \mathbb{R}$, the activation $\sigma$ determines a feature map $\Phi(x) = \sigma(w^\top x + b)$, viewed as a function of $(w, b)$ in $L^2(\pi)$. The corresponding kernel is defined by

$$k(x, y) = \int_{\mathbb{R}^d \times \mathbb{R}} \sigma(w^\top x + b)\, \sigma(w^\top y + b)\, d\pi(w, b).$$

For the RKHS generated by $k$ to be dense in $C(\mathcal{X})$, we need some requirements on $\pi$. An obvious sufficient condition is that the support of $\pi$ is the whole parameter space $\mathbb{R}^d \times \mathbb{R}$. In this case, the RKHS generated by $k$ can be written in the form

$$\mathcal{H}_k = \left\{ f(x) = \int_{\mathbb{R}^d \times \mathbb{R}} g(w, b)\, \sigma(w^\top x + b)\, d\pi(w, b) \;:\; g \in L^2(\pi) \right\}.$$

For any $f$ in this RKHS, we can always use the random-feature sum

$$\hat f(x) = \frac{1}{m} \sum_{j=1}^{m} g(w_j, b_j)\, \sigma(w_j^\top x + b_j), \qquad (w_j, b_j) \overset{\text{i.i.d.}}{\sim} \pi,$$

to approximate it. The quality of this approximation under the $L^2$ norm is discussed in Bach's work.
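To make the last display concrete, here is a minimal sketch; the distribution $\pi$ (a standard Gaussian) and the coefficient function $g$ are illustrative choices of mine, and the "exact" $f$ is itself only estimated with a much larger sample:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))   # sigmoid activation

def sample_params(m):
    """Draw (w_j, b_j) i.i.d. from pi (here: standard Gaussian, an illustrative choice)."""
    return rng.standard_normal((m, d)), rng.standard_normal(m)

def g(w, b):
    """Coefficient function g in L^2(pi) defining the target f (also illustrative)."""
    return np.sin(w.sum(axis=1)) + b

# "Ground truth" f(x) = E_pi[ g(w, b) * sigma(w^T x + b) ], estimated with many samples.
W_big, b_big = sample_params(200_000)
def f(x):
    return np.mean(g(W_big, b_big) * sigma(W_big @ x + b_big))

# Random-feature approximation with m samples, matching the display above.
m = 2_000
W, b = sample_params(m)
def f_hat(x):
    return np.mean(g(W, b) * sigma(W @ x + b))

x = rng.standard_normal(d)
print("f(x)     :", f(x))
print("f_hat(x) :", f_hat(x))
```

The gap between the two printed values shrinks at the usual Monte Carlo rate as $m$ grows, which is the random-feature approximation whose error rates are analyzed in Bach's work.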

References

  1. Micchelli, C. A., Xu, Y., & Zhang, H. (2006). Universal Kernels. Journal of Machine Learning Research, 7, 2651–2667.
  2. Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems, 2, 303–314.