Universal Approximation Property of RKHS and Random Features (2)
Universal Approximation Property of RKHS
In this note, we discuss the universal approximation property of RKHSs and compare it with the corresponding property of neural networks. The material is mainly based on (Micchelli, Xu, & Zhang, 2006) and (Cybenko, 1989).
The universal approximation property says that the hypothesis class accessible to the learning model is dense in some common function class. For a continuous kernel, the hypothesis class is a subset of the continuous functions, so we naturally ask whether it can approximate $C(\mathcal{X})$, where $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$, under the sup norm. In functional analysis, to prove that the subspace spanned by a subset is dense in a space, we only need to show that its only annihilator is $0$; this is a consequence of the Hahn-Banach theorem.
The dual space of $C(\mathcal{X})$ consists of all the complex-valued Radon measures on $\mathcal{X}$, denoted by $M(\mathcal{X})$. For an RKHS generated by a continuous kernel $K$, assume that $W$ is a feature space and $\Phi: \mathcal{X} \to W$ the corresponding feature map. We can define a map from $M(\mathcal{X})$ to $W$ that sends $\mu$ to the element $\Phi_\mu \in W$ such that

$$\langle \Phi_\mu, w \rangle_W = \int_{\mathcal{X}} \langle \Phi(x), w \rangle_W \, d\mu(x) \quad \text{for all } w \in W.$$
Note that for every $w \in W$, $x \mapsto \langle \Phi(x), w \rangle_W$ is a continuous function in the RKHS of $K$, so the integral above is finite and the map is well-defined. We can also denote the image of $\mu$ by the vector-valued integral $\Phi_\mu = \int_{\mathcal{X}} \Phi(x)\, d\mu(x)$, and the switch between integral and inner product is then justified. It follows from this definition immediately that $\langle \Phi_\mu, \Phi(y) \rangle_W = \int_{\mathcal{X}} K(x, y)\, d\mu(x)$. For each kernel, its RKHS is unique, but its feature spaces are not. That lends us more tools to study the annihilator of the RKHS.
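To make explicit how $\Phi_\mu$ captures the annihilator, here is the short computation (a routine step, spelled out for completeness): every $f$ in the RKHS can be written as $f(x) = \langle \Phi(x), w_f \rangle_W$ for some $w_f \in W$, so

$$\int_{\mathcal{X}} f(x)\, d\mu(x) = \int_{\mathcal{X}} \langle \Phi(x), w_f \rangle_W \, d\mu(x) = \langle \Phi_\mu, w_f \rangle_W.$$

Hence $\mu$ annihilates the whole RKHS if and only if $\langle \Phi_\mu, w \rangle_W = 0$ for every $w \in W$, that is, if and only if $\Phi_\mu = 0$. Universality of the kernel therefore reduces to the injectivity of the map $\mu \mapsto \Phi_\mu$.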
For example, when $K(x, y) = k(x - y)$ is a translation invariant kernel, we can apply Bochner's theorem to construct its feature space as $L^2(\mathbb{R}^d, \Lambda)$, where $\Lambda$ is a finite Borel measure (the spectral measure of $k$), and the feature map is $\Phi(x)(\omega) = e^{i\langle x, \omega\rangle}$. So the question is converted to: if

$$\Phi_\mu(\omega) = \int_{\mathcal{X}} e^{i\langle x, \omega\rangle}\, d\mu(x) = 0 \quad \text{for } \Lambda\text{-almost every } \omega, \tag{*}$$
does $\mu$ have to be $0$? Obviously, this question is related to the support of $\Lambda$. If the support of $\Lambda$ is all of $\mathbb{R}^d$, then (*) implies that $\mu$ has to be the zero measure: the left-hand side of (*) is a continuous function of $\omega$ (it is the Fourier transform of $\mu$, up to the sign convention), so vanishing $\Lambda$-almost everywhere forces it to vanish everywhere, and the Fourier transform determines $\mu$. This is the case of the Gaussian kernel. However, if the support of $\Lambda$ is just a single point, (*) does not imply that $\mu = 0$. In general, this question is related to the notion of a set of uniqueness in harmonic analysis.
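As a small numerical illustration of the Bochner picture (a sketch only, assuming the Gaussian kernel $e^{-\|x - y\|^2 / 2}$, whose spectral measure $\Lambda$ is the standard Gaussian; the variable names below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 200_000                       # input dimension, number of spectral samples

# Bochner's theorem for the Gaussian kernel:
#   exp(-||x - y||^2 / 2) = E_{omega ~ N(0, I_d)} [ cos(<omega, x - y>) ],
# i.e. the feature map x -> exp(i <omega, x>) lives in L^2 of the spectral measure.
omega = rng.standard_normal((m, d))     # samples from the spectral measure Lambda

x = rng.standard_normal(d)
y = rng.standard_normal(d)

exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
monte_carlo = np.mean(np.cos(omega @ (x - y)))

print(exact, monte_carlo)               # the two values agree up to Monte Carlo error
```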
For a radial basis kernel of the form $K(x, y) = k(\|x - y\|^2)$, it can always be written in the form

$$K(x, y) = \int_{[0, \infty)} e^{-t \|x - y\|^2} \, d\nu(t),$$
where $\nu$ is a finite Borel measure on $[0, \infty)$ (this is Schoenberg's theorem). Using the multinomial Taylor expansion of $e^{2t\langle x, y\rangle}$, we can construct the feature map

$$\Phi(x) = \left( e^{-t\|x\|^2} \sqrt{\tfrac{(2t)^{|\alpha|}}{\alpha!}} \, x^\alpha \right)_{\alpha \in \mathbb{N}_0^d, \; t \in [0, \infty)},$$
with inner product defined by

$$\langle u, v \rangle = \sum_{\alpha} \int_{[0, \infty)} u_\alpha(t)\, v_\alpha(t) \, d\nu(t),$$

so that $\langle \Phi(x), \Phi(y) \rangle = \int_{[0, \infty)} e^{-t\|x - y\|^2} \, d\nu(t) = K(x, y)$.
For such a feature map, whenever the support of $\nu$ contains some positive number $t_0$, the map $\mu \mapsto \Phi_\mu$ is injective. Indeed, the components $e^{-t_0\|x\|^2} x^\alpha$ over all multi-indices $\alpha$ comprise, up to the common factor $e^{-t_0\|x\|^2}$, all the monomials, whose span consists of all the polynomials over $\mathcal{X}$, which is dense in $C(\mathcal{X})$ by the Stone-Weierstrass theorem; this implies that $\mu$ must be $0$ if $\Phi_\mu = 0$.
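To see the Taylor-expansion feature map at work, here is a small numerical check (a sketch with a single fixed $t$ and a finite truncation order; the kernel value is recovered only in the limit of the series):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)
t, order = 0.7, 30                         # fixed bandwidth t and truncation order
x, y = rng.standard_normal(4), rng.standard_normal(4)

# exp(-t ||x - y||^2) = exp(-t ||x||^2) * exp(-t ||y||^2) * sum_n (2t)^n <x, y>^n / n!
# The inner series is exactly what the multi-index features encode after the
# multinomial expansion of <x, y>^n.
prefactor = np.exp(-t * (x @ x)) * np.exp(-t * (y @ y))
series = sum((2 * t) ** n * (x @ y) ** n / factorial(n) for n in range(order + 1))

print(np.exp(-t * np.sum((x - y) ** 2)))   # kernel value at this single t
print(prefactor * series)                  # truncated feature-space inner product
```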
Universal Approximation Property of Neural Networks
Now we consider the universal approximation property of neural networks. A one-hidden-layer neural network has the form

$$f(x) = \sum_{j=1}^{N_1} a_{2,1,j} \, \sigma\!\left( \sum_{k=1}^{d} a_{1,j,k} \, x_k + b_{1,j} \right).$$
The first subscript of a coefficient indicates the layer, the second its node, and the third the coordinate. The parameters within layer $l$ form a matrix of shape $N_l \times N_{l-1}$, where $N_l$ represents the number of nodes in layer $l$. We denote the linear span of the set $\{\sigma(w^\top x + b) : w \in \mathbb{R}^d,\, b \in \mathbb{R}\}$ by $\mathcal{F}_\sigma$. To show that it is dense in $C(\mathcal{X})$, we again consider its annihilator. If a signed/complex measure $\mu$ on $\mathcal{X}$ satisfies

$$\int_{\mathcal{X}} \sigma(w^\top x + b)\, d\mu(x) = 0$$
for all $(w, b) \in \mathbb{R}^d \times \mathbb{R}$, we want to show that it must be the zero measure. Here we can see that $\sigma(w^\top x + b)$ actually plays the role of the canonical feature map in the kernel method. To show that $\mu$ has to be $0$, we note that for any $(w, b)$ and any $\lambda > 0$,

$$\int_{\mathcal{X}} \sigma\big(\lambda (w^\top x + b)\big)\, d\mu(x) = 0,$$

since $\sigma(\lambda(w^\top x + b)) = \sigma((\lambda w)^\top x + \lambda b)$ again belongs to $\mathcal{F}_\sigma$.
If $\sigma$ is bounded and sigmoidal, that is, $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$, then by the dominated convergence theorem, we can show that

$$0 = \lim_{\lambda \to \infty} \int_{\mathcal{X}} \sigma\big(\lambda (w^\top x + b)\big)\, d\mu(x) = \mu\big(\{x : w^\top x + b > 0\}\big) + \sigma(0)\, \mu\big(\{x : w^\top x + b = 0\}\big).$$
Therefore $\mu\big(\{x : w^\top x + b > 0\}\big) = 0$ for all $(w, b)$ (the boundary term can be eliminated by an extra shift inside $\sigma$, as in Cybenko's original argument). This implies that $\mu$ must be the zero measure: a finite signed/complex measure that vanishes on every half-space has vanishing Fourier transform, hence is $0$; this is the main lemma of (Cybenko, 1989).
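The dominated-convergence step can be visualized numerically: as the scaling grows, the scaled sigmoid saturates to the indicator of the half-space (a minimal sketch with an arbitrary choice of $w$, $b$, and test points):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w, b = np.array([1.0, -2.0]), 0.5
x_in  = np.array([2.0, 0.0])            # w @ x_in  + b =  2.5 > 0  (inside the half-space)
x_out = np.array([0.0, 2.0])            # w @ x_out + b = -3.5 < 0  (outside)

for lam in [1, 10, 100]:
    print(lam,
          sigmoid(lam * (w @ x_in + b)),    # -> 1 on {w^T x + b > 0}
          sigmoid(lam * (w @ x_out + b)))   # -> 0 on {w^T x + b < 0}
```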
Actually, if we assign a probability distribution $\pi$ over the parameter space $\mathbb{R}^d \times \mathbb{R}$ of $(w, b)$, $\sigma$ determines a feature map $\Phi(x): (w, b) \mapsto \sigma(w^\top x + b)$, viewed as an element of $L^2(\pi)$. The corresponding kernel is defined by

$$K(x, y) = \int_{\mathbb{R}^d \times \mathbb{R}} \sigma(w^\top x + b)\, \sigma(w^\top y + b)\, d\pi(w, b).$$
For the RKHS generated by $K$ to be dense in $C(\mathcal{X})$, we need some requirements on $\pi$. An obvious sufficient condition is that the support of $\pi$ is the whole parameter space $\mathbb{R}^d \times \mathbb{R}$: then $\Phi_\mu = 0$ in $L^2(\pi)$ forces $\int_{\mathcal{X}} \sigma(w^\top x + b)\, d\mu(x) = 0$ for all $(w, b)$, and the argument above gives $\mu = 0$. In this case, the RKHS generated by $K$ can be written in the form

$$\mathcal{H}_K = \left\{ f(x) = \int_{\mathbb{R}^d \times \mathbb{R}} g(w, b)\, \sigma(w^\top x + b)\, d\pi(w, b) \;:\; g \in L^2(\pi) \right\}.$$
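For concreteness, the norm attached to this integral representation (a standard fact about such feature expansions, stated here as a complement) is

$$\|f\|_{\mathcal{H}_K}^2 = \inf\left\{ \|g\|_{L^2(\pi)}^2 \;:\; f(x) = \int g(w, b)\, \sigma(w^\top x + b)\, d\pi(w, b) \ \text{for all } x \in \mathcal{X} \right\}.$$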
For any $f$ in this RKHS, written with some $g \in L^2(\pi)$ as above, we can always use the Monte Carlo estimate

$$\hat f(x) = \frac{1}{m} \sum_{j=1}^{m} g(w_j, b_j)\, \sigma(w_j^\top x + b_j), \qquad (w_j, b_j) \overset{\text{i.i.d.}}{\sim} \pi,$$
to approximate it. The quality of this approximation under the $L^2$ norm is discussed in Bach's work.
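A minimal sketch of this Monte Carlo approximation (assuming, for illustration only, that $\pi$ is a standard Gaussian on $(w, b)$ and that $g(w, b) = w_1$; any other $g \in L^2(\pi)$ would do):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_hat(x, m):
    """Random-feature estimate of f(x) = E_pi[ g(w, b) * sigma(w^T x + b) ]."""
    w = rng.standard_normal((m, d))     # (w_j, b_j) drawn i.i.d. from pi
    b = rng.standard_normal(m)
    g = w[:, 0]                         # illustrative choice of g in L^2(pi)
    return np.mean(g * sigmoid(w @ x + b))

x = rng.standard_normal(d)
for m in [100, 10_000, 1_000_000]:
    print(m, f_hat(x, m))               # estimates stabilize at the O(1/sqrt(m)) rate
```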
References
- Micchelli, C. A., Xu, Y., & Zhang, H. (2006). Universal Kernels. Journal of Machine Learning Research, 7, 2651–2667.
- Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems, 2, 303–314.