Universal Approximation Property of RKHS and Random Features (3)
We have seen the universal approximation property of RKHSs generated by radial kernels, and of one-hidden-layer neural networks with sigmoidal activation functions, in previous notes. However, those results only confirm that the function classes considered can approximate continuous functions defined on a compact space, and hence essentially all functions of interest, as well as we want. They do not tell us how many samples we need, in the case of the kernel method, or how many nodes we need, in the case of neural nets. To understand this, we need a more quantitative characterization of the approximation property.
(Barron, 1993) provides us with such a characterization. It says that for any function $f$ in a class $\Gamma_C$, we can find a one-hidden-layer neural network $f_n$ with $n$ sigmoidal nodes such that their distance in $L^2(\mu)$, for any probability measure $\mu$ on the unit ball $B \subset \mathbb{R}^d$, is less than $2C/\sqrt{n}$. The class $\Gamma_C$ is defined using the first moment of the frequency distribution of $f$,
$$\Gamma_C = \left\{ f : f(x) = f(0) + \int_{\mathbb{R}^d} \bigl(e^{i\langle \omega, x\rangle} - 1\bigr)\, \tilde{f}(d\omega),\quad \int_{\mathbb{R}^d} \|\omega\|\, |\tilde{f}|(d\omega) \le C \right\},$$
where $\tilde{f}$ is a complex measure on $\mathbb{R}^d$ and $|\tilde{f}|$ denotes its total-variation (magnitude) measure. It is easy to see that when the first moment condition is satisfied, the integral is well defined, since $|e^{i\langle\omega, x\rangle} - 1| \le \|\omega\|\,\|x\|$.
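To make the first moment condition concrete, here is a minimal numerical sketch (my own illustration, not from the paper) that computes the constant $\int \|\omega\|\,|\tilde{f}|(d\omega)$ for a one-dimensional Gaussian bump, whose Fourier density has a closed form; the discretization and the choice of example are arbitrary.

```python
import numpy as np

# Illustrative check of the first-moment (Barron) constant for f(x) = exp(-x^2/2) in 1-d.
# With the convention f(x) = \int e^{i w x} f_tilde(w) dw, the Fourier density is
# f_tilde(w) = exp(-w^2/2) / sqrt(2*pi), so the first moment equals sqrt(2/pi).
w = np.linspace(-20.0, 20.0, 200001)
dw = w[1] - w[0]
f_tilde = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
C_f = np.sum(np.abs(w) * f_tilde) * dw     # Riemann-sum approximation of the first moment
print(C_f, np.sqrt(2 / np.pi))             # both approximately 0.7979
```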
The flavor of the result and its proof is very similar to that of the following theorem, which is used in (Rahimi & Recht, 2009) to prove the approximation property of the random features method.
Theorem. For i.i.d. random variables $X_1, \ldots, X_n$ taking values in the unit ball of a Hilbert space, with probability greater than $1 - \delta$, we have
$$\left\| \frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \right\| \le \frac{1}{\sqrt{n}}\left(1 + \sqrt{2\log\frac{1}{\delta}}\right).$$
This is a general high-probability result, but we can easily get an existence result by sending $\delta$ to $1$ and end up with $\frac{1}{\sqrt{n}}$ as the upper bound. One can expect that the Hilbert space in the theorem is actually the $L^2(\mu)$ space with respect to some probability measure $\mu$ over the compact space $B$ in Barron's result. To fully understand the relation between this theorem and Barron's result, we only need to figure out how to construct the $X_i$'s from the sigmoidal functions and what the constraint on their norm is, that is, the radius of the ball containing them.
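As a sanity check of the $1/\sqrt{n}$ rate, the following sketch (again just an illustration, not from Rahimi & Recht) averages i.i.d. random elements of the unit ball of $L^2([0,1])$, here random cosine features with an arbitrarily chosen frequency and phase distribution, and verifies that the distance to the mean element scales like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 513)              # discretize L^2([0,1], dx)

def empirical_average(n):
    # average of n i.i.d. features x -> cos(w x + b); each has L^2 norm <= 1
    ws = rng.normal(0.0, 5.0, size=(n, 1))
    bs = rng.uniform(0.0, 2 * np.pi, size=(n, 1))
    return np.cos(ws * grid[None, :] + bs).mean(axis=0)

def l2_norm(g):
    return np.sqrt(np.mean(g ** 2))            # approx (\int_0^1 g^2 dx)^{1/2}

# with b uniform on [0, 2*pi], E[cos(w x + b)] = 0, so the error is just the norm
for n in [10, 100, 1000, 10000]:
    errs = [l2_norm(empirical_average(n)) for _ in range(20)]
    print(n, np.mean(errs) * np.sqrt(n))       # roughly constant => O(1/sqrt(n)) error
```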
(Barron, 1993) shows that the functions in $\Gamma_C$ (after subtracting the constant $f(0)$) are in the closure of the convex hull of the set of functions
$$G_\phi = \left\{ \gamma\, \phi(\langle \omega, x\rangle + b) : |\gamma| \le 2C,\ \omega \in \mathbb{R}^d,\ b \in \mathbb{R} \right\},$$
where $\phi$ is the sigmoidal activation function and the closure is taken in $L^2(\mu)$.
The proof runs as follows. First, by the Fourier form of the function, we know that for $x \in B$,
$$f(x) - f(0) = \int \bigl(e^{i\langle\omega, x\rangle} - 1\bigr)\,\tilde{f}(d\omega) = \int \bigl(\cos(\langle\omega, x\rangle + \theta(\omega)) - \cos(\theta(\omega))\bigr)\,|\tilde{f}|(d\omega),$$
where $\theta(\omega)$ is the phase of $\tilde{f}$, and thus $\|\omega\|\,|\tilde{f}|(d\omega)/C_f$, with $C_f = \int \|\omega\|\,|\tilde{f}|(d\omega) \le C$, is a probability measure, denoted by $\Lambda$. Rewriting the integral as $f(x) - f(0) = \int \frac{C_f}{\|\omega\|}\bigl(\cos(\langle\omega, x\rangle + \theta(\omega)) - \cos(\theta(\omega))\bigr)\,\Lambda(d\omega)$, we get that $f - f(0)$ belongs to the closure of the convex combinations of the set of functions
$$G_{\cos} = \left\{ \frac{\gamma}{\|\omega\|}\bigl(\cos(\langle\omega, x\rangle + b) - \cos(b)\bigr) : |\gamma| \le C,\ \omega \ne 0,\ b \in \mathbb{R} \right\}.$$
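For intuition, here is a small numerical check (illustrative only, reusing the Gaussian example from above) of the cosine representation $f(x) - f(0) = \int \bigl(\cos(\langle\omega, x\rangle + \theta(\omega)) - \cos(\theta(\omega))\bigr)\,|\tilde{f}|(d\omega)$; for the even, real Gaussian the phase $\theta$ is identically zero.

```python
import numpy as np

# For f(x) = exp(-x^2/2) in 1-d, f_tilde(w) = exp(-w^2/2)/sqrt(2*pi) is real and positive,
# so theta(w) = 0 and f(x) - f(0) should equal \int (cos(w x) - 1) f_tilde(w) dw.
w = np.linspace(-20.0, 20.0, 200001)
dw = w[1] - w[0]
f_tilde = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
for x in [0.3, 0.7, 1.0]:
    lhs = np.exp(-x**2 / 2) - 1.0
    rhs = np.sum((np.cos(w * x) - 1.0) * f_tilde) * dw
    print(x, lhs, rhs)                         # the two values should agree closely
```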
Then, we further decompose each function in $G_{\cos}$, viewed as a function of the scalar $z = \langle\omega, x\rangle/\|\omega\| \in [-1, 1]$ for $x \in B$, into two functions,
$$g(z) = \frac{\gamma}{\|\omega\|}\bigl(\cos(\|\omega\| z + b) - \cos(b)\bigr) = g(z)\,\mathbb{1}_{\{z \ge 0\}} + g(z)\,\mathbb{1}_{\{z < 0\}} =: g_+(z) + g_-(z).$$
By the definition, $g_+(0) = g_-(0) = 0$. We can approximate $g_+$ and $g_-$ using step functions uniformly over $[0, 1]$ and $[-1, 0]$, respectively. In particular, note that $g$ is $|\gamma|$-Lipschitz, so the total variation of each of $g_+$ and $g_-$ is always bounded by $|\gamma| \le C$ over its half interval. And since $g_\pm(0) = 0$, for any combination of step functions $h = \sum_i \gamma_i\, \mathbb{1}_{\{\pm z > t_i\}}$ used to approximate $g_\pm$, we have $\sum_i |\gamma_i| \le C$. So we get that $f - f(0)$ belongs to the closure of the convex combinations of the set of functions
$$G_{\mathrm{step}} = \left\{ \gamma\, \mathbb{1}_{\{\langle\omega, x\rangle/\|\omega\| > t\}} : |\gamma| \le 2C,\ t \in [-1, 1],\ \omega \ne 0 \right\}.$$
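The following sketch (an illustration under the notation above, with arbitrary values for the constants) builds the step-function approximation on the half interval $[0, 1]$ and checks both the uniform error and the bound on the sum of absolute coefficients.

```python
import numpy as np

# Approximate g(z) = (C/|w|)(cos(|w| z + b) - cos(b)) on [0, 1] by steps 1_{z > t_i};
# g is C-Lipschitz with g(0) = 0, so the coefficient sum is at most its total variation <= C.
C, w, b = 1.0, 6.0, 0.3
g = lambda z: (C / w) * (np.cos(w * z + b) - np.cos(b))

k = 200
t = np.linspace(0.0, 1.0, k + 1)               # partition 0 = t_0 < ... < t_k = 1
coeffs = np.diff(g(t))                         # gamma_i = g(t_{i+1}) - g(t_i)

z = np.linspace(0.0, 1.0, 5001)
h = np.sum(coeffs[:, None] * (z[None, :] > t[:-1, None]), axis=0)

print(np.max(np.abs(h - g(z))))                # uniform error <= C / k
print(np.sum(np.abs(coeffs)))                  # <= total variation of g on [0, 1] <= C
```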
Now we want to close the argument by showing that the step functions in $G_{\mathrm{step}}$ can be approximated by elements of $G_\phi$. To see this we only need to consider the sequence $\phi\bigl(a_k(\langle\omega, x\rangle - t\|\omega\|)\bigr)$, where $a_k \to \infty$. Then
$$\phi\bigl(a_k(\langle\omega, x\rangle - t\|\omega\|)\bigr) \to \mathbb{1}_{\{\langle\omega, x\rangle/\|\omega\| > t\}}(x)$$
pointwise, except possibly on the hyperplane $\{\langle\omega, x\rangle = t\|\omega\|\}$, which we may assume has $\mu$-measure zero by choosing the step locations $t$ away from the at most countably many values charged by $\mu$. Then by the dominated convergence theorem (the sequence is bounded by $1$), this convergence also holds in $L^2(\mu)$. Therefore, we finally show that $f - f(0)$ belongs to the closure of the convex hull of $G_\phi$.
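To see this last step numerically, the sketch below (illustrative, with the logistic sigmoid and the uniform measure on $[-1, 1]$ standing in for $\mu$) scales the sigmoid and watches the $L^2$ distance to the step function shrink.

```python
import numpy as np

# phi(a (z - t)) -> 1_{z > t} as a -> infinity; check the L^2([-1,1], dz) distance.
phi = lambda u: 0.5 * (1.0 + np.tanh(u / 2.0))  # the logistic sigmoid, written via tanh
z = np.linspace(-1.0, 1.0, 20001)
t = 0.25
step = (z > t).astype(float)
for a in [1, 10, 100, 1000, 10000]:
    err = np.sqrt(np.mean((phi(a * (z - t)) - step) ** 2) * 2.0)  # interval length 2
    print(a, err)                               # decreases towards 0
```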
References
- Barron, A. R. (1993). Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Transactions on Information Theory, 39, 930–945.
- Rahimi, A., & Recht, B. (2009). Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 1313–1320). Curran Associates, Inc.