Capacity of Neural Networks (1): Rademacher Complexity
Current sample complexity analysis of supervised learning relies heavily on the capacity analysis of hypothesis classes. There are many different quantities characterizing capacity; among them, the most widely used are the VC-dimension and the Rademacher/Gaussian complexity. In this series of notes, we will first prove some general properties of the Rademacher/Gaussian complexity, then derive capacity results for neural networks, and finally discuss the relation between the VC-dimension and the Rademacher complexity.
Rademacher and Gaussian Complexity
In this note, we use $\epsilon_1, \dots, \epsilon_n$ for a list of independent Rademacher random variables (uniform on $\{-1, +1\}$) and $g_1, \dots, g_n$ for a list of independent standard Gaussian random variables. The so-called empirical process, a Banach-space-valued random variable, has been studied intensively to understand properties of Banach spaces. It is defined in the following way,

$$X = \sum_{i=1}^n \epsilon_i x_i,$$

where the $x_i$'s are vectors in a Banach space $B$. We are interested in the expectation of $\|X\|$. For this quantity to be well defined (measurable, integrable, etc.), some assumptions have to be made carefully; see (Ledoux & Talagrand, 1991) for details. We will not worry about these sophisticated aspects of the theory here.
In our learning setup, we fix a sample $z_1, \dots, z_n$ and a hypothesis class $\mathcal{F}$, and the space is defined by $\mathbb{R}^{\mathcal{F}}$, the real-valued functions on $\mathcal{F}$; the vector $x_i$ is the evaluation map $f \mapsto f(z_i)$. Together with the quantity

$$\|v\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} v(f)$$

playing the role of the norm in a Banach space, $(\mathbb{R}^{\mathcal{F}}, \|\cdot\|_{\mathcal{F}})$ is a space analogous to a Banach space, although not quite the same, since $\|\cdot\|_{\mathcal{F}}$ need not be symmetric. The space $\ell_\infty$ is a special case of the space we defined above, and it is a genuine Banach space. For simplicity, we will denote $\sup_{f\in\mathcal{F}} \sum_{i=1}^n \epsilon_i f(z_i)$ by $\big\|\sum_{i=1}^n \epsilon_i f(z_i)\big\|_{\mathcal{F}}$.
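As a concrete illustration (an added example, not part of the original argument), take the class of linear predictors $\mathcal{F} = \{z \mapsto \langle w, z\rangle : \|w\|_2 \le 1\}$ on data $z_i \in \mathbb{R}^d$. The quantity playing the role of the norm then evaluates to an honest Euclidean norm:

$$\Big\|\sum_{i=1}^n \epsilon_i f(z_i)\Big\|_{\mathcal{F}} = \sup_{\|w\|_2 \le 1}\Big\langle w, \sum_{i=1}^n \epsilon_i z_i\Big\rangle = \Big\|\sum_{i=1}^n \epsilon_i z_i\Big\|_2,$$

so in this case the empirical process is literally a Banach-space-valued random vector and its expected "norm" is the familiar quantity $\mathbb{E}\big\|\sum_i \epsilon_i z_i\big\|_2$.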
The Rademacher complexity is defined by

$$\mathcal{R}(\mathcal{F}) := \frac{1}{n}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i f(z_i)\Big\|_{\mathcal{F}},$$

and similarly the Gaussian complexity is defined by

$$\mathcal{G}(\mathcal{F}) := \frac{1}{n}\,\mathbb{E}_g\Big\|\sum_{i=1}^n g_i f(z_i)\Big\|_{\mathcal{F}}.$$

By Azuma's inequality together with a bounded difference condition (i.e., McDiarmid's inequality), the sample complexity of ERM over a hypothesis class is controlled by the Rademacher/Gaussian complexity of that class. In the study of fast learning rates, with extra conditions and the more powerful Talagrand concentration inequality, the sample complexity is again reduced to controlling the Rademacher/Gaussian complexity.
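To make the definitions concrete, here is a minimal Monte Carlo sketch (an illustration added to this note, using the bounded linear class from the example above and the normalized $1/n$ convention) that estimates both complexities on a fixed sample:

```python
import numpy as np

def empirical_complexities(Z, num_trials=2000, seed=0):
    """Monte Carlo estimates of the empirical Rademacher and Gaussian
    complexities of F = {z -> <w, z> : ||w||_2 <= 1} on a fixed sample Z.

    For this class, sup_{f in F} (1/n) sum_i s_i f(z_i) = ||(1/n) sum_i s_i z_i||_2,
    so each random draw reduces to computing a Euclidean norm.
    """
    rng = np.random.default_rng(seed)
    n, _ = Z.shape
    rad_vals, gauss_vals = [], []
    for _ in range(num_trials):
        eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        g = rng.standard_normal(n)              # standard Gaussian weights
        rad_vals.append(np.linalg.norm(eps @ Z) / n)
        gauss_vals.append(np.linalg.norm(g @ Z) / n)
    return float(np.mean(rad_vals)), float(np.mean(gauss_vals))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Z = rng.standard_normal((200, 5))           # toy sample: n = 200 points in R^5
    R, G = empirical_complexities(Z)
    print(f"Rademacher ~ {R:.4f}   Gaussian ~ {G:.4f}   ratio G/R ~ {G / R:.3f}")
```

For this toy class the printed ratio stays close to a small constant, consistent with the equivalence proved in the next section.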
Equivalence Between $\mathcal{R}(\mathcal{F})$ and $\mathcal{G}(\mathcal{F})$
First, let’s show the equivalence of these two quantities; that is,

$$\sqrt{\frac{2}{\pi}}\;\mathcal{R}(\mathcal{F}) \;\le\; \mathcal{G}(\mathcal{F}) \;\le\; C\sqrt{\log n}\;\mathcal{R}(\mathcal{F})$$

for some absolute constant $C$. For the left part, we need the trick that $g_i$ has the same distribution as $\epsilon_i|g_i|$, together with $\mathbb{E}|g_i| = \sqrt{2/\pi}$. So

$$\mathcal{R}(\mathcal{F}) = \frac{1}{n}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i f(z_i)\Big\|_{\mathcal{F}}
= \frac{1}{c\,n}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i\,\mathbb{E}|g_i|\, f(z_i)\Big\|_{\mathcal{F}}
\le \frac{1}{c\,n}\,\mathbb{E}_\epsilon\,\mathbb{E}_g\Big\|\sum_{i=1}^n \epsilon_i |g_i|\, f(z_i)\Big\|_{\mathcal{F}}
= \frac{1}{c}\,\mathcal{G}(\mathcal{F}),$$

where $c = \mathbb{E}|g_1| = \sqrt{2/\pi}$, and the inequality uses Jensen's inequality (the supremum defining $\|\cdot\|_{\mathcal{F}}$ is convex).
For the right part, note that $g_i$ has the same distribution as $\epsilon_i|g_i|$. So

$$\mathcal{G}(\mathcal{F}) = \frac{1}{n}\,\mathbb{E}_g\Big\|\sum_{i=1}^n g_i f(z_i)\Big\|_{\mathcal{F}} = \frac{1}{n}\,\mathbb{E}_{|g|}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i |g_i|\, f(z_i)\Big\|_{\mathcal{F}}.$$

Here we need the Kahane contraction principle,

$$\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \alpha_i \epsilon_i x_i\Big\| \;\le\; \max_{i\le n}|\alpha_i|\;\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i x_i\Big\|,$$

where the $\alpha_i$ are arbitrary real numbers (by homogeneity we may assume $\max_i|\alpha_i| \le 1$). To see why this is true, we only need to note that the function

$$(\alpha_1,\dots,\alpha_n) \;\mapsto\; \mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \alpha_i \epsilon_i x_i\Big\|$$

is convex over the unit cube $[-1,1]^n$ for any fixed $x_1,\dots,x_n$. Therefore the function attains its maximum at the extreme points; that is, at $\alpha \in \{-1,+1\}^n$. And since the $\epsilon_i$ are symmetric and independent, the $\epsilon_i$'s and the $\alpha_i\epsilon_i$'s have the same (independent) distribution for any such $\alpha$, so the value at every extreme point is exactly $\mathbb{E}_\epsilon\|\sum_i \epsilon_i x_i\|$. So by the Kahane contraction principle (applied conditionally on $|g_1|,\dots,|g_n|$), we have

$$\frac{1}{n}\,\mathbb{E}_{|g|}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i |g_i|\, f(z_i)\Big\|_{\mathcal{F}} \;\le\; \frac{1}{n}\,\mathbb{E}\Big[\max_{i\le n}|g_i|\Big]\;\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i f(z_i)\Big\|_{\mathcal{F}} \;\le\; C\sqrt{\log n}\;\mathcal{R}(\mathcal{F}).$$

The last inequality holds by upper bounding the $\ell_\infty$ norm of the Gaussian vector $(g_1,\dots,g_n)$: in expectation, $\mathbb{E}\max_{i\le n}|g_i| \le \sqrt{2\log(2n)}$.
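To fill in that last step (a standard computation, added here for completeness), the exponential moment method gives, for any $\lambda > 0$,

$$\mathbb{E}\max_{i\le n}|g_i| \;\le\; \frac{1}{\lambda}\log \mathbb{E}\,e^{\lambda \max_{i\le n}|g_i|} \;\le\; \frac{1}{\lambda}\log\Big(\sum_{i=1}^n \mathbb{E}\big[e^{\lambda g_i} + e^{-\lambda g_i}\big]\Big) \;=\; \frac{\log(2n)}{\lambda} + \frac{\lambda}{2},$$

and optimizing over $\lambda$ (take $\lambda = \sqrt{2\log(2n)}$) yields $\mathbb{E}\max_{i\le n}|g_i| \le \sqrt{2\log(2n)}$.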
Different Variants of $\mathcal{R}(\mathcal{F})$
There are several different definitions of the Rademacher and Gaussian complexity. A very common variant takes the definition

$$\tilde{\mathcal{R}}(\mathcal{F}) := \frac{1}{n}\,\mathbb{E}_\epsilon \sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^n \epsilon_i f(z_i)\Big|,$$

or, denoted in our notation, $\mathcal{R}(\mathcal{F}\cup-\mathcal{F})$. If $\mathcal{F}$ is symmetric, that is $\mathcal{F} = -\mathcal{F}$, then $\tilde{\mathcal{R}}(\mathcal{F}) = \mathcal{R}(\mathcal{F})$. Otherwise, $\mathcal{R}(\mathcal{F}) \le \tilde{\mathcal{R}}(\mathcal{F})$, and $\tilde{\mathcal{R}}(\mathcal{F}) = \mathcal{R}(\mathcal{F}\cup-\mathcal{F}) \le 2\,\mathcal{R}(\mathcal{F}\cup\{0\})$. Therefore, these two definitions are equivalent up to constants. Our definition, however, enjoys a better contraction property when the hypothesis class is composed with a Lipschitz loss function.
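The contraction property alluded to is, in the form I believe is intended (see e.g. (Ledoux & Talagrand, 1991), Chapter 4), the following: if each $\phi_i : \mathbb{R}\to\mathbb{R}$ is $L$-Lipschitz, then

$$\mathbb{E}_\epsilon \sup_{f\in\mathcal{F}} \sum_{i=1}^n \epsilon_i\,\phi_i\big(f(z_i)\big) \;\le\; L\,\mathbb{E}_\epsilon \sup_{f\in\mathcal{F}} \sum_{i=1}^n \epsilon_i\, f(z_i),$$

that is, $\mathcal{R}(\phi\circ\mathcal{F}) \le L\,\mathcal{R}(\mathcal{F})$ with no extra constant; the absolute-value variant only admits an analogous comparison with an additional factor of $2$ and the requirement $\phi_i(0)=0$.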
Another possible variant of the definition is the worst-case version

$$\sup_{z_1,\dots,z_n\in\mathcal{Z}}\;\frac{1}{n}\,\mathbb{E}_\epsilon \sup_{f\in\mathcal{F}} \sum_{i=1}^n \epsilon_i f(z_i),$$

for samples $z_1,\dots,z_n$ ranging over the whole domain $\mathcal{Z}$. Clearly this is no smaller than $\mathcal{R}(\mathcal{F})$, and it is not needed whenever $\mathcal{R}(\mathcal{F})$ itself gets controlled.
Reference
- Ledoux, M., & Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Berlin Heidelberg.