Upper and Lower Bounds on the Sample Complexity of Supervised Learning
In the note on No-Free-Lunch, we concluded that there is no learning algorithm that solves all problems at a fixed learning rate. This is due to the difficulty of controlling the approximation error. The behavior of the estimation error, however, has been well understood since the 1990s. These results can be found in several textbooks on statistical learning theory, e.g., (Vapnik, 1998; Devroye, Györfi, & Lugosi, 1996).
When $\min_{h\in\mathcal H} L_{\mathcal D}(h) = 0$ for all data distributions $\mathcal D$ under consideration (the realizable case), for all $\epsilon,\delta\in(0,1)$,
$$
C_1\,\frac{d+\log(1/\delta)}{\epsilon}\;\le\; m_{\mathcal H}(\epsilon,\delta)\;\le\; C_2\,\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon},
$$
where $C_1$ and $C_2$ are absolute constants, $d$ is the VC dimension, $L_{\mathcal D}(h)$ is the expected risk of $h\in\mathcal H$ under $\mathcal D$, and $m_{\mathcal H}(\epsilon,\delta)$ is the number of samples needed to guarantee estimation error at most $\epsilon$ with probability at least $1-\delta$.
When $\min_{h\in\mathcal H} L_{\mathcal D}(h) > 0$ for some admissible distribution $\mathcal D$ (the agnostic case),
$$
C_1\,\frac{d+\log(1/\delta)}{\epsilon^{2}}\;\le\; m_{\mathcal H}(\epsilon,\delta)\;\le\; C_2\,\frac{d+\log(1/\delta)}{\epsilon^{2}}.
$$
Note that the loss function considered here is the $0$-$1$ loss, and the VC dimension $d$ is with respect to the class of loss functions composed with the hypotheses, i.e., $\{(x,y)\mapsto\mathbb 1[h(x)\neq y]:h\in\mathcal H\}$. The upper bound actually comes from the uniform convergence of the empirical risk to the expected risk over all hypotheses in $\mathcal H$.
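For concreteness, one standard form of the VC uniform-convergence bound behind this upper bound (writing $\hat L_S$ for the empirical risk on a sample $S$ of size $n\gtrsim d$, with an unspecified absolute constant $C$) is: with probability at least $1-\delta$,
$$
\sup_{h\in\mathcal H}\bigl|L_{\mathcal D}(h)-\hat L_{S}(h)\bigr|\;\le\;C\,\sqrt{\frac{d\log(n/d)+\log(1/\delta)}{n}},
$$
and on that event the empirical risk minimizer $\hat h$ satisfies, for every $h^{\star}\in\mathcal H$,
$$
L_{\mathcal D}(\hat h)-L_{\mathcal D}(h^{\star})\;\le\;2\,\sup_{h\in\mathcal H}\bigl|L_{\mathcal D}(h)-\hat L_{S}(h)\bigr|,
$$
which, solved for $n$, gives an upper bound of the same $d/\epsilon^{2}$ type (up to the logarithmic factor).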
If we consider convex loss functions instead of the $0$-$1$ loss, we can use the Rademacher complexity to obtain similar upper bounds.
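One standard Rademacher-complexity bound of this type, for a loss class $\mathcal F=\{(x,y)\mapsto\ell(h(x),y):h\in\mathcal H\}$ with values in $[0,1]$ (notation mine), is: with probability at least $1-\delta$, for every $h\in\mathcal H$,
$$
L_{\mathcal D}(h)\;\le\;\hat L_{S}(h)+2\,\mathfrak R_{n}(\mathcal F)+\sqrt{\frac{\log(1/\delta)}{2n}},
$$
where $\mathfrak R_{n}(\mathcal F)$ is the Rademacher complexity of $\mathcal F$; for a convex Lipschitz loss, the contraction lemma bounds $\mathfrak R_{n}(\mathcal F)$ by (a constant times) the Rademacher complexity of $\mathcal H$ itself.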
Q: Lower bounds for this case?
The convex loss function case can be further generalized to the stochastic convex optimization problem. As shown in (missing reference), even though for supervised learning problems the estimation error of ERM is guaranteed by uniform convergence results, ERM may not work at all in the general case, and naturally uniform convergence cannot hold there either. The counterexample is constructed in two steps. First, consider the finite-dimensional space $\mathbb R^{d}$ and sample size $n$. The function is $f(w;\alpha)=\|\alpha\odot w\|_{2}$, where $\odot$ denotes the element-wise product and each coordinate of $\alpha\in\{0,1\}^{d}$ is an independent Bernoulli$(1/2)$ random variable. The stochastic convex optimization problem we consider here is
$$
\min_{\|w\|_{2}\le 1} F(w) := \mathbb E_{\alpha}\bigl[\,\|\alpha\odot w\|_{2}\,\bigr].
$$
The objective is $1$-Lipschitz. Since $d\ge 2^{n}$, it is not hard to see that with probability greater than $1-1/e$ there exists a coordinate not observed in any of the samples: a fixed coordinate is all-zero across the $n$ samples with probability $2^{-n}$, so the probability that no such coordinate exists is $(1-2^{-n})^{d}\le e^{-d\,2^{-n}}\le e^{-1}$. Assume this coordinate is the $j$-th. Then $w=e_{j}$ is an empirical minimizer, attaining empirical risk $0$. However, the expected objective at $e_{j}$ is $\mathbb E\,\|\alpha\odot e_{j}\|_{2}=\mathbb E[\alpha_{j}]=1/2$. This is why (Feldman, 2016) attributes to (missing reference) a lower bound of order $\log d$ on the sample complexity of ERM, since the construction requires $d\ge 2^{n}$. Feldman actually shows that for finite-dimensional Lipschitz objectives the sample complexity of uniform convergence grows linearly with the dimension, of order $d/\epsilon^{2}$ up to logarithmic factors, and he gives dimension-dependent lower and upper bounds on the sample complexity of ERM itself. Without the Lipschitz condition, even in finite-dimensional spaces there are stochastic convex optimization problems that cannot be solved by ERM.
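As a quick sanity check of this construction, here is a small NumPy sketch (the concrete sizes, the seed, and the Monte Carlo estimate are my own choices; the objective $f(w;\alpha)=\|\alpha\odot w\|_{2}$ and the sampling scheme follow the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
d = 8 * 2 ** n                              # d a few times larger than 2^n, so the
                                            # expected number of unseen coordinates is 8
alphas = rng.integers(0, 2, size=(n, d))    # n samples, each uniform on {0,1}^d

unseen = np.flatnonzero(alphas.sum(axis=0) == 0)
print(f"coordinates never observed: {len(unseen)} out of d = {d}")

if len(unseen) > 0:
    j = unseen[0]
    e_j = np.zeros(d)
    e_j[j] = 1.0

    # Empirical risk at e_j: (1/n) * sum_i ||alpha_i * e_j|| = (1/n) * sum_i alpha_i[j] = 0,
    # so e_j is an empirical risk minimizer.
    emp_risk = np.mean([np.linalg.norm(a * e_j) for a in alphas])

    # Expected risk at e_j: E||alpha * e_j|| = E[alpha_j] = 1/2.  Only coordinate j
    # matters, so a Monte Carlo estimate over alpha_j alone suffices.
    exp_risk = rng.integers(0, 2, size=100_000).mean()

    print(f"empirical risk at e_j: {emp_risk:.3f}")   # 0.000
    print(f"expected  risk at e_j: {exp_risk:.3f}")   # about 0.5
```

So with only $n=\log_2(d/8)$ samples, the ERM that returns $e_j$ has constant excess risk even though its empirical risk is exactly zero.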
Now the construction can be extended to the infinite-dimensional case: $w$ ranges over the unit ball of $\ell_{2}$ and $\alpha$ is an infinite sequence of independent Bernoulli$(1/2)$ variables. Since there are infinitely many coordinates, each unobserved with probability $2^{-n}$ independently, almost surely some coordinate is not observed in any of the samples.
Even though in this case the minimizer set of the empirical risk always contains $0$, which is also the minimizer of the expected objective, the objective can be modified to rule this benign ERM out by adding a small deterministic term at the end; for instance, a linear term $-\epsilon\langle w,v\rangle$ with a fixed vector $v$ of strictly positive coordinates and $\|v\|_{2}=1$.
Then we apply the same argument: assume that the $j$-th coordinate is not observed in any sample. The term $\frac1n\sum_{i=1}^{n}\|\alpha_{i}\odot w\|_{2}$ does not depend on $w_{j}$, while the added linear term strictly decreases as $w_{j}$ grows, whatever the remaining coordinates of the candidate minimizer are. In other words, the unconstrained empirical problem has no solution (its infimum is $-\infty$ along the direction $e_{j}$), so the constrained problem must attain its minimum on the boundary $\{\|w\|_{2}=1\}$. However, note that the expected objective satisfies
$$
F_{\epsilon}(w)\;=\;\mathbb E\,\|\alpha\odot w\|_{2}-\epsilon\langle w,v\rangle\;\ge\;\tfrac12\|w\|_{2}^{2}-\epsilon\|w\|_{2}
\qquad\text{for }\|w\|_{2}\le 1 .
$$
When $\epsilon$ is small enough, this forces every minimizer of the expected risk into a small ball around the origin (of radius $2\epsilon$, as computed below), which is far from the solution set of the empirical risk sitting on the unit sphere.
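The lower bound on the expectation used above is a one-line computation; it only uses that the coordinates of $\alpha$ are Bernoulli$(1/2)$ and that $\|\alpha\odot w\|_{2}\le\|w\|_{2}\le 1$, so $x\ge x^{2}$ applies:
$$
\mathbb E\,\|\alpha\odot w\|_{2}\;\ge\;\mathbb E\,\|\alpha\odot w\|_{2}^{2}\;=\;\sum_{k}\mathbb E[\alpha_{k}]\,w_{k}^{2}\;=\;\tfrac12\|w\|_{2}^{2}.
$$
Consequently, any minimizer of $F_{\epsilon}$ satisfies $\tfrac12\|w\|_{2}^{2}-\epsilon\|w\|_{2}\le F_{\epsilon}(w)\le F_{\epsilon}(0)=0$, hence $\|w\|_{2}\le 2\epsilon$.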
This example is not strongly convex either. In the $\lambda$-strongly convex case, ERM always works with $O\!\bigl(1/(\lambda\epsilon)\bigr)$ samples, even though uniform convergence may still fail. This fast rate is similar to the one obtained for regularized risks in the analysis of KSVM.
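For reference, the fast rate alluded to here is usually obtained through stability rather than uniform convergence. One standard form (constants omitted; the Lipschitz constant $L$ and strong convexity parameter $\lambda$ are with respect to the same norm) is: if every $f(\cdot\,;z)$ is $\lambda$-strongly convex and $L$-Lipschitz over a convex domain, then the empirical minimizer $\hat w$ over $n$ samples satisfies
$$
\mathbb E\bigl[F(\hat w)\bigr]-\min_{w}F(w)\;=\;O\!\Bigl(\frac{L^{2}}{\lambda n}\Bigr),
$$
so accuracy $\epsilon$ requires only $O\!\bigl(L^{2}/(\lambda\epsilon)\bigr)$ samples, independently of the dimension.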
References
- Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley.
- Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer.
- Feldman, V. (2016). Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 3576–3584). Curran Associates, Inc.