Upper and Lower Bounds on the Sample Complexity of Supervised Learning


In the note on No-Free-Lunch, we concluded that there is no learning algorithm that solves all problems at a fixed learning rate; this is because the approximation error cannot be controlled uniformly over distributions. However, the behavior of the estimation error has been well understood since the 1990s. These results can be found in several textbooks on statistical learning theory, e.g., (Vapnik, 1998; Devroye, Györfi, & Lugosi, 1997).

When the expected risk satisfies $L_{\mathcal{D}}(h) > 0$ for all $h \in \mathcal{H}$ (the agnostic case), the sample complexity $m_{\mathcal{H}}(\epsilon, \delta)$ of learning a hypothesis class $\mathcal{H}$ of VC dimension $d$ to excess risk $\epsilon$ with confidence $1 - \delta$ satisfies

$$ C_1 \, \frac{d + \log(1/\delta)}{\epsilon^2} \;\le\; m_{\mathcal{H}}(\epsilon, \delta) \;\le\; C_2 \, \frac{d + \log(1/\delta)}{\epsilon^2}, $$

where $C_1$ and $C_2$ are absolute constants.

When $L_{\mathcal{D}}(h) = 0$ for some $h \in \mathcal{H}$ (the realizable case),

$$ C_1 \, \frac{d + \log(1/\delta)}{\epsilon} \;\le\; m_{\mathcal{H}}(\epsilon, \delta) \;\le\; C_2 \, \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}. $$

Note that the loss function considered here is the $0$-$1$ loss, and the VC dimension is taken with respect to the loss function composed with the hypothesis. The upper bounds actually come from the uniform convergence of the empirical risk to the expected risk over all hypotheses in $\mathcal{H}$.
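
To get a feel for the two rates, here is a small numerical sketch; the absolute constants are not specified by the theorem, so the placeholder `C=1.0` below is only meant to show how the bounds scale with $d$, $\epsilon$ and $\delta$.

```python
import numpy as np

def agnostic_rate(d, eps, delta, C=1.0):
    """Agnostic rate C * (d + log(1/delta)) / eps^2 (C is a placeholder constant)."""
    return C * (d + np.log(1.0 / delta)) / eps ** 2

def realizable_rate(d, eps, delta, C=1.0):
    """Realizable rate C * (d * log(1/eps) + log(1/delta)) / eps."""
    return C * (d * np.log(1.0 / eps) + np.log(1.0 / delta)) / eps

d, delta = 10, 0.05
for eps in (0.1, 0.01):
    print(f"eps = {eps}: agnostic ~ {agnostic_rate(d, eps, delta):.0f} samples, "
          f"realizable ~ {realizable_rate(d, eps, delta):.0f} samples")
```

The gap between the two cases is essentially an extra factor of $1/\epsilon$ in the agnostic rate.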

If we consider convex loss functions instead of the $0$-$1$ loss, we can use the Rademacher complexity to obtain similar upper bounds.
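
For a concrete instance of this route, the empirical Rademacher complexity of the norm-bounded linear class $\{x \mapsto \langle w, x\rangle : \lVert w \rVert \le B\}$ equals $\frac{B}{n}\,\mathbb{E}_\sigma\big\lVert \sum_i \sigma_i x_i \big\rVert$ and is at most $\frac{B}{n}\sqrt{\sum_i \lVert x_i\rVert^2}$; composing with a Lipschitz convex loss only scales this by the Lipschitz constant (contraction). The sketch below checks the identity against its bound on a synthetic Gaussian sample, which is just an arbitrary stand-in for data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 200, 5, 1.0                       # sample size, dimension, norm bound on w
X = rng.normal(size=(n, d))                 # synthetic sample, illustration only

# Empirical Rademacher complexity of {x -> <w, x> : ||w|| <= B}:
#   R_hat(S) = (B / n) * E_sigma || sum_i sigma_i x_i ||,
# estimated here by Monte Carlo over the random signs sigma_i.
sigmas = rng.choice([-1.0, 1.0], size=(2000, n))
rad_hat = B / n * np.linalg.norm(sigmas @ X, axis=1).mean()

# Classical bound: (B / n) * sqrt(sum_i ||x_i||^2), which decays like B / sqrt(n).
bound = B / n * np.sqrt((X ** 2).sum())
print(f"Monte Carlo estimate: {rad_hat:.4f}   upper bound: {bound:.4f}")
```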

Q: Lower bounds for this case?

The convex loss case can be further generalized to the stochastic convex optimization problem. By (missing reference), even though for supervised learning problems the estimation error of ERM can be controlled by uniform convergence results, ERM may fail completely in the general case, and naturally we do not have uniform convergence there either. The counterexample is constructed in two steps. First, consider a finite-dimensional space $\mathbb{R}^d$ and a sample of size $n$, where $d \ge 2^{n+1}$. The function is $f(w; x) = \lVert x \odot w \rVert$, where $\odot$ represents the element-wise product and each coordinate of $x \in \{0, 1\}^d$ is an independent Bernoulli$(1/2)$ random variable. The stochastic convex optimization problem we consider here is

$$ \min_{\lVert w \rVert \le 1} \; F(w) := \mathbb{E}_x \big[ f(w; x) \big] = \mathbb{E}_x \big[ \lVert x \odot w \rVert \big]. $$

The objective is $1$-Lipschitz. Since $d \ge 2^{n+1}$, it is not hard to see that with probability greater than $1 - e^{-d/2^n} \ge 1 - e^{-2}$ there exists a coordinate that is not observed (i.e., equals zero) in all of the samples. Assume this is the $j$th coordinate. Then $e_j$ is an empirical minimizer, attaining $\hat{F}_n(e_j) = \frac{1}{n}\sum_{i=1}^n f(e_j; x^{(i)}) = 0$. However, the expected objective at $e_j$ is $F(e_j) = \Pr(x_j = 1) = 1/2$. This is why (Feldman, 2016) notes that an $\Omega(\log d)$ lower bound was already proved in (missing reference). Feldman then shows that for finite-dimensional Lipschitz objectives over the unit ball the dimension dependence is much stronger: the sample complexity of uniform convergence grows linearly in $d$ (on the order of $d/\epsilon^2$), so the sample complexity of ERM sits between the $\Omega(\log d)$ lower bound above and the $O(d/\epsilon^2)$ guarantee that uniform convergence provides. Without the Lipschitz condition, even in a finite-dimensional space there are stochastic convex optimization problems that cannot be solved by ERM at all.
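
Here is a short simulation of the finite-dimensional construction; the specific values of $n$ and $d$ are only chosen so that an unobserved coordinate shows up with overwhelming probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # sample size
d = 2 ** (n + 4)                         # comfortably above the 2^(n+1) threshold
X = rng.integers(0, 2, size=(n, d))      # each coordinate of x is Bernoulli(1/2)

def f(w, x):
    """f(w; x) = ||x * w||, where * is the element-wise product."""
    return np.linalg.norm(x * w)

# With probability at least 1 - exp(-d / 2^n), some coordinate is 0 in every sample.
unseen = np.flatnonzero(X.sum(axis=0) == 0)
assert unseen.size > 0
j = unseen[0]
e_j = np.zeros(d)
e_j[j] = 1.0

emp_risk = np.mean([f(e_j, x) for x in X])    # empirical risk at e_j: exactly 0
print(f"unseen coordinates: {unseen.size}")
print(f"empirical risk at e_j: {emp_risk}")   # 0.0
print("expected risk at e_j: 0.5")            # E||x * e_j|| = P(x_j = 1) = 1/2
```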

Now the construction can be extended to the case $d = \infty$: take $w$ in the unit ball of $\ell_2$ and let $x$ be an infinite sequence of independent Bernoulli$(1/2)$ variables. Since each coordinate fails to be observed independently with probability $2^{-n} > 0$, almost surely some coordinate is not observed in any of the samples.

Even though in this case the minimizer set of the empirical risk always contains $w = 0$, which is also the minimizer of the expected objective — so ERM might still return a good solution — the objective can be modified to rule this out by adding a deterministic term at the end; for instance, subtract a small linear term:

$$ \tilde{f}(w; x) = \lVert x \odot w \rVert - \epsilon \, \langle v, w \rangle, $$

where $v$ is a fixed unit vector with strictly positive coordinates (in the finite-dimensional case one can take $v = \mathbf{1}/\sqrt{d}$).

Then we apply the same argument: assume the $j$th coordinate is not observed. Then $\hat{F}_n(w + t e_j) = \hat{F}_n(w) - \epsilon v_j t$ for every $t > 0$, whatever the point $w$ is, so moving along $e_j$ always decreases the empirical risk. In other words, the unconstrained empirical problem has no minimizer; its infimum is $-\infty$. The problem with the constraint $\lVert w \rVert \le 1$ must therefore attain its minimum on the boundary $\lVert w \rVert = 1$ (an interior minimizer would also minimize the unconstrained problem). However, note that

$$ F(w) = \mathbb{E}_x \lVert x \odot w \rVert - \epsilon \langle v, w \rangle \;\ge\; \frac{\lVert w \rVert}{2} - \epsilon \lVert w \rVert, $$

since $\mathbb{E}_x \lVert x \odot w \rVert \ge \mathbb{E}_x \big[\textstyle\sum_k x_k w_k^2\big] / \lVert w \rVert = \lVert w \rVert / 2$ and $\langle v, w \rangle \le \lVert w \rVert$.

When $\epsilon$ is small enough ($\epsilon < 1/2$), this shows that the solution set of the expected risk is contained in a small ball around the origin (here it is just $\{0\}$), which is far from the solution set of the empirical risk sitting on the unit sphere.
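
A minimal numerical check of the perturbed construction with the linear term $-\epsilon\langle v, w\rangle$ from above: the empirical risk already prefers $e_j$ (and, by pushing further along $e_j$, the boundary of the ball) to the origin, while the expected risk prefers the origin.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
d = 2 ** (n + 4)
eps = 0.1                                 # size of the deterministic perturbation
v = np.ones(d) / np.sqrt(d)               # fixed unit vector with positive coordinates
X = rng.integers(0, 2, size=(n, d))

def f_mod(w, x):
    """Perturbed loss ||x * w|| - eps * <v, w>; the added term is deterministic."""
    return np.linalg.norm(x * w) - eps * (v @ w)

j = np.flatnonzero(X.sum(axis=0) == 0)[0]     # a coordinate unseen in the sample
e_j = np.zeros(d)
e_j[j] = 1.0

emp = lambda w: np.mean([f_mod(w, x) for x in X])
print(f"empirical risk at 0:   {emp(np.zeros(d)):.6f}")   # 0
print(f"empirical risk at e_j: {emp(e_j):.6f}")           # -eps / sqrt(d) < 0
# The empirical risk keeps decreasing along e_j, so every constrained empirical
# minimizer lies on the boundary ||w|| = 1.  In contrast,
#   E[f_mod(w; x)] >= (1/2 - eps) * ||w||,
# so the population minimizer is w = 0; at e_j the population risk is:
print(f"expected risk at e_j:  {0.5 - eps / np.sqrt(d):.6f}")
```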

This example is not strongly convex either. In the strongly convex case ERM always works: for a $\lambda$-strongly convex, $L$-Lipschitz objective, the excess risk of ERM is $O\big(L^2/(\lambda n)\big)$, so $O\big(L^2/(\lambda \epsilon)\big)$ samples suffice, even though uniform convergence may still fail. This rate is similar to the fast rate for regularized risks in the analysis of KSVM.
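
To see concretely how strong convexity rescues ERM here, add a regularizer $\frac{\lambda}{2}\lVert w\rVert^2$ to the perturbed objective. Restricted to the unobserved direction $e_j$ (the direction that broke ERM above), the empirical risk becomes the one-dimensional quadratic $-\epsilon v_j t + \frac{\lambda}{2}t^2$, whose minimizer $t = \epsilon v_j/\lambda$ sits next to the origin instead of on the boundary. A quick check of this slice (the values of $\epsilon$, $\lambda$, $d$ are arbitrary):

```python
import numpy as np

eps, lam = 0.1, 0.5          # perturbation size and strong convexity parameter
d = 2 ** 14                  # dimension, so v_j = 1 / sqrt(d) for v = ones / sqrt(d)
v_j = 1.0 / np.sqrt(d)

# Empirical risk along t * e_j once (lam / 2) * ||w||^2 is added:
#   g(t) = -eps * v_j * t + (lam / 2) * t^2     (the ||x * w|| part vanishes there).
ts = np.linspace(0.0, 1.0, 100001)
g = -eps * v_j * ts + 0.5 * lam * ts ** 2
t_hat = ts[np.argmin(g)]
print(f"regularized minimizer along e_j: t = {t_hat:.5f}")
print(f"closed form eps * v_j / lam:     t = {eps * v_j / lam:.5f}")
# Without the regularizer, g(t) = -eps * v_j * t decreases all the way to t = 1,
# which is exactly what pushed the unregularized ERM solution to the boundary.
```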

Reference

  1. Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley.
  2. Devroye, L., Györfi, L., & Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition. Springer New York.
  3. Feldman, V. (2016). Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 3576–3584). Curran Associates, Inc.