On the No-Free-Lunch Theorem


This note discusses the implications of the celebrated No-Free-Lunch Theorem for kernel SVM (KSVM), RF-SVM, and neural networks.

Consistency and Learning Rate

Definition ((Steinwart & Christmann, 2008)). Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function on the space $X \times Y$, and let $P$ be a distribution on $X \times Y$. Assume that $\mathcal{L}$ is a measurable learning method that maps a sample $D \in (X \times Y)^n$ to a decision function $f_D$. Then $\mathcal{L}$ is said to be $L$-risk consistent for $P$ if, for all $\varepsilon > 0$, we have

$$\lim_{n \to \infty} P^n\Big( D \in (X \times Y)^n : \mathcal{R}_{L,P}(f_D) \le \mathcal{R}_{L,P}^* + \varepsilon \Big) = 1.$$

Moreover, if $\mathcal{L}$ is $L$-risk consistent for all distributions $P$ on $X \times Y$, it is called universally $L$-risk consistent.

Basically, a consistent learning algorithm returns an arbitrarily good hypothesis with high probability, provided there are sufficiently many samples.
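To make the definition concrete, here is a minimal numerical sketch of my own (not from (Steinwart & Christmann, 2008)): a one-dimensional histogram classifier on a toy distribution, where we estimate the probability that the excess classification risk stays below a fixed $\varepsilon$ as the sample size grows. The distribution, the histogram rule, and all parameter choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution: X ~ Uniform[0, 1] and P(Y = 1 | X = x) = x.
# The Bayes classifier predicts 1 iff x > 1/2; its risk is E[min(x, 1 - x)] = 1/4.
BAYES_RISK = 0.25

def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(size=n) < x).astype(int)
    return x, y

def histogram_rule(x, y, n_bins):
    """Learn a piecewise-constant classifier by majority vote in each bin."""
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    votes = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    np.add.at(votes, bins, y)
    np.add.at(counts, bins, 1)
    return (votes > counts / 2).astype(int)   # empty bins / ties default to label 0

def risk(pred_per_bin):
    """Exact classification risk of a piecewise-constant rule under the toy distribution."""
    n_bins = len(pred_per_bin)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    lo, hi = edges[:-1], edges[1:]
    p1 = (hi**2 - lo**2) / 2.0          # P(Y = 1, X in bin)
    p0 = (hi - lo) - p1                 # P(Y = 0, X in bin)
    return float(np.sum(np.where(pred_per_bin == 1, p0, p1)))

eps, trials = 0.05, 200
for n in [50, 200, 1000, 5000]:
    n_bins = max(2, int(round(n ** (1 / 3))))   # let the number of bins grow slowly with n
    hits = sum(
        risk(histogram_rule(*sample(n), n_bins)) <= BAYES_RISK + eps
        for _ in range(trials)
    )
    print(f"n={n:5d}  P^n(excess risk <= {eps}) ~ {hits / trials:.2f}")
```

The estimated probability climbs toward $1$ as $n$ grows, which is exactly the statement of $L$-risk consistency for this particular $P$.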

Q: Is universal consistency a reasonable requirement?

However, consistency does not say anything about how many samples are required to achieve an $\varepsilon$-optimal hypothesis. In fact, the definition is equivalent to the following formulation: for every $\tau \in (0, 1)$ there exists a sequence $(\varepsilon_n)_{n \ge 1}$ with $\varepsilon_n \to 0$ (depending on $P$ and $\tau$) such that

$$P^n\Big( D \in (X \times Y)^n : \mathcal{R}_{L,P}(f_D) \le \mathcal{R}_{L,P}^* + \varepsilon_n \Big) \ge 1 - \tau \quad \text{for all sufficiently large } n.$$

The sequence $(\varepsilon_n)$ is the learning rate and the constant $\tau$ is the confidence level. The No-Free-Lunch Theorem says that for a fixed learning rate and confidence level, there is no learning algorithm that works for all distributions.
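A brief sketch of why the two formulations agree (my own wording of a standard argument): fix $\tau \in (0, 1)$. For each $k \ge 1$, consistency yields an $n_k$ with

$$P^n\Big( D \in (X \times Y)^n : \mathcal{R}_{L,P}(f_D) \le \mathcal{R}_{L,P}^* + \tfrac{1}{k} \Big) \ge 1 - \tau \quad \text{for all } n \ge n_k,$$

and we may take $n_1 < n_2 < \cdots$. Setting $\varepsilon_n := \tfrac{1}{k}$ for $n_k \le n < n_{k+1}$ gives a sequence $\varepsilon_n \to 0$ satisfying the bound above for all $n \ge n_1$. Conversely, for any fixed $\varepsilon > 0$ we eventually have $\varepsilon_n \le \varepsilon$, so the probability of returning an $\varepsilon$-optimal hypothesis is at least $1 - \tau$ for large $n$ and every $\tau$, which is consistency.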

Theorem ((Steinwart & Christmann, 2008)). Let $(\varepsilon_n)_{n \ge 1} \subset (0, 1/16]$ be a decreasing sequence that converges to $0$. Let $(X, \mathcal{A}, \mu)$ be an atom-free probability space, $Y = \{-1, +1\}$, and $L$ be the binary classification loss. Then for every measurable learning method $\mathcal{L}$ on $X \times Y$, there exists a distribution $P$ on $X \times Y$ with $P_X = \mu$ such that $\mathcal{R}_{L,P}^* = 0$ and

$$\mathbb{E}_{D \sim P^n}\, \mathcal{R}_{L,P}(f_D) \ge \varepsilon_n \quad \text{for all } n \ge 1.$$

This theorem is stated in expectation, but similar results can be obtained in terms of the learning rate and confidence level, and extended to any reasonable loss function.
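The theorem above is proved by constructing, for each learner, a sequence of hard distributions. A much cruder illustration of the same "no learner is good for every target" phenomenon is the classical averaging argument on a finite domain. The toy sketch below is my own illustration, not the construction in (Steinwart & Christmann, 2008): it enumerates all labelings of a small domain and checks that any deterministic learner has average off-sample error exactly $1/2$.

```python
import itertools
import numpy as np

# Finite domain {0, ..., m-1}; the first n points form the training sample.
m, n = 8, 3
train_idx = np.arange(n)
test_idx = np.arange(n, m)

def learner_memorize_then_zero(x_train, y_train):
    """Deterministic learner: memorize training labels, predict 0 elsewhere."""
    lookup = dict(zip(x_train.tolist(), y_train.tolist()))
    return lambda x: np.array([lookup.get(int(xi), 0) for xi in x])

def learner_majority(x_train, y_train):
    """Deterministic learner: predict the training-set majority label everywhere."""
    majority = int(np.sum(y_train) * 2 >= len(y_train))
    return lambda x: np.full(len(x), majority)

for name, learner in [("memorize-then-zero", learner_memorize_then_zero),
                      ("majority", learner_majority)]:
    errors = []
    # Average the off-sample error over all 2^m possible labelings of the domain.
    for labels in itertools.product([0, 1], repeat=m):
        y = np.array(labels)
        predict = learner(train_idx, y[train_idx])
        errors.append(np.mean(predict(test_idx) != y[test_idx]))
    print(f"{name:>20s}: mean off-sample error over all labelings = {np.mean(errors):.3f}")
# Both learners print 0.500: averaged over all possible targets, no learner beats chance.
```

Of course, averaging over all labelings is a much weaker statement than the theorem, which exhibits a single bad distribution with zero Bayes risk for the prescribed rate; the sketch only conveys why some restriction on the class of distributions is unavoidable.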

From the analysis of the learning rate of SVM, we know that the generalization error can be decomposed into an estimation part and an approximation part (written out below). The estimation error can be bounded uniformly over all distributions via the capacity of the hypothesis class, e.g., its VC dimension or Rademacher complexity. The approximation part, however, is always entangled with the assumptions placed on the distribution to be learned. For example, for KSVM the approximation error can be controlled when the distribution satisfies Tsybakov's noise condition, or when some hypothesis in the class achieves a certain level of error rate. The resulting learning rate then only applies to target distributions satisfying these assumptions. In this way, the learning-rate results for KSVM do not contradict the No-Free-Lunch Theorem.
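For concreteness, the decomposition referred to above can be written as follows (standard form; $\mathcal{F}$ denotes the hypothesis class used by the learner, e.g. a ball of the RKHS for KSVM):

$$\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}^* = \underbrace{\Big( \mathcal{R}_{L,P}(f_D) - \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f) \Big)}_{\text{estimation error}} + \underbrace{\Big( \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,P}^* \Big)}_{\text{approximation error}}.$$

The first term admits distribution-free bounds through the capacity of $\mathcal{F}$; the second is where distributional assumptions necessarily enter.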

Reference

  1. Steinwart, I., & Christmann, A. (2008). Support Vector Machines. Springer New York.