On the No-Free-Lunch Theorem
This note discusses the implications of the celebrated No-Free-Lunch Theorem for kernel SVMs, RF-SVMs, and neural networks.
Consistency and Learning Rate
Definition (Steinwart & Christmann, 2008). Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function on the space $X \times Y$, and let $P$ be a distribution on $X \times Y$. Assume that $\mathcal{L}$ is a measurable learning method. Then $\mathcal{L}$ is said to be $L$-risk consistent for $P$ if, for all $\varepsilon > 0$, we have
$$\lim_{n \to \infty} P^n\Big( D \in (X \times Y)^n : \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + \varepsilon \Big) = 1 .$$
Moreover, if $\mathcal{L}$ is $L$-risk consistent for all distributions $P$ on $X \times Y$, it is called universally $L$-risk consistent.
Basically, a consistent learning algorithm returns an arbitrarily good hypothesis with high probability, provided there are sufficiently many samples.
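To make the definition concrete, here is a minimal simulation sketch: it trains an RBF-kernel SVM on a toy distribution with known Bayes risk $0.1$ and watches the excess classification risk shrink as the sample size grows. The toy distribution, the choice of learner, and the use of scikit-learn are my own assumptions for illustration; they are not taken from the reference.

```python
# A minimal sketch: empirically illustrating L-risk consistency for the binary
# classification loss. The synthetic distribution P below and the RBF-kernel SVM
# are assumptions made purely for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points from a toy P: X ~ Uniform[-1, 1], Y = sign(X) flipped w.p. 0.1."""
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sign(x[:, 0])
    flip = rng.random(n) < 0.1          # label noise => Bayes risk is 0.1
    y[flip] *= -1
    return x, y

x_test, y_test = sample(100_000)        # large test set approximates the true risk
bayes_risk = 0.1

for n in [50, 200, 800, 3200]:
    x_train, y_train = sample(n)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(x_train, y_train)
    risk = np.mean(clf.predict(x_test) != y_test)
    print(f"n={n:5d}  excess risk ~ {risk - bayes_risk:.4f}")
```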
Q: Is universal consistency a reasonable requirement?
However, consistency does not say anything about how many samples are required to achieve an $\varepsilon$-optimal hypothesis. In fact, the definition is equivalent to the following formulation: for every $\delta \in (0, 1)$ there exists a decreasing sequence $(\varepsilon_n)$ converging to $0$ such that, for all $n \ge 1$,
$$P^n\Big( D \in (X \times Y)^n : \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + \varepsilon_n \Big) \ge 1 - \delta .$$
The sequence $(\varepsilon_n)$ is the learning rate and the constant $\delta$ is the confidence level. The No-Free-Lunch Theorem says that, for a fixed learning rate and confidence level, there is no learning algorithm that works for all distributions.
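As a small companion sketch (again my own illustration, not part of the reference), an empirical learning rate can be estimated by fitting a power law $\varepsilon_n \approx c\, n^{-\alpha}$ to measured excess risks on a log-log scale:

```python
# A minimal sketch: estimating an empirical learning rate by fitting
# excess_risk ~ c * n^(-alpha) via least squares on a log-log scale.
# The excess risks must be positive; the usage below is a hypothetical example
# reusing the measurements from the previous snippet.
import numpy as np

def fit_learning_rate(ns, excess_risks):
    """Least-squares fit of log(excess_risk) = log(c) - alpha * log(n); returns (alpha, c)."""
    slope, intercept = np.polyfit(np.log(ns), np.log(excess_risks), deg=1)
    return -slope, np.exp(intercept)

# Hypothetical usage with measured excess risks from the consistency experiment:
# alpha, c = fit_learning_rate([50, 200, 800, 3200], measured_excess_risks)
# print(f"empirical rate: excess risk ~ {c:.3f} * n^(-{alpha:.2f})")
```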
Theorem (Steinwart & Christmann, 2008). Let $(\varepsilon_n)$ be a decreasing sequence of positive numbers that converges to $0$. Let $(X, \mathcal{A}, \mu)$ be an atom-free probability space, $Y = \{-1, 1\}$, and let $L_{\mathrm{class}}$ be the binary classification loss. Then for every measurable learning method $\mathcal{L}$ on $X \times Y$, there exists a distribution $P$ on $X \times Y$ with $P_X = \mu$ such that $\mathcal{R}^*_{L_{\mathrm{class}},P} = 0$ and
$$\mathbb{E}_{D \sim P^n}\, \mathcal{R}_{L_{\mathrm{class}},P}(f_D) \ge \varepsilon_n \quad \text{for all } n \ge 1 .$$
This theorem is stated in expectation, but similar results can be obtained in terms of the learning rate and confidence level, and extended to any reasonable loss function.
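The following sketch illustrates the finite-domain intuition behind the theorem, not its exact atom-free-space statement: when labels on a large finite domain of size $m$ are drawn uniformly at random (so that each realized labeling is a zero-Bayes-risk target), any learner that has seen only $n \ll m$ points must guess on the unseen ones, so its average risk cannot drop below roughly $(1 - n/m)/2$. The memorizing learner below is an arbitrary choice; any other learner would fare no better on average.

```python
# A minimal sketch (an illustration of the intuition, not the book's proof): the
# finite-domain averaging argument behind the No-Free-Lunch Theorem. Labels on a
# domain of size m are drawn uniformly at random, the learner sees n << m
# uniformly drawn points, and on unseen points it is right only half the time on
# average, so the average risk stays bounded away from the Bayes risk (which is 0).
import numpy as np

rng = np.random.default_rng(0)

def memorize_learner(xs, ys, m):
    """A simple learner: memorize the seen labels, predict +1 everywhere else."""
    pred = np.ones(m, dtype=int)
    pred[xs] = ys
    return pred

m, n, trials = 1000, 50, 2000
risks = []
for _ in range(trials):
    labels = rng.choice([-1, 1], size=m)     # a random zero-Bayes-risk target
    xs = rng.integers(0, m, size=n)          # training inputs, drawn uniformly
    pred = memorize_learner(xs, labels[xs], m)
    risks.append(np.mean(pred != labels))    # true risk under the uniform marginal

print(f"average risk ~ {np.mean(risks):.3f}  (lower bound (1 - n/m)/2 = {(1 - n/m)/2:.3f})")
```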
From the analysis of the learning rate of SVMs, we know that the generalization error can be decomposed into an estimation part and an approximation part. The estimation error can be bounded uniformly over all distributions via the capacity of the hypothesis class, e.g., its VC dimension or Rademacher complexity. The approximation part, however, is always entangled with the assumptions placed on the distribution to be learned. For example, for KSVM, the approximation error can be controlled when the distribution satisfies Tsybakov's noise condition, or when some hypothesis in the class achieves a certain level of error. The resulting learning rate therefore only applies to target distributions satisfying these assumptions. In this way, the learning-rate results for KSVM do not contradict the No-Free-Lunch Theorem.
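For reference, the decomposition mentioned above can be written explicitly for a learned hypothesis $f_D$ chosen from a hypothesis class $\mathcal{F}$ (a standard identity, stated here with generic $\mathcal{F}$ and $f_D$ rather than for a particular method):
$$\underbrace{\mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P}}_{\text{excess risk}}
\;=\;
\underbrace{\mathcal{R}_{L,P}(f_D) - \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f)}_{\text{estimation error: distribution-free capacity bounds}}
\;+\;
\underbrace{\inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}}_{\text{approximation error: needs assumptions on } P} .$$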
References
- Steinwart, I., & Christmann, A. (2008). Support Vector Machines. Springer New York.