A Weird Function Learned by a Simple Rule


A proof of Problem 6.3 in (Devroye, Györfi, & Lugosi, 1997).

On page 92 of (Devroye, Györfi, & Lugosi, 1997), the authors give an example of a hard-to-learn function, as follows.

On the interval $[0,1]$, let the distribution spit out rational numbers and irrational numbers with equal probability. List all the rational numbers in $[0,1]$ as a sequence $q_1, q_2, \ldots$. The distribution over the irrational numbers is uniform, and the distribution over the rational numbers is given by a probability sequence $p_1, p_2, \ldots$, that is, $\mathbb{P}\{X = q_i \mid X \text{ is rational}\} = p_i$. The label of a rational number is $1$, and that of an irrational number is $-1$.

Since the labels are deterministic, the Bayes classifier achieves zero risk. In practice, we can never verify whether a given number is rational; nonetheless, we assume that we can read off arbitrarily many accurate digits of it. It is surprising that we can still find a consistent rule to learn this function, and the rule is very simple: it just assigns to the test point the label of its closest sample point, i.e., the 1-nearest-neighbor rule!
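
To make the setup concrete, here is a small simulation sketch (not from the book, and with some arbitrary choices): it assumes $p_i = 2^{-i}$ on an enumeration of the rationals by increasing denominator, represents rationals exactly as `Fraction`s, and lets uniform random floats stand in for the irrationals, since two independent uniform floats essentially never coincide, which is the only property the argument needs. The names `draw_pair` and `nn_label` are just for this sketch.

```python
import random
from fractions import Fraction

rng = random.Random(0)

# Enumerate the first 60 rationals in [0, 1] by increasing denominator:
# 0, 1, 1/2, 1/3, 2/3, 1/4, 3/4, ...
rats, d = [], 1
while len(rats) < 60:
    for a in range(d + 1):
        f = Fraction(a, d)
        if f not in rats:
            rats.append(f)
    d += 1
rats = rats[:60]

def draw_pair():
    """One (X, Y): with prob. 1/2 a rational q_i with P(i) = 2^{-i} (capped at 60)
    and label +1, otherwise a uniform float standing in for an irrational, label -1."""
    if rng.random() < 0.5:
        i = 1
        while rng.random() < 0.5 and i < len(rats):
            i += 1
        return rats[i - 1], +1
    return rng.random(), -1

def nn_label(x, sample):
    """1-nearest-neighbor rule: return the label of the closest sample point."""
    return min(sample, key=lambda xy: abs(xy[0] - x))[1]

for n in (50, 200, 1000):
    trials, errors = 300, 0
    for _ in range(trials):
        sample = [draw_pair() for _ in range(n)]
        x, y = draw_pair()
        if nn_label(x, sample) != y:
            errors += 1
    print(f"n = {n:5d}   empirical error = {errors / trials:.3f}")
```

The printed error rates should shrink as $n$ grows, which is exactly the consistency claim proved below.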

To show the consistency of the nearest neighbor rule, we only need to show that, given sufficiently many samples, the probability of making a mistake can be made arbitrarily small. The key of the proof is to note that, under the setup above, the rational numbers are very likely to repeat in the sample set. So when the test point is a rational number, it coincides with some sample point with high probability; when it is an irrational number, it is very likely to lie between two irrational sample points.

Assume we have samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ and a test point $X$ with corresponding label $Y$. In the first part, we assume $X$ is a rational number. Since each sample point equals $q_i$ with probability $p_i/2$,

$$\mathbb{P}\{X_j \neq X \ \text{for all} \ j \leq n \mid X \ \text{is rational}\} = \sum_{i=1}^{\infty} p_i \left(1 - \frac{p_i}{2}\right)^{n}.$$

It is not hard to see that this probability can be made arbitrarily small when $n$ is sufficiently large: each term tends to $0$ and is dominated by $p_i$, so the sum tends to $0$ by dominated convergence. It means that when $X$ is rational, with high probability the sample set contains a copy of $X$, so its closest neighbor is $X$ itself and carries the correct label.
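
To get a feel for how fast this happens, here is a quick numerical evaluation of the sum under the assumed choice $p_i = 2^{-i}$ (the proof only needs $\sum_i p_i = 1$); the decay is real but rather slow.

```python
# Evaluate sum_i p_i (1 - p_i/2)^n with p_i = 2^{-i}: the probability that a
# rational test point has no exact copy among the n sample points.
def miss_probability(n, terms=200):
    return sum(2.0**-i * (1.0 - 2.0**-i / 2.0) ** n for i in range(1, terms + 1))

for n in (10, 100, 1000, 10000, 100000):
    print(f"n = {n:6d}   miss probability ~ {miss_probability(n):.2e}")
```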

For the second part, we assume that $X$ is irrational. Denote the number of distinct rationals in the sample set by $R_n$. We want to show that, for any fixed $\varepsilon > 0$, with high probability $R_n$ will be less than $\varepsilon n$; this follows from Markov's inequality, since $\mathbb{E}[R_n] = \sum_{i} \left(1 - (1 - p_i/2)^n\right) = o(n)$ (each summand is at most $\min(1, n p_i/2)$).

On the other hand, denote the number of irrationals in the sample set by $I_n$. Since each sample point is irrational with probability $1/2$, we have $I_n \sim \mathrm{Binomial}(n, 1/2)$, and

$$\mathbb{P}\left\{I_n \leq \frac{n}{4}\right\} \leq e^{-n/8}$$

by Hoeffding’s inequality.
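
As a quick sanity check of this step (the threshold $n/4$ is just a convenient choice of mine), one can compare the exact binomial tail with the Hoeffding bound:

```python
import math

def binom_tail(n, k, p=0.5):
    """Exact P(Bin(n, p) <= k)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

# I_n ~ Bin(n, 1/2): exact left tail at n/4 versus the Hoeffding bound exp(-n/8).
for n in (40, 100, 200):
    print(n, binom_tail(n, n // 4), math.exp(-n / 8))
```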

Therefore, by the union bound we have

$$\mathbb{P}\left\{R_n < \varepsilon n \ \text{and} \ I_n > \frac{n}{4}\right\} \geq 1 - \mathbb{P}\{R_n \geq \varepsilon n\} - \mathbb{P}\left\{I_n \leq \frac{n}{4}\right\},$$

which is further greater than

$$1 - \frac{\mathbb{E}[R_n]}{\varepsilon n} - e^{-n/8} \longrightarrow 1 \quad \text{as } n \to \infty.$$

So, when $n$ is sufficiently large, with high probability $R_n < \varepsilon n$ and $I_n > n/4$, and in particular $R_n < 4\varepsilon I_n$. Now, conditioned on being irrational, $X$ is exchangeable with the $I_n$ irrational sample points, so its rank among them is uniform; hence the chance that $X$ sits next to a rational sample point (i.e., that some rational sample lies between $X$ and its nearest irrational sample points) is at most $2R_n/(I_n + 1) < 8\varepsilon$, which can be made arbitrarily small as $n$ grows. If both its left and right nearest sample points are irrational, the nearest neighbor is irrational and $X$ is labeled correctly.
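
This part can also be checked empirically. The sketch below (same assumptions as before: $p_i = 2^{-i}$, `Fraction`s for the rationals, uniform floats standing in for the irrationals) estimates how often an irrational test point gets a rational nearest neighbor and compares it with the $2R_n/(I_n+1)$ bound.

```python
import random
from fractions import Fraction

rng = random.Random(1)

# First 60 rationals in [0, 1], listed by increasing denominator.
rats, d = [], 1
while len(rats) < 60:
    for a in range(d + 1):
        f = Fraction(a, d)
        if f not in rats:
            rats.append(f)
    d += 1
rats = rats[:60]

def draw():
    """One X: a rational q_i (Fraction) with probability 2^{-i}/2,
    otherwise a uniform float standing in for an irrational."""
    if rng.random() < 0.5:
        i = 1
        while rng.random() < 0.5 and i < len(rats):
            i += 1
        return rats[i - 1]
    return rng.random()

n, trials = 2000, 200
bad, avg_bound = 0, 0.0
for _ in range(trials):
    sample = [draw() for _ in range(n)]
    R_n = len({x for x in sample if isinstance(x, Fraction)})    # distinct rationals
    I_n = sum(1 for x in sample if not isinstance(x, Fraction))  # irrational samples
    avg_bound += 2 * R_n / (I_n + 1)
    x = rng.random()  # test point conditioned on being irrational
    nearest = min(sample, key=lambda s: abs(s - x))
    bad += isinstance(nearest, Fraction)  # the only way the 1-NN label can be wrong here

print("frequency of a rational nearest neighbor:", bad / trials)
print("average bound 2*R_n/(I_n + 1):          ", avg_bound / trials)
```

With these choices only about a dozen distinct rationals typically appear in a sample of size 2000, so both printed numbers are small, and the empirical frequency should stay below the bound.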

Combining the two parts, we conclude that, as $n \to \infty$, the probability of labeling the test point correctly approaches $1$. Clearly, this is only a consistency result rather than a learning rate. In general, there exist learning rules that are consistent for every problem, but there exists no learning rule with a universal learning rate.

Reference

  1. Devroye, L., Györfi, L., & Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition. Springer New York.