On Inverse Dependence On Training Data Size


This is a note on the paper "SVM Optimization: Inverse Dependence on Training Set Size" (Shalev-Shwartz & Srebro, 2008).

Summary of Key Ideas

  • With fixed generalization accuracy, gaining more data should not increase the running time, because one can always ignore the extra data during training.
  • Bottou and Lin provided empirical evidence for the claim that solving the dual problem of the SVM costs at least $\Omega(m^2)$, where $m$ is the sample size; the dependence on $m$ could even grow to $m^3$ when the regularization parameter is very small.
  • PEGASOS, proposed by Shalev-Shwartz et al., costs $\tilde{O}\left(\frac{d}{\lambda \epsilon_{\text{acc}}}\right)$ to find an $\epsilon_{\text{acc}}$-accurate solution of the regularized objective, independently of the sample size (a minimal sketch of its update rule follows this list);
  • And finally, this paper claims that when PEGASOS is used to achieve $\epsilon$ generalization error, the number of iterations $T$ it needs decreases as the sample size increases.
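
For concreteness, here is a minimal sketch of the PEGASOS update described in (Shalev-Shwartz, Singer, Srebro, & Cotter, 2011): at each iteration, sample one training example uniformly with replacement and take a subgradient step with step size $1/(\lambda t)$. The function name and NumPy details are my own illustration, not code from the paper.

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Stochastic subgradient descent on (lam/2)||w||^2 + mean hinge loss.

    X: (m, d) data matrix, y: labels in {-1, +1},
    lam: regularization parameter, T: number of iterations.
    Uses one uniformly sampled example per iteration.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)            # uniform sampling with replacement
        eta = 1.0 / (lam * t)          # step size schedule 1/(lam * t)
        margin = y[i] * (w @ X[i])     # margin under the current iterate
        w *= 1.0 - eta * lam           # gradient step on the regularizer
        if margin < 1.0:               # hinge loss active on example i
            w += eta * y[i] * X[i]
        # optional: project back onto the ball of radius 1/sqrt(lam)
        r = 1.0 / np.sqrt(lam)
        n = np.linalg.norm(w)
        if n > r:
            w *= r / n
    return w
```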

Proof Sketch

The analysis is simple. First, denote the solution learned by PEGASOS by $\hat{w}$, write $\ell(w)$ for the expected hinge loss of $w$, and let $f(w) = \frac{\lambda}{2}\|w\|^2 + \ell(w)$ be the regularized expected risk, with $\hat{f}$ its empirical counterpart on the $m$ training samples, $w^\ast = \arg\min_w f(w)$, and $\hat{w}^\ast = \arg\min_w \hat{f}(w)$. Then the regularized expected risk of $\hat{w}$ is equal to the sum of the following 3 terms:

$$f(\hat{w}) = f(w_0) + \left(f(\hat{w}) - f(w^\ast)\right) + \left(f(w^\ast) - f(w_0)\right),$$

where $w_0$ is a function in the hypothesis class that has a small expected risk. In our analysis of the learning rate of the SVM, $w_0$ is the function constructed to approximate the performance of the Bayes classifier. How small $\ell(w_0)$ can be is a question about the approximation error of the hypothesis class, which is not considered in this paper.

The second difference is at most 0 by the definition of $w^\ast$. So we only need to give an upper bound on the first difference. By the fast-rates result for strongly convex regularized objectives that the paper relies on, with probability at least $1 - \delta$ it turns out to be bounded by

$$f(\hat{w}) - f(w^\ast) \le 2\left(\hat{f}(\hat{w}) - \hat{f}(\hat{w}^\ast)\right) + O\!\left(\frac{\log(1/\delta)}{\lambda m}\right).$$

The difference in the brackets is exactly the suboptimality of $\hat{w}$ on the empirical objective, so it can be further bounded by the accuracy $\epsilon_{\text{acc}}$ of the optimizer:

$$\hat{f}(\hat{w}) - \hat{f}(\hat{w}^\ast) \le \epsilon_{\text{acc}}.$$

And in this way we successfully bound the estimation error by the optimization error. For PEGASOS, this error is upper bounded by $\tilde{O}\!\left(\frac{1}{\lambda T}\right)$ with $T$ iterations. Since $\ell(\hat{w}) \le f(\hat{w})$ and $f(w_0) = \ell(w_0) + \frac{\lambda}{2}\|w_0\|^2$, the final upper bound for the generalization error is

$$\ell(\hat{w}) \le \ell(w_0) + \frac{\lambda}{2}\|w_0\|^2 + \tilde{O}\!\left(\frac{1}{\lambda T}\right) + O\!\left(\frac{\log(1/\delta)}{\lambda m}\right).$$

And by choosing $\lambda$ to balance the $\lambda$-dependent terms above, i.e. (up to logarithmic factors)

$$\lambda = \Theta\!\left(\frac{1}{\|w_0\|}\sqrt{\frac{1}{T} + \frac{1}{m}}\right),$$

we get the upper bound

$$\ell(\hat{w}) \le \ell(w_0) + \tilde{O}\!\left(\|w_0\|\sqrt{\frac{1}{T} + \frac{1}{m}}\right).$$

From the form of the upper bound, it is obvious that when $m$ is larger, $T$ can be smaller to guarantee the same level of generalization performance.
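
To fill in the balancing step (a standard calculation, not spelled out above): the bound has the form $\ell(w_0) + a\lambda + b/\lambda$ with $a = \|w_0\|^2/2$ and $b = \tilde{O}(1/T) + O(\log(1/\delta)/m)$, and minimizing over $\lambda > 0$ gives

$$\frac{d}{d\lambda}\left(a\lambda + \frac{b}{\lambda}\right) = a - \frac{b}{\lambda^2} = 0 \;\Longrightarrow\; \lambda^\ast = \sqrt{\frac{b}{a}}, \qquad a\lambda^\ast + \frac{b}{\lambda^\ast} = 2\sqrt{ab},$$

which is exactly the choice of $\lambda$ and the $\tilde{O}\big(\|w_0\|\sqrt{1/T + 1/m}\big)$ term above.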

Some Comments

  • The regularization parameter $\lambda$ is not a free constant: it is determined by the target generalization error $\epsilon$ (with the balanced choice above, achieving excess error $\epsilon$ corresponds to $\lambda = \Theta(\epsilon/\|w_0\|^2)$).
  • PEGASOS subsamples the training dataset with replacement at each iteration. When the size of the training dataset increases, the algorithm has a better chance of drawing a new sample point instead of an old one at each iteration. Informally, we can imagine that the amount of information provided by $T$ subsamples drawn from $m$ samples can equal that provided by $T' > T$ subsamples drawn from a smaller pool of $m' < m$ samples.
  • Clearly, if the samples in the training dataset are generated i.i.d. from the true distribution, subsampling is not necessary. In other words, if we iterate over all the samples in the training dataset in a cyclic way, we probably won't have such an inverse dependence. Actually, in the paper proposing PEGASOS (Shalev-Shwartz, Singer, Srebro, & Cotter, 2011), the authors report that "the sampling without replacement procedures outperform significantly the uniform i.i.d. sampling procedure." However, there is no theoretical analysis of the sampling-without-replacement procedures.
  • When the training dataset contains infinitely many samples, or the samples used at each iteration are generated directly from the true distribution independently, the $1/m$ term vanishes and the upper bound is $\ell(\hat{w}) \le \ell(w_0) + \tilde{O}\big(\|w_0\|/\sqrt{T}\big)$ (see the toy calculation after this list).
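
As a toy illustration of the inverse dependence (my own arithmetic on the simplified bound $\|w_0\|\sqrt{1/T + 1/m} \le \epsilon$, with all constants and logarithmic factors dropped), solving for the smallest admissible $T$ shows it decreasing toward the infinite-data value $\|w_0\|^2/\epsilon^2$ as $m$ grows:

```python
import math

def iterations_needed(m, eps, w0_norm=1.0):
    """Smallest T with w0_norm * sqrt(1/T + 1/m) <= eps, i.e.
    T >= 1 / (eps^2 / w0_norm^2 - 1/m). Constants and log factors
    are dropped, so the numbers are only illustrative."""
    slack = (eps / w0_norm) ** 2 - 1.0 / m
    if slack <= 0:
        return math.inf  # m too small: the bound cannot reach eps
    return math.ceil(1.0 / slack)

for m in [150, 200, 400, 1000, 10**6]:
    print(m, iterations_needed(m, eps=0.1))
# T shrinks from 300 down toward 1/eps^2 = 100 as m grows
```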

References

  1. Shalev-Shwartz, S., & Srebro, N. (2008). SVM Optimization: Inverse Dependence on Training Set Size. In Proceedings of the 25th International Conference on Machine Learning (pp. 928–935). Helsinki, Finland: ACM.
  2. Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127, 3–30.
