Train Faster, Generalize Better: Stability of Stochastic Gradient Descent (Hardt, Recht & Singer 2016)

Nut graf: fast training prevents overfitting, and the paper gives a mathematical way to state this via algorithmic stability.


Recall a definition: Let $S = (z_1, \dots, z_n)$ be the training data, and assume both $S$ and the test data are generated by i.i.d. draws from the same distribution $\mathcal{D}$. Let $R_Z[w] = \frac{1}{|Z|} \sum_{z \in Z} f(w; z)$ be the average error incurred by a model parametrized by $w$ on a set of samples $Z$, where each sample is drawn from $\mathcal{D}$ and $f(w; z)$ is the loss on a single sample $z$. The generalization error is defined as

$$\epsilon_{\mathrm{gen}}(w) \;=\; \mathbb{E}_{z \sim \mathcal{D}}\big[f(w; z)\big] \;-\; R_S[w].$$
The former term is the population risk and the latter term is the empirical risk. If the weights $w = A(S)$ are the output of a randomized algorithm $A$ run on the sample $S$, then the expected generalization error is defined as the expectation of the generalization error, taken over the randomness in $A$ and $S$. And $A$ is $\epsilon$-uniformly stable if, for all samples $S$ and $S'$ that differ in at most one example,

$$\sup_{z} \; \mathbb{E}_{A}\big[f(A(S); z) - f(A(S'); z)\big] \;\le\; \epsilon,$$

where $f(A(S); z)$ is the loss function evaluated at example $z$ for the model $A(S)$. Key idea: such an algorithm has expected generalization error bounded by $\epsilon$.
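To make the definition concrete, here is a minimal sketch (my own, not from the paper): it runs SGD with a shared sampling seed on two datasets S and S' that differ in one example, then measures the worst loss gap over a batch of test points as a crude stand-in for the supremum and expectation in the definition. The logistic loss and all constants are illustrative assumptions.

```python
import numpy as np

def loss(w, x, y):
    # Logistic loss f(w; z) for a single example z = (x, y), y in {-1, +1}.
    return np.logaddexp(0.0, -y * (x @ w))

def grad(w, x, y):
    # Gradient of the logistic loss with respect to w.
    return -y * x / (1.0 + np.exp(y * (x @ w)))

def sgd(X, y, steps=500, lr=0.1, seed=0):
    # Run SGD with a fixed sampling seed so that runs on S and S'
    # differ only through the perturbed example, not the sample path.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))
        w -= lr * grad(w, X[i], y[i])
    return w

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

# S' differs from S in exactly one example (index 0).
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.normal(size=d), rng.choice([-1.0, 1.0])

w_S, w_S2 = sgd(X, y), sgd(X2, y2)

# Empirical proxy for sup_z |f(A(S); z) - f(A(S'); z)|,
# taken over a batch of fresh test points.
Z = rng.normal(size=(1000, d))
labels = rng.choice([-1.0, 1.0], size=1000)
gap = max(abs(loss(w_S, x, yy) - loss(w_S2, x, yy)) for x, yy in zip(Z, labels))
print("max loss gap over test points:", gap)
```

Shrinking the number of steps or the learning rate in this sketch shrinks the gap, which is exactly the qualitative behavior the paper's bounds predict.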


The key results are stability bounds for the application of SGD to $\beta$-smooth, $L$-Lipschitz, convex functions and to certain classes of nonconvex functions. The results for the former require each step size $\alpha_t$ to be bounded by $2/\beta$. The results for the latter require that the function take values in $[0, 1]$, be $L$-Lipschitz and $\beta$-smooth, and that the step sizes decay as $\alpha_t \le c/t$, inversely proportional to the step number.
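For concreteness, the convex-case bound has a simple closed form. This is my reconstruction of the theorem from the quantities above, so take the exact constants with a grain of salt:

```latex
% Uniform stability of SGD run for T steps on convex, L-Lipschitz,
% beta-smooth losses with step sizes alpha_t <= 2/beta
% (reconstructed from memory of the paper; constants may be off):
\[
  \epsilon_{\mathrm{stab}}
  \;\le\;
  \frac{2 L^{2}}{n} \sum_{t=1}^{T} \alpha_t .
\]
```

With a constant step size $\alpha$ this grows like $2L^2 \alpha T / n$: linearly in the number of steps and inversely in the sample size, which is the formal sense in which training for fewer steps generalizes better. The nonconvex bound, as I recall it, is similar in spirit but grows only sublinearly in $T$, roughly like $T^{\beta c/(\beta c + 1)}/n$.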

The upshots are cool: regularization, gradient clipping, dropout, proximal steps, and model averaging all improve the authors’ bounds.
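As a rough illustration of how some of those knobs show up in practice, here is a sketch (my own, not the authors' code) of a single stability-friendly SGD update combining a $c/t$ step-size schedule, gradient clipping, and L2 regularization; the parameter names and defaults are invented for the example.

```python
import numpy as np

def stable_sgd_step(w, g, t, c=0.5, clip=1.0, weight_decay=1e-4):
    """One SGD update (step counter t >= 1) with stability-friendly tweaks.

    c            -- constant in the decaying schedule alpha_t = c / t
    clip         -- gradient-norm cap, which bounds the effective Lipschitz constant
    weight_decay -- L2 regularization strength
    """
    alpha_t = c / t                              # decaying step size
    norm = np.linalg.norm(g)
    if norm > clip:                              # gradient clipping
        g = g * (clip / norm)
    return w - alpha_t * (g + weight_decay * w)  # regularized update
```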