# Train Faster, Generalize Better: Stability of Stochastic Gradient Descent (Hardt et al., 2016)

Nut graf: fast training prevents overfitting, and the paper gives a mathematical way to state this.

## Background

Recall a $\textbf{definition}$: Let $\mathcal{D}$ be the training data, and assume both $\mathcal{D}$ and the test data are generated by draws from the same distribution $X$. Let $e(w; x)$ be the error incurred by a model parametrized by $w$ on a sample $x$ drawn from $X$. The $\textbf{generalization error}$ is defined as

$$\epsilon_{\text{gen}}(w) \;=\; \mathbb{E}_{x \sim X}\big[e(w; x)\big] \;-\; \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} e(w; x).$$

The former term is the population risk and the latter term is the empirical risk. If the weights are $w = A(S)$, where $S$ is a training set and $A$ is a (possibly randomized) learning algorithm, then the expected generalization error is the expectation of the generalization error, taken over the randomness in $S$ and $A$. If, for every pair of training sets $S$ and $S'$ that differ in at most one example,

$$\sup_{z} \; \mathbb{E}_A\big[f(A(S); z) - f(A(S'); z)\big] \;\le\; \epsilon,$$

then $A$ is $\epsilon$-$\textbf{uniformly stable}$, where $f(w; z)$ is the loss of the model $w$ on the example $z$. Key idea: such an algorithm has expected generalization error bounded by $\epsilon$.
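
The key idea follows from a short renaming argument (Theorem 2.2 in the paper); a sketch, writing $R[w]$ and $R_S[w]$ for the population and empirical risks, $z_1, \dots, z_n$ for the examples in $S$, and $S^{(i)}$ for $S$ with $z_i$ swapped for an independent draw $z_i'$:

```latex
\mathbb{E}_{S,A}\big[R_S[A(S)] - R[A(S)]\big]
  \;=\; \mathbb{E}_{S,S',A}\Big[\frac{1}{n}\sum_{i=1}^{n}
      \big(f(A(S^{(i)}); z_i') - f(A(S); z_i')\big)\Big]
  \;\le\; \epsilon
```

The equality uses exchangeability of the samples; since $S$ and $S^{(i)}$ differ in exactly one example, each summand is at most $\epsilon$ in expectation by uniform stability.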

## Results

The key results are stability bounds for the application of SGD to $\beta$-smooth, $L$-Lipschitz, convex functions and certain classes of nonconvex functions. The results for the former require each step size to be bounded by $\approx 2/\beta$. The results for the latter require that the loss take values in $[0, 1]$ and be $L$-Lipschitz and $\beta$-smooth, and that the step sizes decay inversely with the step number ($\alpha_t \lesssim c/t$).
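
A minimal sketch (not the paper's experiment) of what stability means operationally: run SGD with the same sample sequence on two datasets that differ in one example, and measure how far the weights diverge. The setup below is my own illustration, using the logistic loss, which is $1$-Lipschitz in $w$ once the features are normalized to $\|x\| \le 1$, and $\beta$-smooth with $\beta \le 1/4$, so a step size of $0.1 \le 2/\beta$ satisfies the convex-case condition.

```python
import numpy as np

# Synthetic data; normalize rows so the logistic loss is 1-Lipschitz in w.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

def sgd(X, y, steps=2000, lr=0.1, seed=1):
    """SGD on the logistic loss f(w; (x, y)) = log(1 + exp(-y * x.w)),
    with a fixed (seeded) sample sequence so two runs are comparable."""
    w = np.zeros(X.shape[1])
    idx = np.random.default_rng(seed).integers(0, len(y), size=steps)
    for i in idx:
        grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
        w -= lr * grad
    return w

# Neighboring dataset S': flip a single label.
y_prime = y.copy()
y_prime[0] = -y_prime[0]

w1 = sgd(X, y)
w2 = sgd(X, y_prime)
divergence = np.linalg.norm(w1 - w2)
# Since |f(w; z) - f(w'; z)| <= L * ||w - w'|| with L = 1 here, a small
# divergence directly witnesses stability of the loss values.
print(divergence)
```

Because the gradient update is non-expansive for a convex $\beta$-smooth loss when $\alpha \le 2/\beta$, the divergence only grows on the (roughly $T/n$) steps that touch the differing example, which is the mechanism behind the paper's $O(\sum_t \alpha_t L / n)$ convex bound.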

The upshots are cool: regularization, gradient clipping, dropout, proximal steps, and model averaging all improve the authors’ bounds.

## Commentary

- Are machine learning loss functions often Lipschitz and smooth? Is this too stringent an assumption?
- Again, are the assumptions made for the nonconvex case reasonable?
- It’s cool that the bounds match empirical findings.