$$
\newcommand{\pmi}{\operatorname{pmi}}
\newcommand{\inner}[2]{\langle{#1}, {#2}\rangle}
\newcommand{\Pb}{\operatorname{Pr}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\RR}{\mathbf{R}}
\newcommand{\script}[1]{\mathcal{#1}}
\newcommand{\Set}[2]{\{{#1} : {#2}\}}
\newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}}
\newcommand{\optmin}[3]{
\begin{align*}
& \underset{#1}{\text{minimize}} & & #2 \\
& \text{subject to} & & #3
\end{align*}
}
\newcommand{\optmax}[3]{
\begin{align*}
& \underset{#1}{\text{maximize}} & & #2 \\
& \text{subject to} & & #3
\end{align*}
}
\newcommand{\optfind}[2]{
\begin{align*}
& {\text{find}} & & #1 \\
& \text{subject to} & & #2
\end{align*}
}
$$

Nut graf: fast training prevents overfitting, and uniform stability gives a
mathematical way to state this.

## Background

Recall a few definitions: Let $S = (z_1, \dots, z_n)$ be the training data,
and assume both $S$ and the test data are generated by i.i.d. draws
from the same distribution $\script{D}$. Let
$$R_S[w] = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i)$$
be the error incurred by a model parametrized by $w$ on a set
of samples $S$, where each sample $z_i$ is drawn from $\script{D}$.
The *generalization error* is defined as

$$\epsilon_{\text{gen}}(w) = \E_{z \sim \script{D}}[f(w; z)] - \frac{1}{n} \sum_{i=1}^{n} f(w; z_i).$$

The former term is the population risk and the latter term is the empirical
risk. If the weights $w = A(S)$ for a sample $S$ and a randomized algorithm $A$,
then the expected generalization error is defined as the expectation
of the generalization error, taken over the randomness in $A$ and $S$.
And if $S$ and $S'$ are two samples that differ in at most one example,
then $A$ is *$\epsilon$-uniformly stable* if

$$\sup_{z} \E\left[f(A(S); z) - f(A(S'); z)\right] \le \epsilon,$$

where $f(w; z)$ is the loss function evaluated at the example $z$.
*Key idea: such an algorithm has expected generalization error bounded by $\epsilon$.*
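As a quick sanity check on the neighboring-sample setup (my own toy example,
not from the paper), the sketch below runs SGD with shared randomness on two
datasets that differ in exactly one example and measures how far the resulting
weights drift apart; small drift is exactly the stability the definition asks for:

```python
import numpy as np

def sgd(data, steps=500, lr=0.01, seed=0):
    """Run SGD on the least-squares loss f(w; (x, y)) = 0.5 * (w @ x - y)**2."""
    rng = np.random.default_rng(seed)
    X, y = data
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                  # same index sequence for both runs
        grad = (w @ X[i] - y[i]) * X[i]      # gradient of the sampled example
        w -= lr * grad
    return w

rng = np.random.default_rng(42)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Neighboring dataset S': identical to S except for one replaced example.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = rng.normal()

w1 = sgd((X, y), seed=7)
w2 = sgd((X2, y2), seed=7)   # identical randomness, one example swapped
print(np.linalg.norm(w1 - w2))  # small divergence relative to ||w1||
```

With a convex loss and a small step size the two trajectories stay close, which
is the mechanism behind the paper's convex stability bounds.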

## Results

The key results are stability bounds for the application of SGD to
$\beta$-smooth, $L$-Lipschitz, convex functions and certain classes
of nonconvex functions. The results for the former require each step size
$\alpha_t$ to be bounded by $2/\beta$. The results for the latter require
that the function take values in $[0, 1]$, be $L$-Lipschitz and $\beta$-smooth,
and that the sequence of step sizes $\alpha_t \le c/t$ be inversely related to
the step number.

The upshots are cool: regularization, gradient clipping, dropout, proximal
steps, and model averaging all improve the authors’ bounds.

- Are machine learning loss functions often Lipschitz and smooth? Is this too
  stringent an assumption?
- Again, are the assumptions made for the nonconvex case reasonable?
- It’s cool that the bounds match empirical findings.