$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$

**Nut graf**: OptNet is a neural network architecture that includes
embedded within it *quadratic programs*. The paper’s main contributions
are: (1) an architecture that contains as layers (the solutions to)
quadratic programs; (2) a method to differentiate through quadratic programs;
(3) a method that reuses a matrix factorization used in solving a quadratic
program to render insignificant the cost of computing gradients for said
program; and (4) an efficient GPU-accelerated implementation of a primal-dual
QP solver. The authors acknowledge that training OptNets can be quite
challenging (many parameter changes amount to reductions between problems),
but they provide experimental evidence, by learning to play mini-Sudoku,
for their claim that OptNet may be better suited than traditional neural
networks to learning hard constraints.
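
Concretely, an OptNet layer maps the previous layer's activations $z_i$ to the solution of a quadratic program whose parameters may depend on $z_i$ (following the paper's formulation):

$$ z_{i+1} = \argmin{z}{\; \tfrac{1}{2} z^T Q(z_i) z + q(z_i)^T z} \quad \text{subject to} \quad A(z_i) z = b(z_i), \;\; G(z_i) z \leq h(z_i). $$

Learning the layer means learning (the functions producing) $Q$, $q$, $A$, $b$, $G$, and $h$, which requires gradients of the loss with respect to each of them.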

The method differentiates through the KKT optimality conditions of the quadratic program. This is well-defined because the solution of a quadratic program is subdifferentiable almost everywhere. Backpropagation reuses the LU factorization of the (symmetrized) KKT matrix at the solution, which was already computed during the forward solve, so the gradient computation costs only quadratically many additional operations.
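To make the factorization-reuse idea concrete, here is a minimal NumPy/SciPy sketch for the equality-constrained special case (no inequalities, so the KKT conditions reduce to one linear system). The forward pass factorizes the KKT matrix once; the backward pass reuses that factorization to recover gradients with respect to $q$ and $b$. This is an illustration of the mechanism only, not the paper's batched, GPU-accelerated primal-dual implementation; the function names are my own.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def qp_layer_forward(Q, q, A, b):
    """Solve min_z (1/2) z^T Q z + q^T z  s.t.  A z = b via its KKT system.

    Returns the primal solution z, the dual variable nu, and the LU
    factors of the (symmetric) KKT matrix, which backward will reuse.
    """
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    lu = lu_factor(K)                 # O(n^3) factorization, done once
    sol = lu_solve(lu, np.concatenate([-q, b]))
    return sol[:n], sol[n:], lu

def qp_layer_backward(lu, dl_dz, n):
    """Backpropagate dl/dz through the QP, reusing the forward LU factors.

    Each solve against the already-factorized KKT matrix is only O(n^2).
    """
    rhs = np.concatenate([dl_dz, np.zeros(lu[0].shape[0] - n)])
    d = -lu_solve(lu, rhs)            # implicit differentiation of KKT
    d_z, d_nu = d[:n], d[n:]
    return d_z, -d_nu                 # = dl/dq, dl/db

# Tiny example: min (1/2)||z||^2 + q^T z  s.t.  z1 + z2 = 1.
Q = np.eye(2)
q = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
z, nu, lu = qp_layer_forward(Q, q, A, b)           # z = [0.5, 0.5]
grad_q, grad_b = qp_layer_backward(lu, np.array([1.0, 0.0]), 2)  # loss = z[0]
```

Gradients for $Q$ and $A$ follow from the same solve (outer products of `d_z`, `d_nu` with `z`, `nu`); handling inequality constraints additionally requires the complementarity conditions, as in the paper.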

- Solving a QP incurs cubic complexity per layer; vanilla feedforward layers incur only quadratic complexity.
- Training is quite hard and requires significant tuning.

- Optimization layers will not be practical in, e.g., Google-scale language / vision networks.
- Nonetheless the idea of automatically learning the parameters of an optimization problem is extremely interesting, for two reasons. (1) Optimization problems can encode hard constraints and (2) optimization problems can be quite interpretable.
- Differentiating through optimization problems provides a compromise between end-to-end learning and human expertise by providing a mechanism to encode whatever knowledge the modeler may have about their problem in the form of a mathematical program — the learning algorithm can infer the rest.