$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$

This paper introduces a family of feedforward neural networks in which
each activation function is represented as the argmin of a convex optimization
problem; these representations are then folded into the network's training
problem as penalty terms. The key to arriving at such a representation
is to *lift* the standard neural network optimization problem into a
higher-dimensional space by

- introducing, for each layer, a variable representing that layer's output,
- representing each activation function as the argmin of a divergence function that is convex in each argument (but not necessarily jointly convex in both arguments), and
- coercing the divergences to be small via penalization.
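
As a concrete instance of the second step, the ReLU can be written as the solution of a small convex problem:

$$ \max(0, u) = \underset{v \ge 0}{\operatorname{argmin}} \; \|v - u\|_2^2 $$

The sketch below checks this identity numerically. The function names, the choice of projected gradient descent as the solver, and the step size are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def relu(u):
    """Standard ReLU, used as the reference answer."""
    return np.maximum(u, 0.0)

def relu_divergence(v, u):
    """Squared-distance divergence ||v - u||^2, to be minimized over v >= 0."""
    return float(np.sum((v - u) ** 2))

def argmin_by_projected_gradient(u, steps=200, lr=0.25):
    """Minimize ||v - u||^2 over v >= 0 (hypothetical helper, illustrative solver)."""
    v = np.zeros_like(u)
    for _ in range(steps):
        grad = 2.0 * (v - u)                # gradient of the divergence in v
        v = np.maximum(v - lr * grad, 0.0)  # project back onto v >= 0
    return v

u = np.array([-1.5, 0.0, 0.3, 2.0])
v_star = argmin_by_projected_gradient(u)
print(v_star, relu(u))
```

The constrained minimizer agrees with `relu(u)` entry by entry, which is what lets the activation be replaced by a penalty on `relu_divergence` in the training problem.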

El Ghaoui and his co-authors refer to neural networks that have been rewritten in this way as “lifted” neural networks.

The upshot: Any lifted neural network can be optimized in a block-coordinate, gradient-free fashion using well-known algorithms for convex optimization, and, after training, the values of its optimization variables can be used as initialization for the corresponding “unlifted” neural network.
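
To make the block-coordinate idea concrete, here is a minimal sketch for a network with one hidden ReLU layer, assuming a squared loss, a squared-distance divergence, and ridge regularization on the weights. The objective, the update order, and every name and hyperparameter below (`lam`, `rho`, `train_lifted`, and so on) are illustrative assumptions rather than the paper's exact formulation. Each weight block has a closed-form ridge solution, and the hidden-variable block is a nonnegative least-squares problem, solved here with a few projected-gradient steps for brevity (any convex solver would do). No gradient is ever propagated through the network.

```python
import numpy as np

def lifted_objective(W0, W1, X0, X1, Y, lam, rho):
    """Squared loss + lam * divergence penalty + rho * ridge terms (assumed form)."""
    fit = np.sum((Y - W1 @ X1) ** 2)
    penalty = lam * np.sum((X1 - W0 @ X0) ** 2)
    reg = rho * (np.sum(W0 ** 2) + np.sum(W1 ** 2))
    return float(fit + penalty + reg)

def ridge_solve(targets, inputs, reg):
    """Return argmin_W ||targets - W @ inputs||_F^2 + reg * ||W||_F^2."""
    gram = inputs @ inputs.T + reg * np.eye(inputs.shape[0])
    return np.linalg.solve(gram, inputs @ targets.T).T

def update_hidden(W0, W1, X0, X1, Y, lam, inner_steps=20):
    """Projected-gradient steps on the convex subproblem in X1 >= 0."""
    target = W0 @ X0
    lipschitz = 2.0 * (np.linalg.norm(W1, 2) ** 2 + lam)
    for _ in range(inner_steps):
        grad = 2.0 * W1.T @ (W1 @ X1 - Y) + 2.0 * lam * (X1 - target)
        X1 = np.maximum(X1 - grad / lipschitz, 0.0)
    return X1

def train_lifted(X0, Y, hidden=8, lam=1.0, rho=1e-2, sweeps=30, seed=0):
    """Cycle through the blocks W1, X1, W0; columns of X0 and Y are samples."""
    rng = np.random.default_rng(seed)
    W0 = 0.1 * rng.standard_normal((hidden, X0.shape[0]))
    W1 = 0.1 * rng.standard_normal((Y.shape[0], hidden))
    X1 = np.maximum(W0 @ X0, 0.0)  # feasible starting point for the lifted variable
    history = [lifted_objective(W0, W1, X0, X1, Y, lam, rho)]
    for _ in range(sweeps):
        W1 = ridge_solve(Y, X1, rho)                  # output-weight block
        X1 = update_hidden(W0, W1, X0, X1, Y, lam)    # hidden-variable block
        W0 = ridge_solve(X1, X0, rho / lam)           # input-weight block
        history.append(lifted_objective(W0, W1, X0, X1, Y, lam, rho))
    return W0, W1, history

# Synthetic regression data, made up for the example.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((5, 200))
Y = np.sin(X0[:1]) + 0.5 * X0[1:2] ** 2
W0, W1, history = train_lifted(X0, Y)
print(history[0], history[-1])
```

Because every block update either solves its subproblem exactly or takes a descent step on it, the lifted objective should not increase from one sweep to the next. The returned `W0` and `W1` can then seed the ordinary forward pass `W1 @ np.maximum(W0 @ X0, 0.0)` of the unlifted network.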

This paper can be understood as part of an ongoing research effort to
simplify the mathematical structure of neural networks and to express them
in principled ways that are more amenable to training^{1}^{2}.

The experiments in this paper are not particularly compelling. MNIST is used as a benchmark, and the results aren’t great. What’s more, training time is not reported; only the number of epochs is. The main contribution of this paper is its theoretical framework.