$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$

TensorFlow uses a dataflow graph to represent both computation and state. Nodes represent computation upon mutable data, and edges carry tensors, or multi-dimensional arrays, between nodes. The system uses synchronous replication successfully, contradicting the folklore that asynchronous replication is needed for scalability.

DistBelief, the prequel to TensorFlow, was limited by its parameter server architecture; at times it is desirable to offload computation onto the server that owns the data. As such, TensorFlow eschews the separation of workers and parameter servers in favor of a hybrid model. The key design principles of TensorFlow are

- graphs are composed of primitive operators,
- execution is deferred, and
- a common abstraction supports any accelerator that implements its interface.

Note that a batch dataflow model, which favors large batches of computation and requires immutable inputs and deterministic computation, is not appropriate if stochastic gradient descent is your optimization method of choice. TensorFlow allows for mutable state ‘‘that can be shared between different executions of the graph’’ and ‘‘concurrect executions or overlapping subgraphs.’’

A **tensor** is a multi-dimensional array that stores primitive types; one
of those primitive types is a string, which holds arbitrary binary data.
All tensors are dense in order to ensure that memory allocation and
serialization can be implemented efficiently. Sparse vectors can be encoded
as either variable-length string elements or tuples of dense tensors. The
shape of a tensor can vary along its dimensions.

**Stateless operations** map one list of tensors to another list of
tensors. The simplest way to think of such an operation is as a mathematical
function.

A **variable** is a stateful operation. Each variable owns a mutable
buffer that, for example, holds the model parameters as it is trained.
Variables take no inputs. They instead expose a `read`

operation and various
write operations. An example of a write operation is `AssignAdd`

, which
is semantically equivalent to the familiar ‘‘plus-equals.’’

Queues allow for concurrent access to the tensors that they hold. They can provide backpressure when they are full and are used to implement streaming computation between subgraphs.

Every operation is placed on a *device*, and each device assembles its
operations into a subgraph. TensorFlow is ‘‘optimized for executing large
subgraphs repeatedly with low latency.’’

Conditional statements and other control flow primitives are supported.

Users can handroll their gradients if they so desire, or they can rely on the automatic differentiation. It is simple to implement new optimization algorithms using the TensorFlow framework; no code changes are required.

Both asynchronous and synchronous SGD are supported; the latter converges to a good solution faster than the former (both in practice and, I assume, in theory). In the synchronous scheme, firing up redundant workers and taking the updates from those who finish first improves throughput by up to 10 percent.

- The authors list improved models, larger datasets, and software platforms that allow for better use of hardware as the primary drivers of the machine learning renaissance, in that order. I am surprised that improved models is first in that list.
- I like the decision to encode sparse data as assemblages of dense tensors. But does this representation allow for the underlying linear algebra to make use of sparse matrix structure?
- An interesting problem that remains unsolved (at Google, at the time of writing the paper): How can we accommodate dynamic dataflow graphs for tasks like deep reinforcement learning?
- I think it would be useful if TensorFlow & researchers were together working on the problem of intelligent initialization for nonconvex problems.