$$
\newcommand{\pmi}{\operatorname{pmi}}
\newcommand{\inner}[2]{\langle{#1}, {#2}\rangle}
\newcommand{\Pb}{\operatorname{Pr}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}}
\newcommand{\optmin}[3]{
\begin{align*}
& \underset{#1}{\text{minimize}} & & #2 \\
& \text{subject to} & & #3
\end{align*}
}
\newcommand{\optmax}[3]{
\begin{align*}
& \underset{#1}{\text{maximize}} & & #2 \\
& \text{subject to} & & #3
\end{align*}
}
\newcommand{\optfind}[2]{
\begin{align*}
& {\text{find}} & & #1 \\
& \text{subject to} & & #2
\end{align*}
}
$$

## Background

There are two popular classes of word embedding techniques:

1. Compute a low-rank approximation of a re-weighted co-occurrence matrix
(i.e., PCA via SVD).
2. Use a neural network language model’s representation of a word (e.g.,
word2vec, GloVe).

A re-weighting scheme used in (1) is to replace the co-occurrence statistics
with the *pointwise mutual information* between two words.

The pointwise mutual information (PMI) of a pair of outcomes $x$, $y$ of
discrete random variables $X$, $Y$ measures the extent to
which their joint distribution differs from the product of the marginal
distributions:

$$\pmi(x, y) = \log \frac{\Pb(x, y)}{\Pb(x) \Pb(y)} = \log \frac{\Pb(x \mid y)}{\Pb(x)} = \log \frac{\Pb(y \mid x)}{\Pb(y)}$$

Note that $\pmi(x, y)$ attains its maximum when $\Pb(x \mid y) = 1$ or $\Pb(y \mid x) = 1$, i.e., when one outcome always co-occurs with the other.
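
As a concrete illustration, here is a minimal NumPy sketch that computes a PMI
matrix from toy co-occurrence counts; the counts and the positive-PMI (PPMI)
truncation at the end are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy co-occurrence counts: C[i, j] = number of times words i and j
# co-occur within some context window (made-up numbers for illustration).
C = np.array([[10.0, 2.0, 0.0],
              [2.0, 8.0, 3.0],
              [0.0, 3.0, 6.0]])

P_xy = C / C.sum()                     # joint probabilities Pr(x, y)
P_x = P_xy.sum(axis=1, keepdims=True)  # marginal Pr(x)
P_y = P_xy.sum(axis=0, keepdims=True)  # marginal Pr(y)

# pmi(x, y) = log( Pr(x, y) / (Pr(x) Pr(y)) ); -inf where a pair never co-occurs.
with np.errstate(divide="ignore"):
    pmi = np.log(P_xy / (P_x * P_y))

# Positive PMI (PPMI) is a common re-weighting in practice: it clamps the
# -inf entries of unseen pairs to zero.
ppmi = np.maximum(pmi, 0.0)
print(ppmi)
```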

It is observed empirically that $\pmi(w, w') \approx \inner{v_w}{v_{w'}}$, where $v_w$ denotes the low-dimensional embedding of word $w$.
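
This observation motivates the SVD route from approach (1): factor a (P)PMI
matrix so that inner products of the low-dimensional rows approximate its
entries. A minimal sketch, reusing the `ppmi` matrix from the previous
snippet; the rank and the symmetric square-root split of the singular values
are common conventions rather than choices made in the paper.

```python
import numpy as np

def svd_embeddings(ppmi: np.ndarray, d: int) -> np.ndarray:
    """Rank-d factorization of a symmetric (P)PMI matrix; rows are word vectors."""
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Split the singular values symmetrically between the two factors so
    # that W @ W.T approximates the original matrix.
    return U[:, :d] * np.sqrt(S[:d])

W = svd_embeddings(ppmi, d=2)  # `ppmi` from the previous sketch
print(W @ W.T)                 # inner products approximate the PPMI entries
```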

## Key Contribution

*This paper proposes a generative model for word embeddings that provides a
theoretical justification of the empirical relation
$\pmi(w, w') \approx \inner{v_w}{v_{w'}}$, as well as of word2vec
and GloVe*. The key assumption it makes is that word vectors, the latent
variables of the model, are spatially isotropic (intuition: “no preferred
direction in space”). Isotropy of the low-dimensional vectors also helps
explain the linear structure of word vectors.

## The Generative Model

A time-step model: at time $t$, word $w_t$ is produced by a random walk of
a discourse vector $c_t \in \mathbb{R}^d$ that represents the topic of
conversation. Each generated word $w$ has a latent vector $v_w \in \mathbb{R}^d$
that measures the correlation with the discourse vector. In particular,

$$\Pb(w_t = w \mid c_t) \propto \exp(\inner{c_t}{v_w}),$$

where $w_t$ is the $t$-th word in the corpus, and the discourse vector evolves
by a slow random walk, $c_{t+1} = c_t + \epsilon_{t+1}$ for a small random
displacement $\epsilon_{t+1}$. *Under this model, the authors
prove that the co-occurrence probabilities and marginal probabilities
are functions of the word vectors,*

$$\log \Pb(w, w') = \frac{\lVert v_w + v_{w'} \rVert^2}{2d} - 2 \log Z \pm o(1),
\qquad
\log \Pb(w) = \frac{\lVert v_w \rVert^2}{2d} - \log Z \pm o(1),$$

*where $Z$ is a normalization constant; this is useful when optimizing the
likelihood function*.
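
To make the model concrete, here is a toy simulation of the random walk; the
dimensions, scales, and Gaussian displacement are illustrative assumptions
(the paper only requires the walk's steps to be small).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, T = 50, 1000, 200

# Latent word vectors v_w and an initial discourse vector c_0.
V = rng.normal(scale=1.0 / np.sqrt(d), size=(vocab_size, d))
c = rng.normal(scale=1.0 / np.sqrt(d), size=d)

words = []
for t in range(T):
    # Emission: Pr(w_t = w | c_t) is proportional to exp(<c_t, v_w>).
    logits = V @ c
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    words.append(rng.choice(vocab_size, p=p))
    # Slow random walk: c_{t+1} = c_t + small random displacement.
    c = c + rng.normal(scale=0.01, size=d)
```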

- This paper answers an interesting question: why does a nonlinear model
like word2vec produce outputs that have linear structure
(e.g., king - man + woman ≈ queen; see the sketch after this list)?
- It’s really cool that a relatively simple generative model grounded in a
solid theoretical foundation produces results that are competitive with
neural network models.
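
For reference, a hedged sketch of how such analogies are typically evaluated
with vector arithmetic and cosine similarity; `W` (an embedding matrix) and
`word_to_id` (a vocabulary lookup) are hypothetical inputs from any trained
model, not artifacts of this paper.

```python
import numpy as np

def analogy(W: np.ndarray, word_to_id: dict, a: str, b: str, c: str, topk: int = 1):
    """Solve a : b :: c : ? by ranking words near v_b - v_a + v_c."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    q = Wn[word_to_id[b]] - Wn[word_to_id[a]] + Wn[word_to_id[c]]
    q /= np.linalg.norm(q)
    sims = Wn @ q                     # cosine similarity to every word
    for w in (a, b, c):               # exclude the query words themselves
        sims[word_to_id[w]] = -np.inf
    return np.argsort(-sims)[:topk]

# e.g., analogy(W, word_to_id, "man", "king", "woman") should rank "queen"
# first for embeddings trained on a large corpus.
```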