$$ \newcommand{\pmi}{\operatorname{pmi}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}{\mathbb{E}} \newcommand{\RR}{\mathbf{R}} \newcommand{\script}[1]{\mathcal{#1}} \newcommand{\Set}[2]{\{{#1} : {#2}\}} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$

There are two popular classes of word embedding techniques:

1. Compute a low-rank approximation of a re-weighted co-occurrence matrix (i.e., PCA via SVD).
2. Use a neural network language model’s representation of a word (e.g., word2vec, GloVe).

A re-weighting scheme used in (1) is to replace the co-occurrence statistics
with the *pointwise mutual information* between two words.

The pointwise mutual information (PMI) of a pair of outcomes $x$, $y$ of discrete random variables $X$, $Y$ measures the extent to which their joint distribution differs from the product of the marginal distributions:

$$\pmi(x, y) = \log \frac{\Pb(x, y)}{\Pb(x)\Pb(y)}.$$

Note that $\pmi(x, y)$ attains its maximum when $\Pb(x \mid y) = 1$ or $\Pb(y \mid x) = 1$.
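To make approach (1) concrete, here is a minimal numpy sketch (not from the paper; the function names and toy counts are my own, for illustration) that re-weights a symmetric co-occurrence count matrix into a PMI matrix and extracts low-dimensional word vectors via truncated SVD:

```python
import numpy as np

def pmi_matrix(counts, eps=1e-12):
    """PMI for every word pair, given a symmetric co-occurrence count matrix."""
    joint = counts / counts.sum()                 # Pr(w, w')
    marginal = joint.sum(axis=1, keepdims=True)   # Pr(w)
    return np.log((joint + eps) / (marginal @ marginal.T + eps))

def svd_embeddings(matrix, dim):
    """Rank-`dim` factorization: rows of the result are the word vectors."""
    u, s, _ = np.linalg.svd(matrix)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy usage: 5 "words" with random symmetric co-occurrence counts.
rng = np.random.default_rng(0)
c = rng.integers(1, 50, size=(5, 5))
counts = (c + c.T).astype(float)
vectors = svd_embeddings(pmi_matrix(counts), dim=2)
```

In practice the positive part of the PMI matrix (PPMI) is often used instead, since rare pairs make raw PMI noisy and unbounded below.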

It is observed empirically that

$$\pmi(w, w') \approx \inner{v_w}{v_{w'}}$$

for low-dimensional word vectors $v_w$ produced by methods such as word2vec.

*This paper proposes a generative model for word embeddings that provides a
theoretical justification of the empirical PMI relationship above, as well as of word2vec
and GloVe*. The key assumption it makes is that word vectors, the latent
variables of the model, are spatially isotropic (intuition: “no preferred
direction in space”). Isotropy of low-dimensional vectors helps explain
the linear structure of word vectors as well.
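One consequence of the isotropy assumption, which the paper makes heavy use of, is a "self-normalization" property: when the word vectors have no preferred direction, the partition function $Z_c = \sum_w \exp(\inner{c}{v_w})$ is roughly the same constant for every discourse vector $c$. The following sketch is a made-up numerical check of that intuition (random Gaussian vectors, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 20_000, 50

# Isotropic word vectors: i.i.d. Gaussian entries, no preferred direction.
V_iso = rng.normal(scale=4.0 / np.sqrt(dim), size=(n_words, dim))
# Anisotropic word vectors: the same cloud shifted toward one shared direction.
V_aniso = V_iso + 2.0 * np.ones(dim) / np.sqrt(dim)

def relative_spread(V, n_dirs=50):
    """Std/mean of Z_c = sum_w exp(<c, v_w>) over random unit discourse vectors c."""
    C = rng.normal(size=(n_dirs, V.shape[1]))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    Z = np.exp(C @ V.T).sum(axis=1)
    return Z.std() / Z.mean()

print(relative_spread(V_iso))    # small: Z_c is essentially a constant
print(relative_spread(V_aniso))  # much larger: Z_c depends on the direction of c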

A time-step model: at time $t$, word $w_t$ is produced by a random walk of a discourse vector $c_t \in \RR^d$ that represents the topic of conversation. Each word $w$ has a latent vector $v_w \in \RR^d$ that measures its correlation with the discourse vector. In particular:

$$\Pb(w_t = w \mid c_t) \propto \exp(\inner{c_t}{v_w}),$$

where $w_t$ is the $t$-th word. The discourse vector evolves slowly: $c_{t+1}$ is obtained from $c_t$ by adding a small random displacement. *Under this model, the authors
prove that the co-occurrence probabilities and marginal probabilities
are functions of the word vectors; this is useful when optimizing the
likelihood function*.
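For intuition, here is a toy sampler for the random-walk model sketched above (the dimensions, step size, and the unit-sphere normalization of the discourse vector are illustrative choices of mine, not the paper's exact parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, n_steps, step = 1_000, 20, 200, 0.05

word_vectors = rng.normal(size=(n_words, dim))   # latent vectors v_w
c = rng.normal(size=dim)
c /= np.linalg.norm(c)                           # initial discourse vector c_0

corpus = []
for t in range(n_steps):
    # Emit word w_t with probability proportional to exp(<c_t, v_w>).
    logits = word_vectors @ c
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    corpus.append(int(rng.choice(n_words, p=probs)))
    # Slow random walk: add a small random displacement to the discourse vector.
    c = c + step * rng.normal(size=dim)
    c /= np.linalg.norm(c)
```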

- This paper answers an interesting question: Why is it that a nonlinear model like word2vec produces outputs that have linear structure (e.g., king − man + woman ≈ queen)?
- It’s really cool that a relatively simple generative model grounded in a solid theoretical foundation produces results that are competitive with neural network models.