A Latent Variable Model Approach to PMI-based Word Embeddings (Arora et al., 2016)


There are two popular classes of word embedding techniques:

  1. Compute a low-rank approximation of a re-weighted co-occurrence matrix (i.e., PCA via SVD).
  2. Learn word representations as parameters of a predictive model (e.g., word2vec, GloVe).

A common re-weighting scheme in (1) replaces the raw co-occurrence counts with the pointwise mutual information (PMI) between the two words.
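A minimal sketch of approach (1): build a co-occurrence matrix from a toy corpus, re-weight it with (positive) PMI, and take a truncated SVD to get low-dimensional embeddings. The corpus, window size, and embedding dimension are illustrative choices, not values from the paper.

```python
import numpy as np

# Toy corpus; count symmetric co-occurrences within a small window.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

window = 2
counts = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[idx[w], idx[corpus[j]]] += 1

# Re-weight counts with PMI; clamp negatives/-inf to 0 (the common PPMI variant).
total = counts.sum()
p_xy = counts / total
p_x = counts.sum(axis=1) / total
p_y = counts.sum(axis=0) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_xy / np.outer(p_x, p_y))
ppmi = np.maximum(pmi, 0)

# Rank-d factorization via SVD gives the embeddings.
d = 3
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :d] * np.sqrt(S[:d])
print(embeddings.shape)  # (V, d)
```

Splitting the singular values as sqrt(S) between the two factors is one common convention; using U[:, :d] * S[:d] or plain U[:, :d] also appears in practice.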

The pointwise mutual information (PMI) of a pair of outcomes $x$, $y$ of discrete random variables $X$, $Y$ measures the extent to which their joint distribution differs from the product of the marginal distributions:

$$\mathrm{pmi}(x; y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

Note that $\mathrm{pmi}(x; y)$ attains its maximum when $p(x \mid y) = 1$ or $p(y \mid x) = 1$, i.e., when the outcomes always co-occur.
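A quick numeric check of the maximum: with hypothetical marginals chosen so that $p(x \mid y) = 1$ (note this requires $p(y) \le p(x)$ for a valid joint distribution), the PMI equals $-\log p(x)$.

```python
import math

# Hypothetical marginals with p(x | y) = 1, so p(x, y) = p(y).
p_x, p_y = 0.4, 0.25
p_xy = p_y * 1.0

pmi = math.log(p_xy / (p_x * p_y))
print(pmi, -math.log(p_x))  # both ≈ 0.916
```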

It is observed empirically that the PMI matrix is well approximated by a low-rank matrix: for trained low-dimensional word vectors, $\mathrm{PMI}(w, w') \approx \langle v_w, v_{w'} \rangle$ (up to scaling).

Key Contribution

This paper proposes a generative model for word embeddings that provides a theoretical justification for PMI-based methods such as word2vec and GloVe. The key assumption it makes is that word vectors, the latent variables of the model, are spatially isotropic (intuition: “no preferred direction in space”). Isotropy of the low-dimensional vectors also helps explain the linear (analogy) structure of word vectors.

The Generative Model

A time-step model: at time $t$, word $w_t$ is produced by a slow random walk of a discourse vector $c_t \in \mathbb{R}^d$ that represents the topic of conversation. Each word $w$ has a latent vector $v_w \in \mathbb{R}^d$ that measures its correlation with the discourse vector. In particular:

$w_t$ is the $t$-th word, emitted with probability $\Pr[w_t = w \mid c_t] \propto \exp(\langle c_t, v_w \rangle)$.

$c_{t+1}$ is obtained from $c_t$ by adding a small random displacement. Under this model, the authors prove that the co-occurrence probabilities $p(w, w')$ and marginal probabilities $p(w)$ are functions of the word vectors alone (up to a normalization constant and a small error term); this is useful when optimizing the likelihood function.
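The generative process above can be simulated directly. This is a minimal sketch, not the paper's experimental setup: the vocabulary size, dimension, step size, and Gaussian word vectors are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, steps, step_size = 50, 10, 20, 0.05

word_vectors = rng.normal(size=(V, d))      # latent word vectors v_w (assumed Gaussian)
c = rng.normal(size=d)
c /= np.linalg.norm(c)                      # initial discourse vector c_0 on the sphere

emitted = []
for t in range(steps):
    logits = word_vectors @ c               # <c_t, v_w> for every word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # Pr[w | c_t] ∝ exp(<c_t, v_w>)
    emitted.append(int(rng.choice(V, p=probs)))
    c = c + step_size * rng.normal(size=d)  # small random displacement
    c /= np.linalg.norm(c)                  # keep the walk near the sphere

print(emitted)
```

Because the walk is slow, nearby words are emitted under nearly the same discourse vector, which is what ties co-occurrence probabilities to inner products of word vectors.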