Lecture Notes

19. Principal Component Analysis

Part of the Series on Linear Algebra.

By Akshay Agrawal. Last updated Dec. 20, 2018.

In this section, we present an important application of the SVD known as principal component analysis (PCA). It takes $n$-dimensional vectors and finds a basis consisting of $k \leq n$ orthonormal vectors that approximates them well. PCA is therefore a linear dimensionality reduction technique. It is used in machine learning, statistics, and exploratory data analysis; PCA can be used to compress data, de-noise them, interpret them, and, when $k = 2$ or $k = 3$, visualize them.

Suppose we have a data matrix $A \in \mathbf{R}^{m \times n}$, meaning that we have $m$ different observations of $n$ variables, and each observation $a_i \in \mathbf{R}^n$ is a row of $A$. We seek an orthonormal list $q_1, q_2, \ldots, q_k \in \mathbf{R}^n$, with $k \leq n$, such that the Euclidean distance from the observations to the subspace spanned by the $q_i$ is minimized.

The optimization problem to solve is

$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^m \|a_i - QQ^Ta_i\|_2^2 \\
\text{subject to} & Q^TQ = I,
\end{array}
$$

where $Q \in \mathbf{R}^{n \times k}$, the matrix whose columns are $q_1, \ldots, q_k$, is the optimization variable. Or, equivalently,

$$
\begin{array}{ll}
\text{minimize} & \|A - AQQ^T\|_F^2 \\
\text{subject to} & Q^TQ = I.
\end{array}
$$

If $Q$ is a solution to this problem, then $AQ$ can be interpreted as an embedding of $A$, and the $i$-th row of $AQ$ is an embedding of $a_i$ (that is, of the $i$-th row of $A$).

It turns out that the solution is to take the $q_i$ to be the first $k$ right singular vectors of $A$.

In applications, the data matrix $A$ is typically centered, so that the sum of its rows equals $0$, and normalized, so that the standard deviation of each column is $1$; these assumptions are useful in practice (see the section on pairwise distances below).
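
To make this setup concrete, here is a minimal numpy sketch of the centering and normalization step; the toy matrix and variable names are illustrative, not taken from the notes.

```python
import numpy as np

# Toy data matrix: m = 5 observations (rows) of n = 3 variables (columns).
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [3.0, 2.0, 2.0],
              [1.0, 1.0, 1.0]])

# Center: subtract the column means, so that the rows sum to the zero vector.
A = A - A.mean(axis=0)

# Normalize: divide each column by its standard deviation, so each column has std 1.
A = A / A.std(axis=0)

print(A.sum(axis=0))  # approximately zero in each column
print(A.std(axis=0))  # approximately one in each column
```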

Solution via the SVD

The distance from each observation $a_i$ to its projection onto the subspace spanned by the $q_i$ is $\|a_i - QQ^Ta_i\|_2$ (see the section on projections), and moreover, by the Pythagorean theorem,

$$\|a_i\|_2^2 = \|QQ^Ta_i\|_2^2 + \|a_i - QQ^Ta_i\|_2^2.$$

Since $\|a_i\|_2^2$ is a constant, minimizing the distance from each observation to its projection is equivalent to maximizing

$$\sum_{i=1}^m \|QQ^Ta_i\|_2^2 = \sum_{i=1}^m \|Q^Ta_i\|_2^2 = \|AQ\|_F^2$$

(the first equality holds because $Q$ has orthonormal columns, and the second because the $a_i^TQ$ are the rows of $AQ$). That is, we can minimize the distance between the observations and the subspace spanned by the $q_i$ by maximizing $\|AQ\|_F^2$, subject to the constraint that the columns of $Q$ are orthonormal. Notice that

$$\max\,\{\|Aq\|_2 : \|q\|_2 = 1\}$$

is attained at the first right singular vector of $A$, or equivalently the top eigenvector of $A^TA$. Assume inductively that $q_1 = v_1, \ldots, q_{k-1} = v_{k-1}$, where $v_i$ is the $i$th right singular vector of $A$. Let $Q_{k-1}$ be the matrix with columns $q_1, \ldots, q_{k-1}$; then $q_k$ is a unit vector that attains

$$\max\,\{\|Aq\|_2 : \|q\|_2 = 1, \ Q_{k-1}^Tq = 0\}.$$

Since $v_k$ attains this maximum, it follows that we may take $q_k = v_k$.

The PCA solution, therefore, takes $q_1, \ldots, q_k$ to be the first $k$ right singular vectors of $A$, $v_1, \ldots, v_k$. Notice that the components of our data, with respect to the basis $v_1, \ldots, v_k$, can be computed as

$$AV_k = U_k\Sigma_k = \begin{bmatrix} \sigma_1 u_1 & \sigma_2 u_2 & \cdots & \sigma_k u_k \end{bmatrix},$$

where the $u_i$ are the left singular vectors of $A$, the $\sigma_i$ are its singular values, and $V_k$, $U_k$, and $\Sigma_k$ collect the first $k$ right singular vectors, left singular vectors, and singular values, respectively. The matrix $AV_k$ is sometimes referred to as a score matrix. The vectors $v_1, \ldots, v_k$ are called the principal vectors of $A$, and, for each observation $a_i$, the scalar $a_i^Tv_j$ is called the $j$-th principal component of $a_i$. The coordinates of an observation in the basis $v_1, \ldots, v_k$ are called the principal components of the observation.
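
As an illustration, here is a short numpy sketch of this recipe. The function name `pca_svd` and the random data are our own; the code simply wraps `np.linalg.svd` and forms the score matrix $AV_k = U_k\Sigma_k$ as described above.

```python
import numpy as np

def pca_svd(A, k):
    """PCA of a centered data matrix A via the SVD.

    Returns the first k principal vectors (columns of V_k) and the
    score matrix A @ V_k = U_k @ diag(sigma_1, ..., sigma_k).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD, singular values descending
    V_k = Vt[:k].T          # n x k: first k right singular vectors
    scores = A @ V_k        # m x k: embedding of the observations
    return V_k, scores

# Example with random, centered data.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
A -= A.mean(axis=0)

V_k, scores = pca_svd(A, k=2)

# The two expressions for the score matrix agree: A @ V_k == U_k @ Sigma_k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(scores, U[:, :2] * s[:2]))  # True
```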

Via the Gram matrix

The score matrix, or embedding matrix, given by PCA is $AV_k = U_k\Sigma_k$. Above, we computed it by first computing the top $k$ eigenvectors of $A^TA$ (the right singular vectors of $A$), storing them in a matrix $V_k$, and computing $AV_k$. We can instead compute $U_k\Sigma_k$ by taking an eigenvector decomposition of the Gram matrix $AA^T = U\Sigma\Sigma^TU^T$: its top $k$ eigenvectors are the columns of $U_k$, and the square roots of the corresponding eigenvalues are the singular values $\sigma_1, \ldots, \sigma_k$.
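
Here is a sketch, with made-up data, of this alternative route: it recovers the score matrix $U_k\Sigma_k$ from an eigendecomposition of the $m \times m$ Gram matrix $AA^T$, which can be convenient when $m$ is much smaller than $n$. The comparison at the end is up to the usual sign ambiguity of singular vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 50, 500, 3
A = rng.standard_normal((m, n))
A -= A.mean(axis=0)

# Eigendecomposition of the m x m Gram matrix A A^T.
G = A @ A.T
eigvals, eigvecs = np.linalg.eigh(G)               # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:k]                # indices of the k largest eigenvalues
U_k = eigvecs[:, top]                              # first k left singular vectors (up to sign)
sigma_k = np.sqrt(np.maximum(eigvals[top], 0.0))   # first k singular values

scores_gram = U_k * sigma_k                        # U_k @ diag(sigma_k)

# Same embedding as A @ V_k from the SVD, up to the sign of each column.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
scores_svd = A @ Vt[:k].T
print(np.allclose(np.abs(scores_gram), np.abs(scores_svd)))  # True
```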

Connection to low-rank approximation

PCA is equivalent to finding an optimal low-rank approximation to the data matrix $A$; instead of outputting the rank-$k$ approximation, PCA outputs an orthonormal basis that can be used to construct the low-rank approximation. Because the SVD furnishes optimal low-rank approximations, this is another sense in which PCA and the SVD are intimately related.

The coordinates of the $i$-th observation $a_i$ with respect to the basis $v_1, \ldots, v_k$ are stored in the $i$-th row of $AV_k$; i.e., the $i$-th row of $AV_k$ is

$$\begin{bmatrix} a_i^Tv_1 & a_i^Tv_2 & \cdots & a_i^Tv_k \end{bmatrix}.$$

In particular, the $i$-th row of the matrix

$$AV_kV_k^T = U_k\Sigma_kV_k^T$$

stores the approximation $\sum_{j=1}^k (a_i^Tv_j)v_j$ of $a_i$ computed by PCA. This matrix is nothing other than the optimal rank-$k$ approximation of $A$.
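
The following sketch, with illustrative data, checks this numerically: the PCA reconstruction $AV_kV_k^T$ coincides with the truncated SVD $U_k\Sigma_kV_k^T$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))
A -= A.mean(axis=0)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k].T

# PCA reconstruction: project each observation onto span{v_1, ..., v_k}.
A_pca = A @ V_k @ V_k.T

# Optimal rank-k approximation of A given by the truncated SVD.
A_lowrank = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

print(np.allclose(A_pca, A_lowrank))  # True: the two matrices coincide
```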

Connection to pairwise distances

When the $a_i$ are centered (i.e., when $\sum_{i=1}^m a_i = 0$), PCA can be equivalently phrased as the following problem:

$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^m \sum_{j=1}^m \left( \|a_i - a_j\|_2^2 - \|Q^Ta_i - Q^Ta_j\|_2^2 \right) \\
\text{subject to} & Q^TQ = I.
\end{array}
$$

Note that each term in the sum is nonnegative because projections do not increase distances. This problem is in turn equivalent to

$$
\begin{array}{ll}
\text{maximize} & \sum_{i=1}^m \sum_{j=1}^m \|Q^Ta_i - Q^Ta_j\|_2^2 \\
\text{subject to} & Q^TQ = I.
\end{array}
$$

The term $\|Q^Ta_i - Q^Ta_j\|_2^2$ can be interpreted as the squared distance between the embedded points $Q^Ta_i$ and $Q^Ta_j$. In this sense, PCA learns a linear (orthogonal projection) embedding that tries to preserve squared pairwise distances.

To see why this is true, note that the objective of the first problem above can be rewritten as

$$\sum_{i,j} \left( \|a_i - a_j\|_2^2 - \|Q^Ta_i - Q^Ta_j\|_2^2 \right) = 2m\sum_{i=1}^m \|a_i\|_2^2 - 2m\sum_{i=1}^m \|Q^Ta_i\|_2^2 - 2\Big\|\sum_{i=1}^m a_i\Big\|_2^2 + 2\Big\|\sum_{i=1}^m Q^Ta_i\Big\|_2^2.$$

The last two terms vanish because the data is centered (and hence the embedded points $Q^Ta_i$ are centered as well). The first term is a constant, so minimizing the objective is equivalent to maximizing $\sum_{i=1}^m \|Q^Ta_i\|_2^2 = \|AQ\|_F^2$, which is the PCA problem.
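
This identity is easy to verify numerically. The sketch below, with illustrative data and names, checks that for centered data the sum of squared pairwise distances between the embedded points equals $2m\|AQ\|_F^2$, so maximizing it is (up to a constant) the PCA objective.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 40, 6, 2
A = rng.standard_normal((m, n))
A -= A.mean(axis=0)                     # center the data

# Any Q with orthonormal columns satisfies the identity; use the PCA choice.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Q = Vt[:k].T

Z = A @ Q                               # rows of Z are the embedded points Q^T a_i
pairwise = sum(np.sum((Z[i] - Z[j]) ** 2) for i in range(m) for j in range(m))

print(np.isclose(pairwise, 2 * m * np.linalg.norm(Z, 'fro') ** 2))  # True for centered data
```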
