Question about how PCA analysis works


So I was trying to figure out how this works mathematically.

Considering a 2D example that’s being reduced to 1D, if the data is represented as

\begin{matrix} x_1^{(1)} & x_2^{(1)} & \dots & x_j^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \dots & x_j^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(i)} & x_2^{(i)} & \dots & x_j^{(i)} \end{matrix}

Call this matrix X.
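
Just so I have something concrete to play with in the notebook, this is how I'd build such a matrix (a sketch assuming NumPy; rows are samples i, columns are features j, and the variable names are just mine):

```python
import numpy as np

# Each row is one sample (i), each column is one feature (j),
# matching the x_j^{(i)} layout above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # 100 samples, 2 features
X = X - X.mean(axis=0)          # center each feature at 0, as PCA usually assumes
print(X.shape)                  # (100, 2)
```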

The lecture shows that we are trying to minimize Xv, where v = (x, y) for the variables x and y (as distinct from the data entries x_j^{(i)}). To minimize a function, we differentiate with respect to x and y and set the gradient equal to zero.

\frac{\partial}{\partial x}(Xv) + \frac{\partial}{\partial y}(Xv) = 0
X(1,0) + X(0,1) = X(1,1) = 0

These lead to the equations (one for each sample i),
\sum_j x_j^{(i)} = 0

which is equivalent to

Xk=0

for the vector k consisting of all ones; in other words, k lies in the kernel (null space) of the X matrix.
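
To check that equivalence numerically (a small self-contained sketch; the toy data here is made up by me):

```python
import numpy as np

X = np.array([[1.0, -1.0],
              [2.0,  3.0],
              [0.5, -0.5]])                 # toy data: rows are samples, columns are features

k = np.ones(X.shape[1])                     # the all-ones vector k
print(X @ k)                                # [0. 5. 0.] -- exactly the per-row sums sum_j x_j^{(i)}
print(np.allclose(X @ k, X.sum(axis=1)))    # True, so Xk = 0 <=> every row sums to zero
```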

Then we can use Gram-Schmidt orthonormalization.

Our first projection is k=(1,1), I guess. That seems odd, since it’s just a vector of ones and doesn’t depend on the data. Possibly that means the first projection isn’t the principal component.

proj_{u_1}(v)=\frac{\langle v,u_1\rangle}{\langle u_1,u_1\rangle}u_1, \quad \text{with } u_1 = k
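
Writing that projection out as code, for my own sanity (the function name proj is mine):

```python
import numpy as np

def proj(v, u):
    """Project v onto the direction u: <v, u> / <u, u> * u."""
    return (v @ u) / (u @ u) * u

k = np.array([1.0, 1.0])
print(proj(np.array([3.0, -1.0]), k))   # [1. 1.] -- the component of (3, -1) along (1, 1)
```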

The second projection considers the data X perpendicular to the line (1,1) (proj_{u_1}(k)), and subtracts it to get the line v_2. So the question is, how is v_2 determined? In principle, using Gram-Schmidt, it can be any line. I guess (1,1) was also an arbitrary line. Probably there are some notation errors here, but

v_2 = (1,1) - proj_{u_1}(k)

and subsequently,

v_n = v_{n-1} - proj_{u_{n-1}}(v_{n-1})

That would result in minimizing the distance to the original guess and subtracting the difference.
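
If I write that recursion as code, it looks like plain Gram-Schmidt (a sketch, not necessarily what the lab does; the starting vectors are arbitrary picks of mine):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize vectors by repeatedly subtracting projections onto earlier ones."""
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for u in basis:
            w -= (w @ u) * u            # u is already unit length, so <u, u> = 1
        norm = np.linalg.norm(w)
        if norm > 1e-12:                # drop vectors that are (numerically) dependent
            basis.append(w / norm)
    return basis

basis = gram_schmidt([np.array([1.0, 1.0]), np.array([1.0, 0.0])])
print(basis)                            # two orthonormal 2D directions
```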

Maybe after rotation to the new coordinate system, the entire v_n axis is removed from the data, and that is how the noise is reduced. Maybe at the end, there is some choice made about which v_n axes can be most safely removed in terms of the dispersion of the data relative to its trendline (or range, max-min), or something like that.
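
My guess at how that choice is usually made, as a sketch I'd try in the notebook (standard PCA via the eigenvectors of the covariance matrix, ranked by variance; I'm not sure this is exactly what the lecture does, and the names here are mine):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centered data onto its top-variance orthonormal directions."""
    Xc = X - X.mean(axis=0)                    # center each feature at 0
    cov = Xc.T @ Xc / (len(Xc) - 1)            # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # rank directions by variance, largest first
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                     # coordinates along the kept directions

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])   # stretched 2D cloud
X_1d = pca_reduce(X, n_components=1)           # 2D -> 1D: keep the high-variance axis only
print(X_1d.shape)                              # (100, 1)
```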

Does that sound about right? Is there anything I'm missing here? I'm curious how this works and was hoping to experiment by implementing it, though obviously I'm not going to if my notebook isn't working at all, lol.

I probably also don’t need to actually run this experiment, but it would be nice to finish the PCA lab and be able to do the week 3 homework.

Steven

Based on my understanding, I think your intuition is correct. Here's the high-level picture, assuming all features are centered at 0:

After Gram-Schmidt orthonormalization, the projections of the data along the orthonormal directions are orthogonal to each other. We can order those directions by decreasing variance (variance here is the same as the "within sum of squares"; it can be proved that the total within sum of squares is the sum of the variances along the orthonormal directions). If we subtract the projections from the original vectors in this order, the residual "information" (a loose term for the residual within sum of squares) drops fastest and is lowest when we reach the last direction. The last few components are usually 'noisy'.
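
Here's a quick numerical illustration of that decomposition, as a sketch (assuming NumPy; the sums of squares along the orthonormal directions come out of the SVD of the centered data, and the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.2])   # three features with very different spreads
X = X - X.mean(axis=0)                                     # center every feature at 0

total_ss = (X ** 2).sum()                      # total within sum of squares
_, s, _ = np.linalg.svd(X, full_matrices=False)
per_direction_ss = s ** 2                      # sum of squares along each orthonormal direction
print(np.isclose(total_ss, per_direction_ss.sum()))        # True: the totals match

# Residual sum of squares left after subtracting the projections
# in decreasing-variance order: it shrinks quickly, and the tail is the 'noisy' part.
print(total_ss - np.cumsum(per_direction_ss))
```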


Thank you! That’s really helpful!