In PCA we compute the eigenvectors and eigenvalues of the covariance matrix. This means we treat the covariance matrix as a linear transformation matrix T. But why do we use the covariance matrix specifically to compute eigenvectors and eigenvalues, and not some other matrix?

Secondly, once we find the eigenvectors (or principal components), we multiply the standardized data X by the matrix W formed by the principal components to obtain the new dataset Z.

Z = X.W

Why do we do this computation?

This is all based on the mathematics of PCA and how it works. Prof Serrano explained it in the lectures. The covariance matrix is a square matrix that gives the covariance between each pair of features, computed across all the samples in the original dataset. You then compute the eigenvectors and eigenvalues of that matrix, and they give you very valuable information about the contents of your data. Because the covariance matrix is symmetric, its eigenvectors are orthogonal, and each eigenvalue is the variance of the data along the corresponding eigenvector. That is why this particular matrix is the one we use. The eigenvectors are a basis for a transformation of the original data into a new “space” or orientation, and when you sort them in order of decreasing eigenvalue they tell you which dimensions of that new space carry the most variance and are thus most relevant to any solution you want to find. Multiplying the data by the matrix of eigenvectors, as in Z = X.W, is what expresses each sample in that new basis. The point of PCA is that you can then choose a threshold below which the dimensions don’t have much effect on the solution of your problem, as indicated by small eigenvalues. So you can transform into the new space and then just drop the weakest dimensions in order to streamline your computations.
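To make those steps concrete, here is a minimal NumPy sketch of the procedure described above. It is not code from the course: the toy data, the variable names, and the choice of k = 2 are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features.
# Feature 2 is almost a copy of feature 0, so one direction is nearly redundant.
n_samples = 200
base = rng.normal(size=(n_samples, 2))
noisy_copy = base[:, 0] + 0.1 * rng.normal(size=n_samples)
raw = np.column_stack([base[:, 0], base[:, 1], noisy_copy])

# Standardize each feature (zero mean, unit standard deviation).
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Covariance matrix of the features (3 x 3, symmetric).
C = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so reorder to get decreasing eigenvalues.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Fraction of the total variance carried by each principal direction.
explained = eigvals / eigvals.sum()

# Keep the k strongest directions (k = 2 is an arbitrary choice here)
# and project the data onto them: Z = X.W
k = 2
W = eigvecs[:, :k]
Z = X @ W
```

In this sketch the covariance matrix of Z comes out diagonal, with the kept eigenvalues on the diagonal, which is one way to see why the eigenvectors of the covariance matrix in particular are the useful ones: in the new basis the features are uncorrelated and ordered by variance.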

The math here is pretty deep. Prof Ng also covered PCA in the original Stanford Machine Learning series. Here’s his main lecture as a YouTube video if you want another “take” on this topic. In his presentation he uses Singular Value Decomposition, which is another way to package what I was describing above w.r.t. the covariance matrix calculations and eigenvalues.
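As a rough check that the two routes agree, the sketch below compares the eigendecomposition of the covariance matrix with SVD. Exactly how the lecture applies SVD is not spelled out here, so the sketch tries it both on the centered data matrix and on the covariance matrix itself; the random data and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated data: 100 samples, 4 features, mean-centered.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix, sorted by decreasing eigenvalue.
C = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data matrix. The rows of Vt are the principal
# directions, and the squared singular values divided by (n - 1) are the eigenvalues.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = S**2 / (n - 1)

# Route 3: SVD of the covariance matrix. For a symmetric positive semi-definite
# matrix the singular values are the eigenvalues.
Uc, Sc, Vct = np.linalg.svd(C)

same_eigvals = np.allclose(eigvals, svd_eigvals) and np.allclose(eigvals, Sc)

# Eigenvectors are only defined up to sign, so compare absolute dot products
# between matching columns.
same_directions = (
    np.allclose(np.abs(np.sum(eigvecs * Vt.T, axis=0)), 1.0)
    and np.allclose(np.abs(np.sum(eigvecs * Uc, axis=0)), 1.0)
)
```

Both flags should come out True for data like this, where the eigenvalues are distinct. When eigenvalues repeat, the individual directions are no longer unique and only the subspace they span can be compared.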

What does each column vector in the n * n matrix of eigenvectors mean?

Does taking eigenvectors of any transformation matrix T result in a vector that captures the most variance in the data?

I don’t think I understand the question. A symmetric n x n matrix like the covariance matrix has n orthogonal eigenvectors, and the number with nonzero eigenvalues equals the rank of the matrix, right? So what do you mean “result in *a vector* that captures most of the variance”? Not every transformation will have a *single* eigenvector that dominates, if that’s what you’re asking. It all depends. But this is with the proviso that I mentioned at the beginning: I’m not sure I understand your point.

Let me frame it this way: what’s the importance of computing eigenvectors of any linear transformation matrix T?

If you simply want to solve a system of linear equations, it is not necessary. But one thing that computing the eigenvectors and eigenvalues can do for you is allow you to perform dimension reduction using PCA, as Prof Serrano is explaining here.