In the following statement:
Then the explained variance given by the PCA can be interpreted as
$$[Var(x) + Var(y), \ 0] = [0.0833 + 0.0833, \ 0] = [0.166, \ 0]$$
Why are we adding the variances of x and y? Can someone please explain what formula we are using here? I am not able to understand the mathematics.
Hi @tusharganguli
The variance of x is 0.0833 (or 1/12; see the uniform distribution cheat sheet), and the variance of y is also 0.0833. The total variance of the dataset is the sum of the column variances (the video explains it), and the PCA transformation is just a rotation, which preserves that total. So, after the PCA transformation (rotatedData), the variance of the first column is 0.166 and the variance of the second column is 0, which in other words says that after the applied transformation the first column completely explains the variance of both x and y.
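To see this concretely, here is a minimal sketch (assumed setup and variable names, not the lab's exact code) that reproduces the [0.166, 0] result:

```python
import numpy as np

# y is a perfect copy of x, both uniform on [0, 1),
# so Var(x) = Var(y) = 1/12 ≈ 0.0833.
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 10_000)
y = x.copy()                         # perfect correlation
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)  # center before PCA

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
rotatedData = centered @ Vt.T        # project onto the principal components

print(rotatedData.var(axis=0))       # ≈ [0.166, 0]: component 1 explains everything
```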
Hi @arvyzukai,
Thank you for the explanation. However, my main issue is why we are adding the variances of x and y in the case of the correlated uniform random variables. Does this have something to do with the way the PCA algorithm works? Adding to my confusion, the next section, titled “Correlated Normal Random Variables”, says:
“The explained Variance of the PCA is [1.0094, 0.1125] which is approximately [1, 0.333 * 0.333] = [std1^2, std2^2], the parameters of our original random variables x and y”.
Here we are not adding the variances as we did previously.
That is a good question, @tusharganguli.
I’m not a big math person, so keep that in mind, but I can offer my explanation.
In the first case we created a perfect copy of x to get y (perfect correlation), so the only thing PCA had to do was rotate the blue points to become the orange ones, and the whole variance is explained by a single component (that is why we sum the variances).
In the second case, x and y were created separately (independently) and the correlation was created by rotating them (not perfect correlation). In this case PCA found the rotation needed (a 45° angle), but since x and y were not perfectly correlated, a single component cannot explain the whole variance. The best we can hope for is that each component explains the variance of one of them (var_1 = std_1^2 and var_2 = std_2^2).
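A quick numerical check of this intuition (a sketch with assumed parameters and variable names, not the lab's code):

```python
import numpy as np

# Two independent normals with std 1 and 0.333,
# mixed by a 45-degree rotation to create the correlation.
rng = np.random.default_rng(0)
z = np.column_stack([rng.normal(0, 1, 10_000),
                     rng.normal(0, 0.333, 10_000)])

theta = np.pi / 4                     # the 45-degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
data = z @ R.T                        # correlated x and y

centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
print(S**2 / (len(data) - 1))         # ≈ [1, 0.111] = [std1^2, std2^2]
```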
But I don’t know the exact math behind it (in other words, I could not prove it off the top of my head). I believe @reinoudbosch would be better suited to answer this type of question definitively, because my intuition could be wrong.
Hi @arvyzukai,
I think your explanation makes sense. If we perform the Singular Value Decomposition (SVD) and the necessary steps for PCA, we get the explained variances obtained in the two cases (correlated uniform random variables and correlated normal random variables). I asked ChatGPT for the steps needed to compute PCA through SVD, and it gave the following (I am paraphrasing here):
The steps involved in performing PCA using SVD are as follows (a code sketch of the full recipe follows the list):
- Center the data by subtracting the mean.
- Compute SVD.
- Determine the principal components.
- Compute the explained variance: the explained variance of each principal component is its squared singular value divided by n − 1; dividing a squared singular value by the sum of all squared singular values instead gives the explained variance ratio.
- Choose the desired number of components.
- Project the data with reduced dimensionality.
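Putting those steps into code, a minimal numpy sketch (the function and variable names are my own, not from the lab or ChatGPT) would look like this:

```python
import numpy as np

def pca_via_svd(data, n_components):
    # Step 1: center the data by subtracting the column means.
    centered = data - data.mean(axis=0)
    # Step 2: compute the SVD of the centered data.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    # Step 3: the rows of Vt are the principal components (directions).
    components = Vt[:n_components]
    # Step 4: explained variance per component, plus the ratio of the total.
    explained_variance = S**2 / (len(data) - 1)
    explained_variance_ratio = S**2 / np.sum(S**2)
    # Steps 5-6: keep n_components and project the data onto them.
    projected = centered @ components.T
    return (projected,
            explained_variance[:n_components],
            explained_variance_ratio[:n_components])
```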
I think step 4 is what calculates the explained variance; the ratio form is analogous to the coefficient of determination in linear regression, in that both express a fraction of the total variance explained.
So, basically, we perform the above steps and then analyze the outcome (which you explained, and rightly so) based on the way we have chosen our variables.
I hope my conclusions are in the right direction. Thank you for this discussion.