Significance of PCA and Explained Variance Ratio (explained_variance_ratio_)

I have two questions about the Principal Component Analysis (PCA) videos and labs.

  1. What are some real-life use cases of PCA? I understand that it can help with visualization, but once you visualize the data in fewer dimensions using the new principal components as axes, what does that mean for the data and next steps?
    • For example, in the lab we went from 1000 dimensions to 2 and then 3, and this shows that the data is clustered. But what does this clustering mean?
    • Do we then use k-means clustering to understand what the clusters are, and then use the PCA transformation to see how much each of the 1000 original dimensions contributes to each cluster?
    • It seems like we removed a bunch of data until we were able to find a pattern. It feels like a hack (“if we prune the data using this systematic approach, we end up with a pattern”), which is good, but what about all the information that is lost? Does that mean the extra data is not useful for finding patterns?
    • My point is that it feels like there is a missing video explaining the significance, usefulness and real-life examples of PCA.
      • We used PCA in the Math for ML Specialization to compress an image (which was pretty cool TBH), but this specialization says that’s an antiquated use case.
  2. In the lab for PCA (see screenshots below), the explained_variance_ratio_ says that we were able to preserve about 15% of the variance using 2D and about 20% of the variance if we use 3D.
    • Is this a good percentage? I know we are going from 1000 dimensions down to 2 or 3, so it seems good that just 3 dimensions capture 20% of the information.
    • But 15% or 20% seems little, no? What’s the significance of finding these 8 to 10 clusters if they are missing most of the information in the original data?
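For concreteness, here is a minimal sketch of how explained_variance_ratio_ reports these numbers, using scikit-learn's PCA on synthetic random data (the lab's actual 1000-dimensional dataset is not reproduced here, so the printed values will differ from the lab's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the lab data: 500 samples, 1000 features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))

pca2 = PCA(n_components=2).fit(X)
pca3 = PCA(n_components=3).fit(X)

# explained_variance_ratio_ has one entry per component;
# the sum is the fraction of total variance the projection preserves
print("2D:", pca2.explained_variance_ratio_.sum())
print("3D:", pca3.explained_variance_ratio_.sum())
```

Adding a third component can only add explained variance, which is why the 3D figure is always at least the 2D one.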

Thanks in advance for any suggestions and clarifications!


The only current use I’m aware of is to trim off features which don’t contribute significantly to the cost.

In the classic “handwritten digit classification” example, you can take the original 400 features (20x20 pixel images) and reduce them to maybe 150 features. This will make training faster.

But PCA itself is not a free lunch, since you’re doing some complex math on a rather large matrix - in the handwritten digit example, the training matrix is size (5000 x 400). That takes resources also.
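A sketch of that reduction, using scikit-learn's built-in 8x8 digits (64 features) as a stand-in for the course's 20x20 (400-feature) images:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 samples of 8x8 digit images, standing in for the 5000x400 course data
X, y = load_digits(return_X_y=True)        # X.shape == (1797, 64)

# Project the 64 original features onto the top 20 principal components
pca = PCA(n_components=20).fit(X)
X_reduced = pca.transform(X)               # new (1797, 20) feature matrix
print(X.shape, "->", X_reduced.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())
```

A classifier trained on X_reduced then sees 20 inputs per image instead of 64, at the cost of the up-front fit on the full matrix.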

But since computers have become so big and fast and cheap, saving memory by discarding features just isn’t as important as it once was.

The standard for PCA used to be retaining 99% of the variance.
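On that 99% figure: scikit-learn's PCA accepts a fractional n_components and picks the smallest number of components that retains that share of the variance (again sketched on the 8x8 digits, not the course data):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 samples, 64 features

# A float in (0, 1) asks for the fewest components keeping that variance share
pca = PCA(n_components=0.99).fit(X)
print("components needed for 99%:", pca.n_components_)
print("variance kept:", pca.explained_variance_ratio_.sum())
```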


Thanks for the quick response.

How do we do that? Do we create new features by using the new dimensions/axes? Or do we find the variance added by each feature and pick the top N features with the most variance?

I wish the lab or videos were clearer on this.

Why does the lab highlight preserving around 15% of the variance as impressive? Is it because we preserve 15% of the variance with only 0.2% of the features?

From the lab: “And we preserved only around 14.6% of the variance! Quite impressive!”

There is an optional lab which covers how PCA is implemented.

I’m not sure why this lab claims 15% is impressive.


Yes. The screenshots and quotes are from the optional PCA lab, hence this post. I’m wondering if any of the course creators or editors are in this forum and can help clarify some of this content.