Can anyone give any intuition or analysis why it’s OK to assume independence between 2 features?
Professor Ng gave a clue that a 3rd feature could be created as the ratio of the two correlated features.
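To make that ratio idea concrete, here is a tiny sketch of what I think was meant (the data and numbers are made up, not from the lecture):

```python
import numpy as np

# Sketch of the ratio-feature idea (made-up data): if x1 and x2 normally rise and
# fall together, their ratio stays in a narrow band for normal examples and drifts
# away from that band for anomalies.
rng = np.random.default_rng(0)
x1 = rng.uniform(50, 100, size=1000)         # first feature
x2 = 0.5 * x1 + rng.normal(0, 2, size=1000)  # second feature, strongly tied to x1
x3 = x1 / x2                                 # new ratio feature: hovers around 2 for normal data

print(x3.mean(), x3.std())                   # roughly 2, with a small spread
```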
As a related question, what about bi-modal distributions? It seems there is a possible similar answer: find another feature that separates the two peaks?
Combining the two questions: how much time is spent creating features, and how much time creating the model? And is there a way to remove the need to create new features by hand and instead discover them, as in deep learning?
If they are dependent, then the assumption breaks. How wrong the assumption is depends on how dependent they are. If they are only very weakly dependent, then the assumption is not very wrong, and we may still build an acceptable model on top of it. In other words, how wrong the model assumption is can be evaluated indirectly through the model’s performance.
If you realize that you can’t assume independence among some features, then instead of the lecture’s approach, you may need a multivariate distribution that takes their dependence (covariance) into account.
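As a rough sketch of what that can look like (my own example with NumPy/SciPy, not course code): fit one multivariate Gaussian whose covariance matrix carries the dependence, instead of one Gaussian per feature.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Synthetic correlated training data, just for illustration
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=2000)

mu = X.mean(axis=0)               # per-feature means
Sigma = np.cov(X, rowvar=False)   # full covariance matrix; off-diagonals capture the dependence

p = multivariate_normal.pdf(X, mean=mu, cov=Sigma)  # density of each training example
threshold = np.percentile(p, 1)                     # e.g. flag the lowest 1% as anomalies
print((p < threshold).sum())
```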
You can certainly just use the bi-modal distribution for that feature, and that should work already.
But if you can separate them, it may help in other aspects of your modeling work!
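For instance, here is a hedged sketch of modeling one bi-modal feature directly with a two-component Gaussian mixture (using scikit-learn; the data is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fake bi-modal feature: two clumps of values
feature = np.concatenate([np.random.normal(-3, 1.0, 500),
                          np.random.normal( 4, 1.5, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(feature)

# score_samples returns log p(x); low values can be flagged as anomalies,
# exactly as with a single-Gaussian model, just with a better-fitting density.
log_p = gmm.score_samples(np.array([[0.5]]))
print(np.exp(log_p))
```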
Google for people’s practice. Some say it’s an 80-20 rule where you spend 80% of the time on data. I suppose a domain expert may spend less.
What goes in determines what comes out (as some say, garbage in, garbage out). Even deep learning may require us to do data augmentation (to images, for example) to get a better result.
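As a tiny example of what augmentation can look like (purely illustrative, fake image data):

```python
import numpy as np

# Horizontal flips double the training set without collecting any new data
images = np.random.rand(32, 28, 28)            # pretend batch of 28x28 grayscale images
flipped = images[:, :, ::-1]                   # mirror each image left-right
augmented = np.concatenate([images, flipped])  # 64 images total
```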
We also can’t forget what’s behind those successful deep learning models - really big data. When we don’t have that luxury but can engineer some good features, it can save a lot of time.
Deep learning uses gradient descent. If your current modeling approach can be integrated into the gradient descent framework, then it’s worth a try!
This is also bugging me.
Say we have two highly related features x_1, x_2 that look like a multivariate Gaussian, where \mathrm{Mean}(x_1)=\mathrm{Mean}(x_2)=0, \mathrm{Var}(x_1)=\mathrm{Cov}(x_1,x_1)=1, \mathrm{Var}(x_2)=\mathrm{Cov}(x_2,x_2)=1, \mathrm{Cov}(x_1,x_2)=0.5.
If you use the product of single-variable Gaussians, you end up flagging anomalies outside a circle (whose radius depends on the threshold), but for the same threshold the actual borderline would be a slanted ellipse. You end up with a high-bias model regardless of the choice of threshold.
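A small sketch of that circle-vs-ellipse point, under the covariance values above (my own illustration, not course code):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# Two points at the same distance from the origin: one along the correlation
# direction (x1 and x2 move together), one against it.
along   = np.array([1.5,  1.5])
against = np.array([1.5, -1.5])

def p_independent(x):
    # product of single-variable Gaussians; with unit variances its contours are circles
    return norm.pdf(x[0]) * norm.pdf(x[1])

print(p_independent(along), p_independent(against))           # identical densities
print(multivariate_normal.pdf(along,   mean=mu, cov=Sigma),
      multivariate_normal.pdf(against, mean=mu, cov=Sigma))   # very different densities
```

The independent model cannot tell these two points apart, while the full multivariate Gaussian assigns a much lower density to the point that goes against the correlation, which is exactly the slanted-ellipse behavior described above.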
I’m guessing that most of these problems start with (essentially) independent features, so it’d be totally fine to assume independent Gaussian distributions, but as you detect dependency or deliberately add dependent features, you should switch to the multivariate Gaussian.
Or, more generally, to add to the above:
Just use whichever distribution model that seems to fit well.
Thank you, Joe @shinli256, for sharing your perspective! Very helpful!
Just to expand on it a little bit: it is as if, knowing that the data is distributed like below, where the threshold boundary may just look like the green line, we were still forcing a circular threshold boundary by applying the product of two Gaussians, and so we were … (I will let learners fill that in, but Joe’s description of bias is very good and fits the situation and the meaning of bias very well.)