Can anyone give any intuition or analysis why it’s OK to assume independence between 2 features?
Professor Ng gave a clue that a 3rd feature could be created as a ratio of the two correlated features.

As a related question, what about bi-modal distributions? It seems there is a possible similar answer: find another feature that separates the two peaks?

Combining the two questions: how much time is spent creating features, and how much creating the model? And is there a way to remove the need to create new features by hand and discover them instead, as in deep learning?

If the features are dependent, then the assumption breaks. How wrong the assumption is depends on how strongly they are dependent. If they are only very weakly dependent, the assumption is not far off, and we may still build an acceptable model on it. In other words, how wrong the model assumption is can be evaluated indirectly.

If you realize that you can’t assume independence among some features, then instead of the lecture’s approach, you may need a multi-variate distribution which takes into account their dependence (covariance).
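To make the difference concrete, here is a minimal sketch in Python with NumPy. The data, the 0.9 correlation, and the two test points are made up for illustration and are not from the lecture. It compares the density from a product of per-feature Gaussians with the density from a Gaussian that uses the full covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: two strongly correlated features.
x1 = rng.normal(0.0, 1.0, size=2000)
x2 = 0.9 * x1 + rng.normal(0.0, 0.3, size=2000)
X = np.column_stack([x1, x2])

mu = X.mean(axis=0)
var = X.var(axis=0)             # per-feature variances (independence model)
cov = np.cov(X, rowvar=False)   # full covariance (multivariate model)

def log_p_independent(x):
    # Sum of per-feature Gaussian log densities (features treated as independent).
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var))

def log_p_multivariate(x):
    # Log density of a Gaussian with full covariance, so dependence is modeled.
    d = x - mu
    k = len(mu)
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)   # squared Mahalanobis distance
    return float(-0.5 * (k * np.log(2 * np.pi) + logdet + maha))

typical = np.array([1.0, 0.9])   # follows the correlation
odd = np.array([1.0, -0.9])      # each value is ordinary, the combination is not

for name, point in [("typical", typical), ("odd", odd)]:
    print(name, log_p_independent(point), log_p_multivariate(point))
```

Under the independence model the two points get almost the same score, because each coordinate looks ordinary on its own. The full-covariance model gives the second point a much lower density, since that combination of values goes against the correlation seen in the data.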

You can certainly just use a bimodal distribution for that feature, and that should already work.

But if you can separate them, it may help in other aspects of your modeling work!
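As one possible way to model a bimodal feature directly, here is a rough sketch that fits a two-component Gaussian mixture with a few EM iterations. The mixture model, the synthetic data, and the helper names (`normal_pdf`, `p_single`, `p_mixture`) are my own assumptions for illustration; the course does not prescribe this.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bimodal feature: values cluster around -3 and +3.
x = np.concatenate([rng.normal(-3.0, 0.5, 500), rng.normal(3.0, 0.5, 500)])

def normal_pdf(v, mu, var):
    # Univariate Gaussian density, broadcasting over arrays.
    return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Single Gaussian fit, which ignores the two peaks.
mu1, var1 = x.mean(), x.var()

# Two-component mixture, fitted with a fixed number of EM iterations.
w = np.array([0.5, 0.5])               # mixture weights
mu = np.array([x.min(), x.max()])      # crude initial means
var = np.array([x.var(), x.var()])     # initial variances
for _ in range(50):
    # E-step: responsibility of each component for each point, shape (n, 2).
    r = w * normal_pdf(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances.
    n_k = r.sum(axis=0)
    w = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

def p_single(v):
    return float(normal_pdf(v, mu1, var1))

def p_mixture(v):
    return float(np.sum(w * normal_pdf(v, mu, var)))

# A value in the gap between the two peaks.
print(p_single(0.0), p_mixture(0.0))
```

The single Gaussian puts its peak right in the gap between the two clusters, so a value near 0 looks perfectly normal to it. The mixture assigns that value a very low density, which is usually what you want if values in the gap should count as unusual.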

Google for what people do in practice. Some say it's an 80-20 rule, in which you spend 80% of your time on data. I suppose a domain expert may spend less.

What goes in determines what comes out (some say garbage in, garbage out). Even deep learning may require us to do data augmentation (on images, for example) to get a better result.

We also can't forget what's behind those successful deep learning models: really big data. When we don't have that luxury but can engineer some good features, it can save a lot of time.

Deep learning uses gradient descent. If your current modeling approach can be integrated into the gradient descent framework, then it is worth a try!