I have two questions on this topic.

Why is it not okay to have different data distributions?
Andrew used the analogy that the dev set is the target we aim at using the train set, and the test set is the actual target we want to hit, so if the two targets are not the same, we would spend a lot of resources aiming at the wrong target.
But I am confused by this analogy. I thought we train models to learn the “mapping” from input to output, not the distribution of the data. So the target should be the mapping function, not the distribution of the data. If the mapping is the same between the dev and test sets, the target should also be the same.
For example, suppose the true mapping between input x and output y is y = x^2. For the train and dev sets, x is sampled from a normal distribution, and for the test set x is sampled from an exponential distribution. Now the distribution has changed, but as long as the model has learned the mapping, which is y = x^2, it should work perfectly well on the test set!
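To make the example concrete, here is a minimal sketch of the setup I have in mind. It assumes noise-free labels and a model that can represent the mapping exactly (a degree-2 polynomial fit with NumPy). The function names, sample sizes, and distribution parameters are my own choices for illustration, not something from the course.

```python
import numpy as np


def true_mapping(x):
    # The assumed ground-truth relationship from the example: y = x^2.
    return x ** 2


def make_splits(n=1000, seed=0):
    # Train and dev inputs come from a standard normal distribution,
    # test inputs come from an exponential distribution (scale=1).
    rng = np.random.default_rng(seed)
    x_train = rng.normal(loc=0.0, scale=1.0, size=n)
    x_dev = rng.normal(loc=0.0, scale=1.0, size=n)
    x_test = rng.exponential(scale=1.0, size=n)
    return x_train, x_dev, x_test


def fit_quadratic(x, y):
    # Least-squares fit of a degree-2 polynomial; returns the
    # coefficients ordered from highest degree to lowest.
    return np.polyfit(x, y, deg=2)


def mse(coeffs, x, y):
    # Mean squared error of the fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, x_dev, x_test = make_splits()
coeffs = fit_quadratic(x_train, true_mapping(x_train))

dev_mse = mse(coeffs, x_dev, true_mapping(x_dev))
test_mse = mse(coeffs, x_test, true_mapping(x_test))
print("coefficients:", coeffs)
print("dev MSE:", dev_mse)
print("test MSE:", test_mse)
```

In this idealized case the fit recovers the quadratic and both errors come out at essentially zero, which is the intuition behind my question.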
From the course I learned that it is okay if the train and dev sets have different distributions, but the dev and test sets should have the same distribution. Why?