After watching the video on plateau issues, I started to wonder whether, at each iteration, we can test several values for the learning rate and choose the one that gives the best decrease in the cost function. I think this is what the gradient descent literature calls line search. When we are on a plateau, we could then use a larger learning rate to speed up optimization. The downside is that we would have to compute the cost function several times at each iteration of gradient descent, which slows down learning.

Do people in deep learning try this strategy, or does the extra computation from evaluating the cost function several times outweigh the benefit of a better learning-rate choice?
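For concreteness, here is a minimal sketch of the idea as a backtracking line search (the function names, initial rate, and shrink factor are illustrative, not something from the course):

```python
import numpy as np

def backtracking_line_search(cost, grad, w, lr0=1.0, shrink=0.5, max_tries=10):
    """Try progressively smaller learning rates until the cost decreases.
    lr0, shrink, and max_tries are illustrative hyperparameters."""
    g = grad(w)
    c0 = cost(w)
    lr = lr0
    for _ in range(max_tries):
        w_new = w - lr * g
        if cost(w_new) < c0:      # accept the first rate that improves the cost
            return w_new, lr
        lr *= shrink              # otherwise shrink the rate and re-evaluate
    return w - lr * g, lr         # fall back to the smallest rate tried

# Tiny demo on f(w) = ||w||^2, whose gradient is 2w
cost = lambda w: float(np.sum(w ** 2))
grad = lambda w: 2 * w
w = np.array([3.0, -4.0])
for _ in range(20):
    w, lr = backtracking_line_search(cost, grad, w)
print(cost(w))  # close to 0
```

This shows the trade-off from the question directly: each outer iteration may call `cost` several times before accepting a step.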

We use Adam, RMSprop, and similar techniques to combat the saddle point issue. They are fast and improve optimization not only on plateaus but throughout the entire search for a local minimum. Why do they work? Maybe you have figured it out already: even when the current gradient is close to zero (we are moving slowly down the saddle), we carry inertia and direction from previous steps, which is hopefully enough to escape the saddle point. Without that inertia and remembered direction, we would have to walk all the way down to the bottom of the saddle before leaving the area in search of an even lower cost.
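To illustrate the inertia argument, here is a minimal gradient-descent-with-momentum sketch (the learning rate, beta, and the artificial "plateau" where the gradient is exactly zero are assumptions for the demo, not course code):

```python
import numpy as np

def momentum_step(w, v, g, lr=0.1, beta=0.9):
    """One momentum update: v accumulates past gradients ("inertia")."""
    v = beta * v + (1 - beta) * g
    return w - lr * v, v

w, v = np.array([5.0]), np.zeros(1)

# Build up velocity in a region with a large gradient (f(w) = w^2, grad = 2w)
for _ in range(5):
    w, v = momentum_step(w, v, 2 * w)

# Now pretend we hit a flat region: the gradient is zero,
# but the accumulated velocity v keeps us moving.
w_before_plateau = w.copy()
for _ in range(3):
    w, v = momentum_step(w, v, np.zeros(1))

print(w != w_before_plateau)  # w kept changing despite a zero gradient
```

Plain gradient descent would stand still on the flat region (the update would be `lr * 0`), while the momentum term carries the optimizer across it, which is exactly the saddle-escape behavior described above.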

In course 5, we will learn more about beam search. Despite the similar name, it is not a variant of line search: beam search explores candidate output sequences during decoding, whereas line search picks a step size along the gradient direction.

I think I got it. Momentum techniques will probably prevent us from getting stuck on the path with small gradients that Andrew presented in the video. Thanks!