Parameters Diverging When the Learning Rate Is Too Large

Hey, everyone! I am trying to better understand Andrew’s diagrams from “Choosing the learning rate” at around 1:36. They seem to indicate that, if the learning rate is too large, w1 might diverge from the value that minimizes the cost function, but I don’t understand how this is possible. I understand that a learning rate that is too large could overshoot, but I would still expect w1 to get closer to the optimal value on every iteration. For example, if the optimal value of w1 is 0 and w1 starts at 5, it might bounce around 0 like 5 → -4 → 4 → -3 → … Am I missing something here? The diagrams seem to indicate it could instead go something like 5 → -7 → 12 → -20 → … Thanks in advance, everyone!

If the learning rate is too large, each update overshoots the minimum by more than the distance you started from, so w1 lands farther away than where it began. At that new point the gradient is even larger, so the next overshoot is larger still, and the oscillations grow without bound: something like 5 → -7 → 10 → -14 → …, which is exactly the second pattern you wrote. It never settles near the minimum; it diverges toward infinity.
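Here is a minimal sketch (my own, not from the course materials) that you can run to see both behaviors. It assumes the simple one-parameter cost J(w) = w², whose minimum is at w = 0 and whose gradient is 2w, and the helper name `gradient_descent_path` is just something I made up for illustration.

```python
def gradient_descent_path(w0, alpha, steps=5):
    """Return the sequence of w values produced by the update w <- w - alpha * 2*w
    for the cost J(w) = w**2 (gradient dJ/dw = 2*w)."""
    path = [w0]
    w = w0
    for _ in range(steps):
        w = w - alpha * (2 * w)      # each step scales w by the factor (1 - 2*alpha)
        path.append(round(w, 2))     # rounded snapshot; w itself is not rounded
    return path

# Learning rate large enough to overshoot, but small enough that every bounce
# lands closer to the minimum (your 5 -> -4 -> 4 -> -3 intuition):
print(gradient_descent_path(w0=5.0, alpha=0.8))
# [5.0, -3.0, 1.8, -1.08, 0.65, -0.39]

# Learning rate too large: every overshoot lands FARTHER away than it started,
# so the oscillations grow without bound (the diagram's 5 -> -7 -> 12 -> -20 case):
print(gradient_descent_path(w0=5.0, alpha=1.2))
# [5.0, -7.0, 9.8, -13.72, 19.21, -26.89]
```

For this particular cost, each step multiplies w by (1 - 2*alpha), so the sequence diverges whenever |1 - 2*alpha| > 1, i.e. alpha > 1. The exact threshold depends on the curvature of the cost function, but the qualitative picture is the same one Andrew draws in the video.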