Question regarding learning rate graph from W2 logistic regression lab


As you may remember, gradient descent tries to drive the cost function as close to its minimum as possible, and the learning rate controls the step size. Too large a learning rate can prevent the cost from converging at all (it may oscillate or diverge), while too small a learning rate means convergence takes a very large number of iterations. So a common way to choose the learning rate is to try a range of values such as 0.001, 0.01, and 0.1 when training a model. Prof. Andrew also suggests a finer search: increase the learning rate by roughly a factor of three each time. For example, if you start with 0.001, next try 0.001 × 3 = 0.003, then 0.01, 0.03, 0.1, and so on. If the cost starts to diverge at some value, back off toward the previous smaller one until you find a rate where the cost decreases steadily.
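To make the idea concrete, here is a minimal sketch of that learning-rate sweep on a tiny logistic regression problem. The toy data, iteration count, and the exact list of alphas are my own illustrative choices, not taken from the lab:

```python
import numpy as np

# Toy 2-feature binary classification data (illustrative, not from the lab)
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5],
              [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def run_gd(alpha, iters=1000):
    # Parameters initialized at zero, as in the lab
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w -= alpha * X.T @ (p - y) / len(y)   # gradient step for weights
        b -= alpha * np.mean(p - y)           # gradient step for bias
    return cost(w, b)

# Roughly tripling the learning rate each time: 0.001, 0.003, 0.01, ...
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    print(f"alpha={alpha:<6} final cost={run_gd(alpha):.4f}")
```

With the same iteration budget, the larger (but still stable) rates reach a lower cost; if you pushed alpha high enough, the cost would stop decreasing, which is the signal to back off.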

The high initial cost you see at the start of training is expected, because the parameters are usually initialized at zero, so the model starts far from the minimum. The goal is to reach the lowest cost with a learning rate that is stable, and within an achievable number of iterations to keep training time reasonable. As the iterations increase, the cost keeps decreasing, and after some point the per-iteration reduction becomes very small and the curve flattens out, which indicates the model has converged.
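You can see that flattening by recording the cost at every iteration. A minimal sketch, again on made-up 1-D data with an alpha I chose for illustration:

```python
import numpy as np

# Toy 1-feature data (illustrative only)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b = np.zeros(1), 0.0        # zero-initialized parameters -> high starting cost
alpha, history = 0.1, []
for _ in range(2000):
    p = sigmoid(X @ w + b)
    history.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    w -= alpha * X.T @ (p - y) / len(y)
    b -= alpha * np.mean(p - y)

# Early iterations cut the cost quickly; later ones barely move it.
print(f"cost drop over first 100 iters: {history[0] - history[100]:.4f}")
print(f"cost drop over last 100 iters:  {history[-101] - history[-1]:.6f}")
```

Plotting `history` against the iteration number gives exactly the learning curve from the lab: a steep drop at first, then a long flat tail once the model has converged.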

I am attaching a few of the slides by Prof. Andrew explaining the learning rate and gradient descent. They are self-explanatory. If you are still not satisfied with the answer, do ask!