Hi everyone! I realized I didn't have a good intuition for why some gradient descent variants (especially RMSprop and Adam) work better than others, so I made a little Colab exploring how these algorithms behave on a simple quadratic function.
Wanted to share it with the community here; perhaps someone else will find it useful too.
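
To give a flavor of the kind of comparison I mean, here is a minimal sketch in plain Python. Everything specific in it is an illustrative assumption rather than the exact setup from the Colab: the quadratic f(x, y) = 0.5 * (x^2 + 25 * y^2), the starting point, the learning rate, the step count, and the helper names `loss`, `grad`, `run_gd`, `run_rmsprop` and `run_adam`. The update rules follow the standard textbook forms of each optimizer.

```python
import math

# Illustrative ill-conditioned quadratic: f(x, y) = 0.5 * (x**2 + 25 * y**2).
# The curvature differs by 25x between the two axes (an assumed example).
SCALES = (1.0, 25.0)


def loss(p):
    return 0.5 * sum(s * v * v for s, v in zip(SCALES, p))


def grad(p):
    # Gradient of the quadratic above: (x, 25 * y).
    return [s * v for s, v in zip(SCALES, p)]


def run_gd(p, lr=0.03, steps=100):
    """Plain gradient descent: same learning rate for every coordinate."""
    p = list(p)
    for _ in range(steps):
        g = grad(p)
        p = [v - lr * gi for v, gi in zip(p, g)]
    return p


def run_rmsprop(p, lr=0.03, beta=0.9, eps=1e-8, steps=100):
    """RMSprop: divide each coordinate's step by a running RMS of its gradient."""
    p = list(p)
    sq = [0.0] * len(p)  # running average of squared gradients
    for _ in range(steps):
        g = grad(p)
        sq = [beta * s + (1 - beta) * gi * gi for s, gi in zip(sq, g)]
        p = [v - lr * gi / (math.sqrt(s) + eps) for v, gi, s in zip(p, g, sq)]
    return p


def run_adam(p, lr=0.03, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    """Adam: momentum on the gradient plus RMS scaling, both bias-corrected."""
    p = list(p)
    m = [0.0] * len(p)   # running average of gradients
    v2 = [0.0] * len(p)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(p)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v2 = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v2, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v2]
        p = [v - lr * mh / (math.sqrt(vh) + eps)
             for v, mh, vh in zip(p, m_hat, v_hat)]
    return p


if __name__ == "__main__":
    start = [5.0, 5.0]
    print("start", loss(start))
    for name, fn in [("gd", run_gd), ("rmsprop", run_rmsprop), ("adam", run_adam)]:
        end = fn(start)
        print(name, end, loss(end))
```

With plain gradient descent the learning rate has to stay small enough for the steep y direction, which makes progress along the flat x direction slow. RMSprop and Adam rescale each coordinate by its own gradient magnitude, so both directions move at a similar pace. Changing `SCALES`, `lr` or `steps` and watching the printed losses is a quick way to see where each method helps or struggles.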