In RMSprop, we divide dW by the square root of E[dW^2]. Why not divide it by the square root of E[dW^2] - (E[dW])^2 instead, which is the true variance of dW over roughly the last 1/(1-beta) terms?

Is the true variance not as useful as just the E[dW^2]? Why not?

Hi @P_R_Siddharthan. It looks like you are now in Course 2 but you posted in the Course 1 forum. Please repost there. Thanks!

I just used the little “edit pencil” on the title to move this to DLS Course 2. But unfortunately I don’t know the answer to the actual question at hand here. If I can scare up a few minutes, I’ll go watch that lecture again. But that’s my suggestion for the answer in any case: listen to the lecture again a bit more carefully and I’ll bet Prof Ng addresses this question. I just did a quick scan and towards the end of that lecture, he mentions that RMSprop was first described by Prof Geoff Hinton in a Coursera course. I did a quick Google search and found the PDF slides from that course and the YouTube video of the lecture.

Unfortunately, Geoff Hinton’s lecture does not address why the variance is not used. It just states how the update is computed, the same way as in our DL specialization.

Sorry that he doesn’t give any further explanation, but it’s good to hear that the two statements of the method are consistent.
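
In case it helps anyone reading later, here is a small sketch that compares the two denominators from the question on toy data. It is only an illustration under my own assumptions, not something from either lecture: the function names, the scalar gradients, the learning rate, and the choice to skip bias correction are all made up for the example.

```python
import math


def ema_moments(grads, beta=0.9):
    """Exponentially weighted averages of g and g^2 (no bias correction)."""
    m, s = 0.0, 0.0
    for g in grads:
        m = beta * m + (1 - beta) * g        # running estimate of E[dW]
        s = beta * s + (1 - beta) * g * g    # running estimate of E[dW^2]
    return m, s


def step_sizes(grads, lr=0.01, beta=0.9, eps=1e-8):
    """Return (uncentered, centered) updates for the latest gradient.

    uncentered: lr * g / (sqrt(E[g^2]) + eps), the RMSprop form from the course
    centered:   lr * g / (sqrt(E[g^2] - E[g]^2) + eps), the variance-based
                alternative asked about in this thread (hypothetical here)
    """
    m, s = ema_moments(grads, beta)
    g = grads[-1]
    var = max(s - m * m, 0.0)  # clamp tiny negative values from rounding
    uncentered = lr * g / (math.sqrt(s) + eps)
    centered = lr * g / (math.sqrt(var) + eps)
    return uncentered, centered


# Case 1: a steady gradient. The variance is close to zero, so the
# variance-based denominator nearly vanishes and the step becomes very large.
steady = step_sizes([2.0] * 200)

# Case 2: a gradient that flips sign every step. The mean is close to zero,
# so both denominators are almost the same.
noisy = step_sizes([1.0 if i % 2 == 0 else -1.0 for i in range(200)])

print("steady gradient:", steady)
print("sign-flipping gradient:", noisy)
```

With a steady gradient the uncentered step stays near the learning rate, while the variance-based step grows by orders of magnitude because its denominator is almost zero. With a sign-flipping gradient the two are nearly identical. That only shows how the two quantities behave differently; it does not settle which one is better in practice.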