So can anyone explain why dividing the EWA by the sum of the weights (1-beta^t) eliminates the bias in the initial stage?
Is it done to increase the weightage of the initial data points (0.2*theta1 etc…)? If so, why exactly divide by the sum of the weights?
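To write out what I mean by "sum of the weights" (assuming the usual recursion from the lectures, v_t = beta*v_{t-1} + (1-beta)*theta_t with v_0 = 0):

$$
v_t = (1-\beta)\sum_{i=1}^{t} \beta^{\,t-i}\,\theta_i,
\qquad
(1-\beta)\sum_{i=1}^{t}\beta^{\,t-i} = (1-\beta)\,\frac{1-\beta^{t}}{1-\beta} = 1-\beta^{t}.
$$

So the coefficients on theta_1, …, theta_t add up to 1-beta^t, which is exactly what we divide by.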
I hope you know that initialising the weights to 0 instead of random numbers basically means your model learns nothing new about the features of your data, whereas initialising the bias to 0 has no such effect: the bias just acts as a threshold for neuron activation instead of diversifying the learning path.
Sorry, but what are you trying to convey? I did not get your point. I am asking about the part where we divide the EWA by 1-beta^t. Is it because we just want to scale up the weights for the initial stages?
Krishna,
Your question seemed to ask why the bias gets eliminated at the initial stage, so I wasn't trying to convey anything else; I was explaining why biases are initialised at 0 and weights at random numbers.
The equation you are asking about addresses the bias issue, if it is present, when the weight curves don't match between validation and training data.
So Vt / (1-B^t) is addressing the current data, i.e. the current iteration, batch, or epoch in question while training a model.
It is not scaling the weights but determining the weight related to the time step, i.e. the current data point. Of course, the idea behind a model's weights learning something is to start from random values, scaling the weights to random numbers.
Deepti,
I am asking why they came up with dividing by the sum of the weights (1-beta^t). Is it because, after multiple iterations, the sum of the weights increases and the denominator tends towards 1, so even the EWAs become unbiased as the number of iterations goes up?
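For what it's worth, here is a minimal numeric sketch of that intuition (purely illustrative, assuming a constant signal theta_t = 1.0 and beta = 0.9):

```python
# Minimal sketch: EWA with and without bias correction on a constant signal.
# Assumes theta_t = 1.0 for all t and beta = 0.9 (hypothetical values).

beta = 0.9
v = 0.0  # EWA initialised at 0, which is what causes the early bias

for t in range(1, 11):
    theta = 1.0
    v = beta * v + (1 - beta) * theta   # raw EWA
    v_corrected = v / (1 - beta ** t)   # bias-corrected EWA

    # The raw v starts far below the true average (1.0); the corrected value
    # is exactly 1.0, because 1 - beta**t is the sum of the EWA coefficients.
    # As t grows, beta**t -> 0, the denominator -> 1, and the two curves meet.
    print(f"t={t:2d}  raw={v:.4f}  corrected={v_corrected:.4f}  1-beta^t={1 - beta**t:.4f}")
```

With a constant signal, the corrected estimate equals the true average from t = 1 onwards, while the raw EWA only catches up once 1-beta^t gets close to 1, which matches the idea that the correction only matters in the initial stage.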