In the screenshot, the image represents a single RNN unit. The parameters Wa and ba are updated, and then a gets passed to the next layer. So at each time step the same parameters, Wa and ba, are used. How do the shared parameters help? If Wa and ba learn one set of values at the 1st time step and can then learn a different set at the 2nd time step, aren't we losing the values learned at the 1st time step? Can you please explain?
Personally, I wondered about that too, but the best answer I found is that shared parameters help with the following:

Applying the model to examples of different lengths. If an RNN used different parameters for each step while reading a sequence during training, it would not generalize to unseen sequences of different lengths.

Oftentimes, the sequences operate according to the same rules across the sequence. For instance, in NLP:
"On Monday it was snowing" "It was snowing on Monday"
These two sentences mean the same thing. If we changed the parameters at every time step, the computed values would change, and the model would treat the two sentences as meaning different things. Parameter sharing reflects the fact that we are performing the same task at each step; as a result, we don't have to relearn the rules at each point in the sentence.
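The point about sequence lengths can be sketched in a few lines of NumPy. This is only an illustration, not the course's implementation: the sizes, the random initialization, and the split of Wa into `Waa` and `Wax` are assumptions made for the example. The same three arrays are reused at every step, so one set of parameters handles a 5-step and a 12-step sequence.

```python
import numpy as np

def rnn_forward(x_seq, Waa, Wax, ba):
    """Run a basic RNN over a sequence, reusing the same parameters at every step."""
    a = np.zeros(Waa.shape[0])            # initial hidden state
    states = []
    for x_t in x_seq:                     # one loop iteration per time step
        a = np.tanh(Waa @ a + Wax @ x_t + ba)
        states.append(a)
    return np.stack(states)

rng = np.random.default_rng(0)
n_a, n_x = 4, 3                           # hypothetical hidden and input sizes
Waa = rng.normal(size=(n_a, n_a)) * 0.1   # together, Waa and Wax play the role of Wa
Wax = rng.normal(size=(n_a, n_x)) * 0.1
ba = np.zeros(n_a)

short_seq = rng.normal(size=(5, n_x))     # 5 time steps
long_seq = rng.normal(size=(12, n_x))     # 12 time steps

states_short = rnn_forward(short_seq, Waa, Wax, ba)
states_long = rnn_forward(long_seq, Waa, Wax, ba)

n_params = Waa.size + Wax.size + ba.size  # does not depend on sequence length
print(states_short.shape)                 # (5, 4)
print(states_long.shape)                  # (12, 4)
print(n_params)                           # 32
```

With a separate parameter set per time step, the parameter count would grow with the sequence length, and there would be no parameters at all for step 12 if training only ever saw 5 steps.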
Cheers!
Abdelrahman
The internal state of the RNN node needs to be complex enough to handle whatever the requirements are of the problem you are trying to solve. If there are lots of timesteps and complex relationships between what happens early and what happens later, then you'll need a complex RNN state and the training will take lots of examples to learn weight values that actually work well. The state can potentially learn to differentiate what happens early in the sequence from what happens later and make predictions based on that learning. That's the point.
We will later learn about ways to add more complex state to the RNN cell state by using LSTM cells in addition to the basic RNN cell. Stay tuned for that later in Week 1 of C5.
The concept is that the same parameters are shared across every time step for each example. If that's the case, then the parameters are updated after each example has gone through the network, right?
And if the parameters are being shared at each timestep, then will the softmax give out the same word at each timestep?
No, because the input is different at each timestep, right? You have two inputs at each timestep: the new x value for the timestep and the updated state value a from the previous timestep. The parameters (the w and b values) may be shared across the timesteps, but they operate on different inputs each time.
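A small NumPy sketch makes this concrete. Everything here is assumed for illustration (the sizes, the random weights, and the output parameters `Wya` and `by`): the loop applies identical parameters at each timestep, yet the softmax output changes because both x and the carried state a change.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
n_a, n_x, n_y = 4, 3, 5                   # hypothetical hidden, input, and vocabulary sizes
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, n_x))
Wya = rng.normal(size=(n_y, n_a))
ba = np.zeros(n_a)
by = np.zeros(n_y)

x_seq = rng.normal(size=(6, n_x))         # a different input x at each of 6 timesteps
a = np.zeros(n_a)
y_hats = []
for x_t in x_seq:
    # Same Waa, Wax, ba, Wya, by every time; only x_t and a differ.
    a = np.tanh(Waa @ a + Wax @ x_t + ba)
    y_hats.append(softmax(Wya @ a + by))
y_hats = np.stack(y_hats)

print(y_hats.round(3))                    # one probability distribution per timestep
print(y_hats.argmax(axis=1))              # index of the most likely word at each timestep
```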
Perfect! I understand the concept much better now. One more question: the loss is computed at each timestep and summed over the timesteps for each example.
Consider the case where there are 10 timesteps in an RNN model. It predicts the right answer for the first 5 timesteps and the wrong answer for the last 5. We calculate the loss and take the sum over all 10 timesteps, and then the parameters are updated. The model won't know that it made mistakes only in the last 5 timesteps, so it can't update the parameters in a way that leaves the right answers unaffected. As the parameters are updated, they change the values going into the softmax function, which eventually changes the answers for all the timesteps, right?
Yes, the weights (parameters) are shared across all timesteps. You get gradients at every timestep by computing the loss function on the softmax outputs at each step. The \hat{y} value is compared with the y value for that timestep, and how far off it is, and in which direction, determines the gradients. For the steps with an answer close to the correct label, the gradients should be small, but they will be larger and have more effect for the steps with wrong answers. The gradients for each full iteration are applied to the parameters as the average of the gradients across the timesteps.
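To see why the wrong steps dominate, here is a minimal sketch with made-up numbers. It relies on the standard result that, for softmax followed by cross-entropy, the gradient with respect to the pre-softmax scores is \hat{y} - y. The two timesteps, the 3-word vocabulary, and the output bias `by` are assumptions for the example.

```python
import numpy as np

# Hypothetical softmax outputs over a 3-word vocabulary at two timesteps.
y_hat = np.array([
    [0.90, 0.05, 0.05],    # timestep 1: confident and correct
    [0.10, 0.80, 0.10],    # timestep 2: confident but wrong
])
y = np.array([
    [1.0, 0.0, 0.0],       # true word at timestep 1
    [0.0, 0.0, 1.0],       # true word at timestep 2
])

# Cross-entropy loss at each timestep.
loss_per_step = -np.sum(y * np.log(y_hat), axis=1)

# Gradient of each step's loss with respect to that step's pre-softmax scores.
dz = y_hat - y
grad_norm_per_step = np.linalg.norm(dz, axis=1)

# The output bias is shared across timesteps, so its gradient adds up the
# per-step contributions (assuming the total loss is the sum over timesteps).
dby = dz.sum(axis=0)

print(loss_per_step)
print(grad_norm_per_step)
print(dby)
```

The correct step contributes a gradient roughly ten times smaller than the wrong step, so the combined update is pulled mostly toward fixing the wrong answer.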
First, I agree with @paulinpaloalto: the inputs change at every timestep. The weights are also updated in every iteration, by small amounts if the error is small and by large amounts if the error is big. So if in the first iteration you get 5 right answers and 5 wrong answers, in the second iteration the updated weights may give 8 right answers and 2 wrong answers. Those 2 wrong answers may even be ones that were right in the first iteration, but at that point the weights were randomly initialized and may have been biased or overfitting. The weights get tuned to be better, so the accuracy increases and the error decreases, which means we are heading in the right direction.
Although 100% accuracy looks like the best result with the best weights, it will lead to overfitting, and that is not what we are looking for. We are looking for the weights that work well on new data and also give good accuracy on the data the model was trained on.
Cheers,
Abdelrahman
Never mind. One last question. The gradients are calculated at each timestep, and the update happens after one full epoch (one pass through the training set). The number of parameter updates in gradient descent is equal to the number of timesteps. Each update uses the average gradient for that particular timestep over the examples seen in that iteration.
Does that make sense?
It depends on whether you are doing Minibatch GD or Full Batch GD, but at each "batch" you are getting the average of the gradients across each timestep and across each sample. So the updates happen once per minibatch with that "double average".
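The "double average" can be sketched as follows. This is a hedged illustration only: the per-timestep gradient contributions are random stand-ins (a real implementation would get them from backpropagation through time), and the sizes and learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 8, 10               # hypothetical: 8 examples in the minibatch, 10 timesteps
n_a, n_x = 4, 3            # hypothetical hidden and input sizes

params = {
    "Wa": rng.normal(size=(n_a, n_a + n_x)) * 0.1,
    "ba": np.zeros(n_a),
}

# Stand-ins for the gradient contribution of every (example, timestep) pair.
contrib = {
    "Wa": rng.normal(size=(m, T, n_a, n_a + n_x)),
    "ba": rng.normal(size=(m, T, n_a)),
}

learning_rate = 0.01
old_params = {name: value.copy() for name, value in params.items()}

# One update per minibatch: average over examples (axis 0) and timesteps (axis 1),
# then take a single gradient descent step for each shared parameter.
for name in params:
    grad = contrib[name].mean(axis=(0, 1))
    params[name] = params[name] - learning_rate * grad

print(params["Wa"].shape, params["ba"].shape)
```

Note that there is exactly one update for Wa and one for ba per minibatch, not one per timestep. If an implementation sums over timesteps instead of averaging, the gradient for equal-length sequences differs only by the constant factor T, which the learning rate can absorb.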
So, for each training example, you get 10 gradients if there are 10 time steps. You take the average of those 10 gradients, then take the average of the gradients across all the examples in one full batch (if it is full-batch gradient descent), and then do the update.
There will be 2 update equations, one for Wa and one for ba, as Prof Ng said.
Is my understanding right?
Yes, that sounds correct.
Thank you so much. It really took me a lot of time to understand the concept. Thank you for your explanation and for being so patient.