I’m wondering if there is a bug in the code outline of Exercise 4 in the Dinosaur exercise (see code excerpt below). It seems that we initialise a_prev only once at the very beginning, but then keep re-using it for every iteration and every example sequence. Shouldn’t a_prev be set to a zero vector for every single example that the model is trained on? I.e. reset a_prev at the beginning of the for-loop so that a clean internal state is used for every example. The way it’s implemented now (or the way I understand the code), the internal state a_prev is shared across all sequences, which doesn’t make sense to me. I might be wrong, but the way I understood the theory, the internal state a_prev depends only on a single sequence, not on all sequences.
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: model
def model(data_x, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27, verbose = False):
    ... # removed code for better readability
    # Initialize the hidden state of your LSTM
    a_prev = np.zeros((n_a, 1))
    ... # removed code for better readability
    # Optimization loop
    for j in range(num_iterations):
        ... # removed code for better readability
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
Yes that’s true, but shouldn’t it be set to the zero vector after each run of optimize()? In other words, shouldn’t a_0 be the zero vector for each input sequence?
The model() checks how it is learning by sampling 7 dino names every 2000 iterations to see how the algorithm is doing. If a_prev keeps being reset to zero, how is it going to establish that relationship? The implementation notes can help clarify what the model is required to do.
I appreciate your comments and the time you take to answer my questions!
I already completed the exercise successfully, so I know how to solve it according to the “grader”. I just think it’s either not 100% right, or, more likely, that I’m making a stupid thinking error. I’m trying hard to figure out where my thinking goes wrong, but even after your last reply it doesn’t make sense to me yet.
First, we’re not sampling the same names over and over again; we’re just sampling any 7 names after every 2000 iterations (the code is misleading here, dino_names is just an integer). Further, a_prev from the model(…) method is not used in the sampling process at all. In fact, we use the sample(…) method to sample a name, and sample() correctly initializes its own a_prev to the zero vector every time it is called (see code below).
for name in range(dino_names):
    # Sample indices and print them
    sampled_indices = sample(parameters, char_to_ix, seed)
    last_dino_name = get_sample(sampled_indices, ix_to_char)
    print(last_dino_name.replace('\n', ''))
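And inside sample() itself the state is re-initialised on every call. Roughly (paraphrasing from memory, so not the exact notebook lines), it starts like this:

def sample(parameters, char_to_ix, seed):
    # Retrieve parameters and shapes (paraphrased sketch of the notebook's sample())
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    # Both the first input x and the hidden state a_prev start as zero vectors on every call
    x = np.zeros((vocab_size, 1))
    a_prev = np.zeros((n_a, 1))
    ... # rest of the sampling loop removed for readability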
Second, the way I see it, a_prev should only be set to zero for every new sequence we sample/train on, not during a forward pass within a sequence, so I don’t get your point regarding “establishing the relationship”. Also, if the sample(…) method initializes a_prev with zero every time we sample a new sequence, why wouldn’t the rnn_forward(…) method do the same for every sequence that we train on?
Third, when I implement the “fix” myself, i.e. change the code so that a_prev is initialized at the start of the for loop (see code below) and leave the rest unchanged, I get a better loss and comparable names. I know this doesn’t prove anything, but it’s another indicator that something might be off here.
def model(data_x, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27, verbose = False):
    ... # removed code for better readability
    # Optimization loop
    for j in range(num_iterations):
        X = ... # For clarity: X represents a random sequence at each iteration!
        Y = ... # For clarity: the corresponding Y for the given X
        ... # removed code for better readability
        # FIX: Set a_prev to 0 for every new sequence we train on
        a_prev = np.zeros((n_a, 1))
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
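One side note on this variant: since a_prev is re-initialised right before the call, the hidden state returned by optimize() is effectively discarded, so the last line could just as well read:

        curr_loss, gradients, _ = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)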
Congratulations on successfully completing this specialization. It is great that you are still thinking about the course content after you have completed the course, and questioning it where there is doubt. It is with conversations like this that we all learn and improve.
To answer your points, here are my observations:
Sampling
The sampling referred to here is about taking a sample of 7 dino names from the dataset that holds all the dino names. These names have been shuffled, so what is stored in the examples is a list of 7 shuffled dino names.
The training model
This model uses the list of shuffled examples for training. Each input sequence is an example taken from that list. Here x1 = y0, so X and Y are the sequences of integers that represent the characters in a dino name, with Y shifted one index to the left; see the notes for optimize() explaining its input arguments.
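To make the X/Y relationship concrete, here is a toy sketch (the mini vocabulary and name are made up; as I understand it, the real char_to_ix is built from the dataset, with '\n' at index 0):

# Hypothetical mini vocabulary and example name, only to show how X and Y relate
char_to_ix = {'\n': 0, 'a': 1, 'r': 2, 's': 3, 't': 4, 'u': 5}
example = "tarus"                                  # made-up name
X = [None] + [char_to_ix[ch] for ch in example]    # None stands in for a zero x<1> in the forward pass
Y = X[1:] + [char_to_ix['\n']]                     # Y is X shifted one step left, terminated by newline
print(X)   # [None, 4, 1, 2, 5, 3]
print(Y)   # [4, 1, 2, 5, 3, 0]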
Initialising a_prev
a_prev is initialised to zero at the start of the training loop, and this is correct because at the start we do not know what a_prev should be. However, once training has started, each example that is learned carries hidden-state values over to the next example, until the 7 examples have been processed. If a_prev were reset to zero for the processing of each example, then valuable information learned from the previous example would be lost. This is what I meant by the relationship.
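Concretely, what gets carried over looks roughly like this (a minimal sketch; X0/Y0 and X1/Y1 are placeholder names for two consecutive training examples):

a_prev = np.zeros((n_a, 1))                                          # only once, before training starts
curr_loss, gradients, a_prev = optimize(X0, Y0, a_prev, parameters)  # a_prev now holds the last hidden state from example 0
curr_loss, gradients, a_prev = optimize(X1, Y1, a_prev, parameters)  # example 1 starts from example 0's final state
# Resetting a_prev to zeros between these two calls would throw that state away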
Your output with a_prev reset to zero for each new sequence
Bear in mind that the training uses only 7 sequences, so the dataset is very small. What appears to be better cannot really be quantified here, so just hold that thought.
I hope I have in some way answered your questions. If not, do let us know.
I’m not sure what’s been logged as a solution here really is a solution. I think point 3 above is saying that we carry information over to new examples by re-using activations from the previous examples. I guess this might be correct, but it does look odd that the code updates a_prev with the activations of the final time step and feeds these back into the model as the initial set of activations, as shown in the code below.
Here a_prev is updated to the new value:
curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
And this is where that a_prev comes from in the return values of optimize():
return loss, gradients, a[len(X)-1]
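For reference, my understanding of optimize() as a whole is roughly the following (a paraphrased sketch of the graded function; rnn_forward, rnn_backward, clip and update_parameters are the course utility functions, so this is not the exact notebook code):

def optimize(X, Y, a_prev, parameters, learning_rate=0.01):
    # Forward pass starts from whatever a_prev the caller passes in
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    # Backpropagation through time, then clip the gradients to avoid exploding values
    gradients, a = rnn_backward(X, Y, parameters, cache)
    gradients = clip(gradients, 5)
    # Gradient descent parameter update
    parameters = update_parameters(parameters, gradients, learning_rate)
    # Return the hidden state of the last time step; model() feeds it back in as a_prev for the next example
    return loss, gradients, a[len(X) - 1]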