W1 Ex2: Why are we feeding 'a' of the last character into optimize?

Hi! In the code for the dinosaur name generation there is a loop calling optimize with a_prev. For the first iteration a_prev is all zeros, but for subsequent iterations it's set to the value of a after the last character of the previous example. This might make sense if the next name needed to depend on the previous name, as in continuous text. But in this case, since each name is independent, shouldn't we reset a_prev to zeros every time, to ensure the model learns to pick the most likely first character?

This is the line of code I’m referring to:
curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters)
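For context, here is roughly how that loop looks (a minimal sketch: optimize here is a stub standing in for the assignment's function, and the data and parameters are placeholders):

    import numpy as np

    n_a = 50  # hidden-state size, as in the assignment

    def optimize(X, Y, a_prev, parameters):
        # Hypothetical stub for the assignment's optimize(): one forward/backward
        # pass over a single name, returning the loss, the gradients, and the
        # hidden state after the LAST character of that name.
        a_last = np.tanh(np.random.randn(n_a, 1))  # placeholder value
        return 0.0, {}, a_last

    parameters = {}              # placeholder for the model weights
    a_prev = np.zeros((n_a, 1))  # all zeros only before the FIRST example

    for j in range(3):                  # loop over training examples (names)
        X, Y = [None, 1, 2], [1, 2, 0]  # placeholder index-encoded name
        # The a_prev returned here belongs to the END of this name, yet it is
        # fed in as the INITIAL state for the next, unrelated name.
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters)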

If I recall correctly, if you want a variety of names, you can’t pick the most likely letter every time.

Picking the most likely letter will give you the same name every time.

Thank you for looking into this. I believe the randomness of sampling with np.random.choice, together with the variety of dinosaur names in the dataset, should already be introducing variability into the output. I experimented with a simple dataset in which all names intentionally start with the same set of characters, and the model wouldn't learn that every name must begin with those characters until I replaced the call with: optimize(X, Y, np.zeros(a_prev.shape), parameters)

Hey @vladimire,
I am sorry for the delayed response, but let me add my final take here. There are two different aspects to discuss: the first is training and the second is inference.

First, let's talk about training. The purpose of the RNN here is to learn to model the dinosaur names present in the dataset. In this case, how a_prev is chosen for each iteration is simply a training choice: we can initialize a_prev to all zeros for every iteration, or only for the first iteration and, for later iterations, carry over the last value of a_prev from the previous one. (As a simple analogy, you can think of the latter as initializing a_prev with random values for every iteration, since, as you pointed out, one dinosaur name should have no effect on the next.) You can simply compare the two strategies during training and pick whichever fares better.
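To make the comparison concrete, here is a minimal sketch (initial_state is a hypothetical helper, not part of the assignment):

    import numpy as np

    def initial_state(a_prev, carry_over=True):
        # Hypothetical helper: choose the hidden state fed into optimize()
        # for the next name.
        #   carry_over=True  -> reuse the state left by the previous name
        #   carry_over=False -> reset to all zeros for every name
        return a_prev if carry_over else np.zeros(a_prev.shape)

    # inside the training loop, one would then call something like:
    # curr_loss, gradients, a_prev = optimize(X, Y, initial_state(a_prev), parameters)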

Now, let's come to the inference part, which I believe is the crux of this question. In this case, we use the RNN to generate dinosaur names from scratch. So how do we initialize a_prev for inference? The answer is very simple: if we use the model to generate names one at a time, we have no choice but to initialize a_prev with all zeros or with random values, so, once again, it's your choice.
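As a sketch, the initialization for generating a single name could look like this (shapes follow the assignment's conventions; the random variant is just one option):

    import numpy as np

    n_a, vocab_size = 50, 27  # hidden units and vocabulary size, as in the assignment

    # At inference time there is no previous name, so the first hidden state
    # has to be invented. Two simple options:
    a_prev = np.zeros((n_a, 1))               # deterministic all-zero start
    # a_prev = 0.01 * np.random.randn(n_a, 1) # or a small random start
    x = np.zeros((vocab_size, 1))             # "no character yet" input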

Now, as far as the diversity of names generated by the RNN is concerned, I believe it depends more on how we handle the probability distribution produced at each time-step than on how we initialize a_prev for each name. For instance, if we pick the most likely character at each time-step, we will always get the same name (assuming no other source of stochasticity). If we instead sample characters according to the probability distributions, we get much wider diversity in the names.
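Here is a minimal sketch of the difference, using a made-up distribution in place of the softmax output at one time-step:

    import numpy as np

    vocab_size = 27
    y = np.random.rand(vocab_size, 1)
    y = y / y.sum()  # stand-in for the softmax output at one time-step

    greedy_idx  = int(np.argmax(y))                          # identical on every run
    sampled_idx = np.random.choice(vocab_size, p=y.ravel())  # varies between calls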

This is my take on it. Now, let me respond to a few quotes from your reply as well, to make sure we are on the same page.

I just wanted to let you know that np.random.choice (as we have used it) samples according to the probability distribution, not uniformly at random, in case that was unclear.
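A quick way to see this (a toy check, nothing from the assignment):

    import numpy as np

    draws = [np.random.choice(3, p=[0.7, 0.2, 0.1]) for _ in range(10_000)]
    print(np.bincount(draws, minlength=3) / 10_000)  # roughly [0.7, 0.2, 0.1]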

As for this, if you initialize a_prev the same way for every generated sample (say, all zeros), and you pick only the most likely character at every time-step, you will still get the same name every time, regardless of the variety of names in the dataset. The reasoning is pretty simple: at t = 0 the inputs are the same, so the output distribution is the same, so the most likely character is the same, and so on for every subsequent step.
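To illustrate, here is a toy decoder (all weights and shapes are made up; greedy_name is a hypothetical function, not the assignment's code) showing that greedy decoding from a fixed initial state is fully deterministic:

    import numpy as np

    def greedy_name(Wax, Waa, Wya, a0, x0, length=5):
        # Hypothetical toy decoder: argmax at every step, no sampling.
        a, x, out = a0, x0, []
        for _ in range(length):
            a = np.tanh(Wax @ x + Waa @ a)   # hidden-state update
            z = Wya @ a
            y = np.exp(z) / np.exp(z).sum()  # softmax over characters
            idx = int(np.argmax(y))          # greedy pick
            out.append(idx)
            x = np.zeros_like(x)
            x[idx] = 1                       # feed the pick back in
        return out

    rng = np.random.default_rng(0)
    Wax, Waa, Wya = rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(5, 8))
    a0, x0 = np.zeros((8, 1)), np.zeros((5, 1))
    # Same inputs, same picks: the two runs produce the identical "name".
    print(greedy_name(Wax, Waa, Wya, a0, x0) == greedy_name(Wax, Waa, Wya, a0, x0))  # True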

As for this, I still stand by my statement: it's a training choice, and you can go with whichever suits your task better. I hope this helps.

P.S. - I will be deleting my previous replies, since I sense some muddled thinking on my part in those.

Cheers,
Elemento