This is the code provided by the exercise, but I don’t understand it at all.
What’s the difference between X and x?
What is len(X)? Is it the number of timesteps, or the number of features, i.e. the number of possible characters (27 in this case)?

def rnn_forward(X, Y, a0, parameters, vocab_size=27):
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    a[-1] = np.copy(a0)
    # initialize your loss to 0
    loss = 0
    for t in range(len(X)):
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # if X[t] == None, we just have x[t]=0. This is used to set the input for the first timestep to the zero vector.
        x[t] = np.zeros((vocab_size, 1))
        if (X[t] != None):
            x[t][X[t]] = 1
        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t - 1], x[t])
        # Update the loss by subtracting the cross-entropy term of this time-step from it.
        loss -= np.log(y_hat[t][Y[t], 0])
    cache = (y_hat, a, x)
    return loss, cache

We wrote code very similar to this in the RNN Step by Step exercise that was the first assignment in Week 1 of Sequence Models, right? It might be worth comparing this code to what you wrote earlier.

Yes, len(X) is the number of “timesteps” or elements in the sequence, not the number of features. The number of features is vocab_size, right?

The difference between x and X is explained in the comments: X holds the character indices, and x is the “one hot” version of X, built one timestep at a time.

Thanks for your reply, Paul. Yes, I did compare it with the function I wrote, but I still don’t understand this one. So does it mean that the X here is not yet one-hot encoded? In the previous exercise it was already one-hot encoded. X has dimension (n_x, m, T_x), right? n_x here is vocab_size.

The comment says “x[t] to be the one-hot vector representation of the t’th character in X”, so it looks like t indexes the training examples m.
So X in this case would have dimension (m, T_x), and X[t] would have dimension (1, T_x).
x[t] has dimension (n_x, 1).
But I’m lost again at X[t] != None: how can you test whether a vector is None or not?

Yes, they tell you that in the comments: X is not one-hot encoded. They are literally writing out the logic for you to create x as the one-hot encoded version of X, one timestep at a time. You can print the shapes of the inputs if you want to confirm what they are. The business about testing X[t] for None is also explained in the comments: the first element of X is apparently not set, so the input for timestep 0 is the zero vector.
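To make the None test concrete, here is a minimal sketch of just the encoding loop. It assumes X is a plain Python list whose entries are either None or an integer character index, so X[t] is a single value rather than a vector. The helper name one_hot_sequence and the sample indices are made up for illustration, and `is not None` is used as the idiomatic spelling of the exercise's `!= None` test.

```python
import numpy as np

def one_hot_sequence(X, vocab_size=27):
    """Hypothetical helper: one (vocab_size, 1) one-hot column per timestep of X."""
    x = {}
    for t in range(len(X)):
        # Start from the zero vector; it stays all zeros when X[t] is None.
        x[t] = np.zeros((vocab_size, 1))
        if X[t] is not None:
            # X[t] is a single integer index, so this sets exactly one row to 1.
            x[t][X[t]] = 1
    return x

# Assumed input: None for the first timestep, then character indices.
X = [None, 2, 6, 25, 16]
x = one_hot_sequence(X)

print(len(X))      # number of timesteps
print(x[0].shape)  # shape of one input vector
print(x[1][2, 0])  # the entry that was switched on for index 2
```

Here len(X) counts timesteps, while vocab_size only fixes the height of each x[t].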

Thanks so much, Paul, I think I understand now. So we only consider one training example here. X is a one-dimensional sequence containing the index (0-26) of each of its characters, and could be something like [2, 6, 25, 16]. The length of X is the number of timesteps of the RNN. x is a dict whose keys are the timesteps of the RNN and whose values are the one-hot encoded versions of those timesteps, so 2 would become [0, 0, 1, 0, …, 0], a vector of length 27.
But why does the input in the first exercise have dimension (n_x, m, T_x), where m is the number of training examples in the mini-batch, while in this exercise we only use one example at a time?
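A runnable sketch of the whole forward pass for one example can help confirm those shapes. Everything outside rnn_forward is an assumption made for illustration: the rnn_step_forward below is a stand-in (tanh hidden state, softmax output) rather than the course's own helper, the parameter names and the hidden size n_a = 8 are invented, the weights are random, and Y is simply taken to be X shifted by one position with an arbitrary final target.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a column vector.
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

def rnn_step_forward(parameters, a_prev, x):
    # Stand-in for the course helper (assumed form, not the official code).
    a_next = np.tanh(parameters["Wax"] @ x + parameters["Waa"] @ a_prev + parameters["b"])
    y_hat = softmax(parameters["Wya"] @ a_next + parameters["by"])
    return a_next, y_hat

def rnn_forward(X, Y, a0, parameters, vocab_size=27):
    x, a, y_hat = {}, {}, {}
    a[-1] = np.copy(a0)
    loss = 0.0
    for t in range(len(X)):
        x[t] = np.zeros((vocab_size, 1))
        if X[t] is not None:
            x[t][X[t]] = 1
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t - 1], x[t])
        # Cross-entropy: minus the log-probability assigned to the target index.
        loss -= np.log(y_hat[t][Y[t], 0])
    return loss, (y_hat, a, x)

rng = np.random.default_rng(0)
n_a, vocab_size = 8, 27  # n_a is an arbitrary hidden size for this sketch
parameters = {
    "Wax": rng.normal(scale=0.1, size=(n_a, vocab_size)),
    "Waa": rng.normal(scale=0.1, size=(n_a, n_a)),
    "Wya": rng.normal(scale=0.1, size=(vocab_size, n_a)),
    "b": np.zeros((n_a, 1)),
    "by": np.zeros((vocab_size, 1)),
}

X = [None, 2, 6, 25, 16]  # one training example, as character indices
Y = [2, 6, 25, 16, 0]     # assumed targets: X shifted by one, arbitrary last index
loss, (y_hat, a, x) = rnn_forward(X, Y, np.zeros((n_a, 1)), parameters)

print(len(x), x[1].shape, a[0].shape, y_hat[0].shape)
print(float(loss))
```

Each of x, a and y_hat ends up with one entry per timestep (a also keeps the initial state under key -1), and loss is a single scalar for the whole sequence.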

They specifically say in the comments that they are using Stochastic Gradient Descent here, so that means you only need to handle one sample at a time. Maybe they did that for simplification? Or maybe they have a priori knowledge that it works better than Minibatch for this particular case? I don’t know why they did it that way here. Of course the previous exercise was for the fully general case …
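For comparison, the two layouts can be related directly. The sketch below takes the per-timestep dict used here and stacks it into the (n_x, m, T_x) layout from the earlier exercise, with m = 1. The sample indices are made up, and the stacking step is only an illustration of how the shapes correspond, not something either exercise asks for.

```python
import numpy as np

n_x = 27                   # vocab_size
indices = [2, 6, 25, 16]   # one example, as character indices (made-up values)
T_x = len(indices)

# Per-timestep dict of (n_x, 1) one-hot columns, as in this exercise.
x = {t: np.eye(n_x)[:, [i]] for t, i in enumerate(indices)}

# Stack along a new last axis to get the (n_x, m, T_x) layout with m = 1.
batch = np.stack([x[t] for t in range(T_x)], axis=2)

print(x[0].shape)   # one timestep of one example
print(batch.shape)  # the same example in the batched layout
```

With Stochastic Gradient Descent the m axis always has length 1, which is why the dict of column vectors is enough here.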