RNN Shapes Clarification

Hi,

I am having a hard time understanding the 3D shapes specified in the assignment.

Can I please get an example of (na, m, Tx) and (ny, m, Ty)?

For example, let’s assume we have the following training set with output (good: 1, bad: 0):

  • The movie was good.
  • That was bad.
  • It seemed exciting.

len(dict/vocab): 1000

For the first sentence, are (na, m, Tx) and (nx, m, Tx) both (1000, 3, 4)? (This is vectorized.)

Also, what is (ny, m, Ty)? Is it (2, 3, ?)?

Lastly, what exactly is the “time step” in my example?

Prof Ng spends quite a bit of time on these issues in the lectures. It might be worth watching them again. In Sequence Models, there is quite a bit more flexibility in the way you map from inputs to outputs than there is in DNN or CNN architectures. Look for the lecture in which Prof Ng shows this information, which I wrote in my notes:

What if T_x is different from T_y?

  • many to many (same or different)
  • many to one
  • one to one
  • one to many

He then proceeds to give examples of all those types of networks and the types of problems they apply to. An example of many to many with different input and output counts would be translating sentences from English into French or vice versa: the “timesteps” in the input are the individual words, but there is no guarantee that the French translation will have the same number of words (it could be more or fewer, depending on the example). An example of “many to one” would be sentiment classification, where again T_x is the number of words in the input sentence (which varies per sample) and the output is a single value (either “Positive/Negative” or maybe a softmax output with more choices).
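To make "T_x varies per sample" concrete, here is a minimal sketch that counts the timesteps (words) in the three sentences from the question. The `tokenize` helper and the `<pad>` token are my own illustrative assumptions, not something from the course code; padding to the longest sentence is just one common way to get a single T_x for a batch.

```python
# The three sentences from the question above.
sentences = ["The movie was good.", "That was bad.", "It seemed exciting"]

def tokenize(sentence):
    """Hypothetical helper: lowercase, drop a trailing period, split on spaces."""
    return sentence.lower().rstrip(".").split()

tokens = [tokenize(s) for s in sentences]

# One timestep per word, so T_x differs from sample to sample.
lengths = [len(t) for t in tokens]

# To put all samples in one array, pad each one to the longest length.
T_x = max(lengths)
PAD = "<pad>"  # assumed padding token
padded = [t + [PAD] * (T_x - len(t)) for t in tokens]

print(lengths)  # words per sentence
print(T_x)      # padded sequence length used for the batch
for row in padded:
    print(row)
```

With this tokenization the first sentence has four timesteps and the other two have three, so the batch ends up with T_x = 4.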

To respond more specifically to your question:

n_a is the size of the “hidden state” of your RNN node, so the hidden states have shape (na, m, Tx); na is a hyperparameter you choose and is unrelated to the vocabulary size. If you mean the shape of the input, it would be (nx, m, Tx), which would be (1000, 3, 4) in your example, assuming one-hot word vectors and that the two three-word sentences are padded to the longest length of 4. Each word position is one timestep. Then the output would be (ny, m, Ty), which would be (2, 3, 1) in that case, because there is only one timestep in the output (the sentiment). You could probably get away with (1, 3, 1): a binary output is the special case of softmax with n = 2, so you really only need one value to represent the answer (meaning that a “one hot” vector with two elements is redundant).
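If it helps to see the shapes, here is a rough NumPy sketch of a many-to-one forward pass for this example. It is not the assignment's code: n_a = 5, the word indices, the weight names (Wax, Waa, Wya), the padding index 0, and the label for the third sentence (which the question does not give) are all assumptions made for illustration.

```python
import numpy as np

n_x, n_a, n_y = 1000, 5, 1   # vocab size; n_a = 5 is an arbitrary choice
m, T_x = 3, 4                # 3 sentences, padded to 4 timesteps

rng = np.random.default_rng(0)

# Made-up word indices into the vocabulary; index 0 is assumed to mean padding.
word_ids = np.array([
    [11, 12, 13, 14],   # the movie was good
    [15, 13, 16, 0],    # that was bad <pad>
    [17, 18, 19, 0],    # it seemed exciting <pad>
])                      # shape (m, T_x)

# One-hot input of shape (n_x, m, T_x).
x = np.zeros((n_x, m, T_x))
for i in range(m):
    for t in range(T_x):
        x[word_ids[i, t], i, t] = 1.0

# Randomly initialised parameters (untrained, shapes are what matter here).
Wax = rng.standard_normal((n_a, n_x)) * 0.01
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wya = rng.standard_normal((n_y, n_a)) * 0.01
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))

# Hidden states for every timestep: shape (n_a, m, T_x).
a = np.zeros((n_a, m, T_x))
a_prev = np.zeros((n_a, m))
for t in range(T_x):
    a_prev = np.tanh(Wax @ x[:, :, t] + Waa @ a_prev + ba)
    a[:, :, t] = a_prev

# Many to one: predict once, from the last hidden state only.
z = Wya @ a[:, :, -1] + by                # shape (n_y, m)
y_hat = 1.0 / (1.0 + np.exp(-z))          # sigmoid, one value per sample
y_hat = y_hat[:, :, np.newaxis]           # shape (n_y, m, T_y) with T_y = 1

# Labels: good = 1, bad = 0; the third label is an assumption.
y = np.array([1, 0, 1]).reshape(n_y, m, 1)

print(x.shape)      # (1000, 3, 4)
print(a.shape)      # (5, 3, 4)
print(y_hat.shape)  # (1, 3, 1)
```

Note that this sketch feeds the padding steps through the RNN like any other word; a real implementation would usually mask them out or use an embedding layer instead of raw one-hot vectors.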