DSL5 - What's the actual input data structure for an RNN?

In the NLP specialization, the data is always prepared as tensors of integer IDs, one per word, which are then passed to the Trax embedding layer.

Each sentence is represented as, for example, [12, 315, 532, 65, 321], based on the dictionary/vocab.

However, in this course, each word is represented as a one-hot vector, such as
[0, 0, … 12, …], [0, 0, … 315, 0, …], etc.

The two representations describe the same sentence, but they differ as data structures.

I would just like to know: what is the actual data structure used in the optimization algorithm?

Is the tensor representation merely a convenient way to pass the data to software like Trax, which then internally converts it to the one-hot representation?

Input to an RNN layer has the format (batch_size, sequence_length, num_features_per_timestep).

An Embedding layer is usually placed before an RNN layer when you want to learn the representation of inputs in embedding space. It accepts non-negative integers with shape (batch_size, sequence_length) as input and converts them to shape (batch_size, sequence_length, embedding_dim) for the RNN layer.
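As a concrete sketch of those shape transformations (using tf.keras here rather than Trax, with made-up sizes for the vocabulary, batch, and layers):

import numpy as np
import tensorflow as tf

# Example sizes, chosen arbitrarily for illustration
vocab_size, embedding_dim = 1000, 16
batch_size, sequence_length = 4, 5

# A batch of sentences as integer word IDs: (batch_size, sequence_length)
token_ids = np.random.randint(0, vocab_size, size=(batch_size, sequence_length))

embed = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
rnn = tf.keras.layers.SimpleRNN(units=8)

x = embed(token_ids)   # (batch_size, sequence_length, embedding_dim) = (4, 5, 16)
h = rnn(x)             # final hidden state: (batch_size, units) = (4, 8)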

The way you show the “one hot” representations is not quite right: there is only one non-zero element, and that element has the value 1, in the position corresponding to the index that is the category. So if the word is 12 in your “categorical” representation, then the one-hot vector has the 1 in position 12, i.e. the equivalent of what you would get by doing this sequence:

import numpy as np
vocab_size = 1000  # example vocabulary size
oneHotVector = np.zeros((vocab_size, 1))
oneHotVector[12] = 1

The reason for using one-hot representations is that they are very efficient in terms of compute time, although they obviously cost you in terms of memory. I believe they normally use one-hot representations when running the optimization (training) for computational efficiency, but it makes more sense to use the equivalent categorical representation when storing the data in some static form. Note that there are versions of the TF cost functions which accept the labels in categorical form, but my guess is that they convert to “one hot” on the fly “under the covers”.
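For reference, a small sketch of the two label formats in tf.keras (the logits and labels here are made up): SparseCategoricalCrossentropy takes integer category labels, CategoricalCrossentropy takes one-hot labels, and the two give the same loss value.

import tensorflow as tf

# Made-up logits for a batch of 2 examples over 3 classes
logits = tf.constant([[2.0, 1.0, 0.5],
                      [0.3, 2.5, 1.0]])

# Labels as integer category indices: (batch_size,)
sparse_labels = tf.constant([0, 1])
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# The equivalent one-hot labels: (batch_size, num_classes)
one_hot_labels = tf.one_hot(sparse_labels, depth=3)
dense_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

# Both produce the same loss value
print(sparse_loss(sparse_labels, logits).numpy())
print(dense_loss(one_hot_labels, logits).numpy())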