I had a doubt about why we represent the input as (nx, m, Tx) and the output as (ny, m, Ty), and why nx and ny can be different. I am confused about what advantage we get by representing it like this instead of (m, nx, Tx). Also, if I want the feature vector of a word at a particular time step, how do I extract it?
The (nx, m, Tx) and (ny, m, Ty) conventions are common in RNN implementations because they align well with how data is processed during both the forward and backward passes. Keeping m (the batch size) as the middle dimension allows efficient memory access during computation and makes it easy to grab a specific time step or batch; for example, to get the features of the first word over the entire batch, you just slice along the time dimension.

Having nx and ny as separate dimensions gives flexibility when the input and output feature sizes differ, for instance in machine translation, where the input and output vocabulary sizes may not match.

With (m, nx, Tx), the batch size comes first, which is common in fully connected networks but less so in RNNs, because of the need to handle sequences efficiently. A drawback of (m, nx, Tx) is that accessing the features at a given time step over all examples becomes less intuitive, and it may not fit as well with the computational optimizations of many deep learning libraries.

To extract the feature vectors of the words at a given time step t, you would typically do xt = x[:, :, t], which gives you a matrix of shape (nx, m). Here, xt contains the feature vectors for all words at time step t across all examples in the mini-batch.
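A minimal NumPy sketch of that slicing, with illustrative sizes I picked for the example (nx, m, Tx are not from any specific assignment):

```python
import numpy as np

# Hypothetical sizes: nx = feature size (e.g. one-hot vocab), m = batch size, Tx = sequence length
nx, m, Tx = 5000, 32, 10
x = np.random.randn(nx, m, Tx)

# Feature vectors of all words at time step t, across the whole mini-batch
t = 3
xt = x[:, :, t]
print(xt.shape)  # (5000, 32), i.e. (nx, m)
```

The same slice in the (m, nx, Tx) layout would be x[:, :, t] as well, but slicing out "all features for one example" or feeding the (nx, m) matrix directly into a matrix multiply with the weight matrices is what the (nx, m, Tx) convention makes direct.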
Sorry to interrupt, but (batch_size, timesteps, features) is in wide usage now. I'm guessing the underlying library has an influence on the input format; please see this link for how the oneDNN library encouraged PyTorch to support the channels-last approach.
Thanks for bringing this up! The choice of input format depends on the framework and the underlying hardware optimizations. You’re right, in PyTorch, the (batch_size, timesteps, features) format is widely used.
What will the size of the vector n_a be, and does it depend on any parameters of the RNN, for example the size of the vocab, or is it the designer's choice? Please clarify.
na is considered a hyperparameter of the RNN and is typically chosen by experimentation or using heuristics based on similar tasks. It does not directly depend on the other parameters. While nx and ny are determined by the problem at hand (e.g., vocabulary size, embedding size), na is chosen independently based on how much “memory” or “context” the RNN needs to keep to perform well on the task.
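To make the dependence concrete, here is a sketch of one RNN forward step in NumPy. The sizes are illustrative choices of mine, not from the course: na is picked freely, while nx and ny would be fixed by the task.

```python
import numpy as np

na, nx, ny, m = 64, 5000, 5000, 32  # na is a free hyperparameter; nx, ny come from the task

# Parameter shapes show where each dimension enters
Waa = np.random.randn(na, na)   # hidden-to-hidden: depends only on na
Wax = np.random.randn(na, nx)   # input-to-hidden: couples na with nx
Wya = np.random.randn(ny, na)   # hidden-to-output: couples ny with na
ba = np.zeros((na, 1))
by = np.zeros((ny, 1))

a_prev = np.zeros((na, m))      # previous hidden state
xt = np.random.randn(nx, m)     # input at one time step

# a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y<t> = Wya a<t> + by
a = np.tanh(Waa @ a_prev + Wax @ xt + ba)
y = Wya @ a + by
print(a.shape, y.shape)  # (64, 32) (5000, 32)
```

Notice that changing na only resizes the weight matrices; the input and output shapes stay (nx, m) and (ny, m), which is why na can be tuned independently.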
The first link I shared is from the TensorFlow docs. Both platforms support the channels-last format.
Thanks for pointing this out! TensorFlow and PyTorch often prefer the (batch_size, timesteps, features) format, mainly due to the performance benefits of optimizations such as those provided by the oneDNN library.