I could not clearly understand how RNNs and Transformers handle inputs and outputs of varying sequence lengths (i.e., the sequence lengths Tx and Ty).
I would appreciate some clarity on this. Thanks in advance for the help.
Which assignment / exercise are you referring to?
Not referring to any specific assignment. Take a basic translation task, for example, or a sentiment-classification problem on a set of product reviews. The input sequences, and maybe even the output sequences, can be of varying lengths. I just want to understand how RNNs and Transformers handle this.
Hey @Amit_Bhartia,
One of the simplest ways is to set a fixed length for all the inputs. To the best of my knowledge, this is pretty much the only way discussed in any of the assignments.
If any input is shorter than the fixed length, pad it with zeros (or another value, as the task requires). If an input is longer than the fixed length, trim it down to that length, and voila, all your inputs are the same size. I hope this helps.
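For example, here is a minimal sketch of this idea using Keras' `pad_sequences` utility (the token sequences below are made up purely for illustration):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Three tokenized inputs of lengths 2, 4 and 6 (illustrative values)
sequences = [[7, 2], [3, 8, 5, 1], [4, 9, 6, 2, 5, 3]]

# Pad shorter inputs with zeros and trim longer ones, so every row has length 4
padded = pad_sequences(sequences, maxlen=4, padding='post', truncating='post', value=0)
print(padded)
# [[7 2 0 0]
#  [3 8 5 1]
#  [4 9 6 2]]
```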
Regards,
Elemento
Hi @Elemento,
Thanks for replying. Yes, your explanation does shed some light on it.
In the “Dinosaur Island” assignment, the inputs (and outputs, with Tx = Ty) were of different character lengths, and there was no padding or truncating performed.
Even when we inspect the underlying building blocks (functions) of the RNN model used in that assignment: since Tx = Ty, the model simply loops over Tx for forward propagation and Ty for backpropagation. This led me to believe that RNNs can handle varying sequence lengths without any padding or truncating.
Since the model used stochastic gradient descent, and was therefore only processing one input-output pair per iteration, I wondered how it would handle this during mini-batch gradient descent.
Hey @Amit_Bhartia,
I would like to highlight a few more points to make this crystal clear, starting with a thank-you, since this question helped me revise my own understanding too.
Let’s start with the easy part. In Transformers, as far as I can recall, you will almost always find the concept of a fixed input length. Transformers commonly use something known as Positional Encoding (which you will learn about in future lecture videos), and to compute the positional encodings you need to know the input length in advance. I am not aware of research in which Transformers handle differently sized inputs, but I am sure some exists, so if you find it, do let us know. That’s the easy part.
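Just to make the dependence on input length concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper; notice that the whole table is built for a maximum length chosen up front:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: one row per position, one column per dimension."""
    pos = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                       # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])        # sine on even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])        # cosine on odd dimensions
    return angles

pe = positional_encoding(max_len=50, d_model=16)     # fixed length chosen in advance
print(pe.shape)  # (50, 16)
```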
Now, let’s discuss the more interesting part. In any sequence model, be it a simple RNN, a GRU, or an LSTM, the weights are shared and reused across all the cells (timesteps) in a single layer. This means that your sequence models can easily work with inputs of different sizes. However, there is one very important thing to keep in mind: the inputs within a single batch. When we run a sequence model on a batch of inputs, we run the model iteratively over the timesteps, so all the inputs in a single batch must be the same length, and this is where you will see zero-padding and trimming used.
However, there is no such restriction across batches: we can have batches with differently sized inputs. And as for inference, since we mostly predict on inputs one at a time, they can easily be of different lengths.
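For illustration, here is a hedged sketch of that batching idea in plain Python/NumPy; the helper below is made up for this example, not something from the course:

```python
import numpy as np
from collections import defaultdict

def batches_by_length(sequences, batch_size):
    """Illustrative helper: bucket sequences by length so that each batch
    contains same-length inputs, while lengths may differ across batches."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    for length, group in buckets.items():
        for start in range(0, len(group), batch_size):
            # Every member of this batch has the same length, so no padding
            # is needed within the batch.
            yield np.array(group[start:start + batch_size])

seqs = [[1, 2], [3, 4], [5, 6, 7], [8, 9, 1], [2, 3, 4]]
for batch in batches_by_length(seqs, batch_size=2):
    print(batch.shape)  # (2, 2), then (2, 3), then (1, 3) -- Tx varies across batches
```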
An extended version of what I have explained here, with a code sample, can be found on this amazing thread.
Now, coming to the Dinosaur assignment. There, you can see that we train the model one example at a time, i.e., with batch size = 1, and hence you will not find any zero-padding or trimming in that assignment. To sum it up, your belief that “RNNs can handle varying sequence lengths without any padding or truncating” is true to some extent, at least in the case of Stochastic Gradient Descent (batch size = 1). I hope this helps.
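To see why batch size 1 sidesteps padding entirely, here is a minimal NumPy sketch (random weights and inputs, with dimensions loosely modelled on the assignment); the forward loop simply runs for however many timesteps the current example has:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a = 27, 50                        # one-hot vocab size and hidden size
Wax = rng.standard_normal((n_a, n_x)) * 0.01
Waa = rng.standard_normal((n_a, n_a)) * 0.01
ba = np.zeros((n_a, 1))

def rnn_forward(x_seq):
    """Run a basic RNN over one example; Tx is simply len(x_seq)."""
    a = np.zeros((n_a, 1))               # initial hidden state
    for x_t in x_seq:                    # loop length adapts to this example's Tx
        a = np.tanh(Wax @ x_t + Waa @ a + ba)
    return a

short = [rng.standard_normal((n_x, 1)) for _ in range(4)]    # Tx = 4
longer = [rng.standard_normal((n_x, 1)) for _ in range(11)]  # Tx = 11
print(rnn_forward(short).shape, rnn_forward(longer).shape)   # (50, 1) (50, 1)
```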
Regards,
Elemento
Hi @Elemento,
This helps a great deal. Just wanted to ask a couple more follow-up questions.
Since sequence models (RNNs/LSTMs/GRUs) can handle different batches where each batch may have a different sequence length, is it really beneficial in real-world applications to do that, rather than use the much simpler 0-padding and trimming?
Also, from the perspective of loss computation, wouldn’t output sequences of varying lengths affect the loss values?
Hey @Amit_Bhartia,
I haven’t really compared the two approaches myself in a real-world application, so I can’t say for sure which one is better. I guess you would have to try both and find out for yourself, as per your application.
As for the second question, there is a very important fact that I am assuming you have missed. If the loss is computed on a batch of inputs that are longer than the average length, then that same loss is used to update the same shared parameters a greater-than-average number of times.
So the more values that go into the loss, i.e., the longer the inputs in a particular batch, the more timesteps contribute updates to the same parameters, and I guess this should roughly cancel out the effect of inputs in different batches having different lengths.
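Another common way to keep the loss comparable across batches is simply to average the per-timestep loss instead of summing it. A minimal NumPy sketch of that idea (the shapes and the helper name are illustrative, not from the course):

```python
import numpy as np

def mean_cross_entropy(probs, targets):
    """Average cross-entropy over batch and timesteps, so batches with a
    longer Tx do not automatically yield a larger loss (illustrative helper)."""
    # probs:   (batch, Tx, vocab) predicted probability distributions
    # targets: (batch, Tx) integer class indices
    batch, Tx, _ = probs.shape
    picked = probs[np.arange(batch)[:, None], np.arange(Tx)[None, :], targets]
    return -np.mean(np.log(picked))   # mean (not sum) over batch and timesteps

rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 5, 10))                        # batch=2, Tx=5, vocab=10
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
targets = rng.integers(0, 10, size=(2, 5))
print(mean_cross_entropy(probs, targets))
```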
This is just one intuition off the top of my head. I would like to request the other mentors @balaji.ambresh @anon57530071 @paulinpaloalto to take a look at this query and give their opinions.
Regards,
Elemento