Vanilla RNN slower than GRU?

In lab C3_W2_lecture_nb_2_RNNs, part 3, there is a challenge question:

Sometimes, although a rare occurrence, Vanilla RNNs take more time. Can you figure out what might cause this ?

Could you provide any guidance on how to answer this question?

Hi @Yuncheng_Liao

That is a good question. :+1:

To be fair, I’m not sure what this “rare occurrence” might be… I re-read the whole paragraph:

As you were told in the lectures, GRUs take more time to compute (However, sometimes, although a rare occurrence, Vanilla RNNs take more time. Can you figure out what might cause this ?). This means that training and prediction would take more time for a GRU than for a vanilla RNN. However, GRUs allow you to propagate relevant information even for long sequences, so when selecting an architecture for NLP you should assess the tradeoff between computational time and performance.

Taking the previous sentence as context - it is specifically talking about compute time (not training time to convergence, or any other kind of time) - I haven't encountered that rare occurrence myself. My random thoughts:

  • The CUDA implementation of GRUs could be faster than that of a vanilla RNN, but I don’t think that is what is being asked here… (and it wouldn’t be a “rare occurrence”)
  • The vanishing/exploding gradient problem of the vanilla RNN (compared to the GRU) should not influence compute time as a “rare occurrence”
  • Weight initialization should also not be the scapegoat for compute time
  • Comparing models of different sizes is also not a “rare occurrence”

In general, this seems impossible (given the same inputs/dataset, the same number of “units”, and the same DL framework…)

I don’t know :slight_smile:

P.S. maybe mentors from other specializations know the answer (@paulinpaloalto, @saifkhanengr, @Elemento, @TMosh) ?

I don’t have anything to add to this.

1 Like

That’s really challenging, or seems impossible to me. What might that “rare occurrence” be? Maybe a very long input sentence. Not sure. Following this thread to see what others add…

1 Like

Hey @Yuncheng_Liao,
Welcome, and we are glad that you could become a part of our community :partying_face:

As others have stated, for a single epoch, I can’t imagine why a GRU might take less time than a vanilla RNN, provided that we have the same number of layers, the same number of units, the same batch size, etc.
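Just to make the comparison concrete, here is the kind of quick timing check I have in mind, using PyTorch as a stand-in for the lab’s framework; the sizes are arbitrary illustrative values, and on the same hardware the GRU’s extra gate computations should make each forward pass slower, not faster:

```python
import time
import torch
import torch.nn as nn

# Illustrative sizes only - not the lab's actual hyperparameters.
batch, seq_len, d_in, d_hid = 32, 256, 128, 256
x = torch.randn(batch, seq_len, d_in)

rnn = nn.RNN(d_in, d_hid, batch_first=True)  # vanilla RNN: one matmul pair per step
gru = nn.GRU(d_in, d_hid, batch_first=True)  # GRU: three gate matmul pairs per step

def avg_forward_time(cell, n_runs=20):
    with torch.no_grad():
        cell(x)                              # warm-up so one-time setup cost is excluded
        start = time.perf_counter()
        for _ in range(n_runs):
            cell(x)
        return (time.perf_counter() - start) / n_runs

print(f"vanilla RNN forward: {avg_forward_time(rnn):.4f} s")
print(f"GRU forward:         {avg_forward_time(gru):.4f} s")
```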

But if we borrow a statement from the aforementioned paragraph, specifically the one below:

This means that training and prediction would take more time for a GRU than for a vanilla RNN.

Reflecting on this statement: if we define “training time” as the time taken to converge to a pre-defined threshold, and this threshold is the same for both the vanilla RNN and the GRU, then this rare occurrence could be the case where the vanilla RNN is unable to converge to that threshold, or takes a long time to converge, since it cannot propagate relevant information over long sequences.

Borrowing another statement:

… you should assess the tradeoff between computational time and performance.

In other words, we fix the performance threshold and then compare the computational time of the two networks. In that case, I definitely believe that a GRU can outshine a vanilla RNN in terms of computation time. But what confuses me is: should this be a rare occurrence? A GRU is clearly a more sophisticated network than a vanilla RNN, so wouldn’t it outshine the vanilla RNN in many cases?

Any thoughts @arvyzukai, @TMosh and @saifkhanengr on this?

Cheers,
Elemento

My intuition: as you mentioned, a GRU is more sophisticated than a vanilla RNN, so it will outshine the vanilla RNN in terms of performance, not in terms of time. A GRU has gates and more parameters than a vanilla RNN (keeping the hyperparameters the same for both), so it takes more time than the vanilla one. However, as the vanilla RNN has no memory term, it runs the risk of vanishing gradients on a long sentence. But that is a normal occurrence, and cannot count as the “rare occurrence” that Arvydas mentioned. Maybe a vanishing gradient with a short sentence could be a rare occurrence. Don’t know.
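As a back-of-the-envelope check of the “more parameters” point, here is a rough per-layer count (the sizes below are just made-up illustrative numbers, not the lab’s):

```python
# Rough per-layer parameter count; d_in and d_hid are arbitrary illustrative sizes.
d_in, d_hid = 128, 256

vanilla_rnn = d_hid * (d_in + d_hid) + d_hid    # W_x, W_h, and a bias
gru = 3 * (d_hid * (d_in + d_hid) + d_hid)      # the same block for update gate, reset gate, and candidate

print(vanilla_rnn, gru)  # the GRU has roughly 3x the parameters, hence roughly 3x the per-step matmuls
```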

Regarding your point on the pre-defined threshold, ultimately the root cause is long sequences, as you mentioned: “vanilla RNN is unable to converge… or takes a long time to converge, since it is unable to propagate relevant information for long sequences.” But, in my opinion, it is normal for a vanilla RNN to be unable to converge on a long input.

One thing that comes to my mind as a “rare occurrence” is an exploding gradient. Assume we are using clipping (to clip the gradient beyond some threshold). But at every iteration the gradient keeps exploding again, we clip it again, and the cycle goes on and the model never converges (roughly along the lines of the sketch below)… What do you say?
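Here, model, loss_fn, optimizer, and the data are placeholders, not code from the lab; the sketch just shows the clipping step I mean:

```python
import torch

def train_step(model, loss_fn, optimizer, x, y, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale the gradients so their global norm is at most max_norm.
    # If the raw gradients explode again on every iteration, each update is just
    # a fixed-size step in a shifting direction, and the loss may never settle
    # below the target threshold - so training effectively never finishes.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```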

Best,
Saif.

1 Like

Hey @saifkhanengr,

Usually when I see exploding gradients in any network and we try clipping, this pattern is quite common, i.e., we perform clipping and the gradient keeps exploding at every iteration. In fact, I have seen this trend very often with vanilla RNNs on long sequences, where they are unable to converge. So, again, the question arises: “Can we count this as a rare occurrence?”

I believe we can raise an issue regarding this with the team; what do you think? Perhaps they have incorrectly referred to these scenarios as “rare occurrences”.

Cheers,
Elemento

Hey @Elemento

OK, if we understand this “compute time” as “training time”, then yes, I can imagine a rare case where the vanilla RNN can only just barely achieve the performance, and this results in a longer training time (but eventually it gets there). In other words, in most cases the vanilla RNN’s training time should be shorter, except in this rare case when it is at the limit of its vanishing/exploding gradients (or parameter size).
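In that reading, the comparison would be something like the sketch below, where train_one_epoch and evaluate are hypothetical placeholders for whatever training loop is used; the “rare case” is the run where the vanilla RNN barely clears the threshold and burns many extra epochs doing so:

```python
import time

def time_to_threshold(model, train_one_epoch, evaluate, threshold, max_epochs=100):
    """Wall-clock time needed to push the validation loss below `threshold`."""
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch(model)
        if evaluate(model) <= threshold:
            return time.perf_counter() - start   # converged within the budget
    return float("inf")                          # never reached the threshold
```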

But to me, the question is worded as if we were being asked about compute time. For example, the following sentence includes the word “prediction”:

This means that training and prediction would take more time for a GRU than for a vanilla RNN.

In other words, we could rephrase the previous question as “what could be a rare case when Vanilla RNN takes more time to compute its prediction?”.

And also the original sentence:

As you were told in the lectures, GRUs take more time to compute…

If they had chosen the word “train” here instead of “compute”, then the sentence would not necessarily be correct - probably the opposite: usually the datasets are complex enough that GRUs reduce the loss faster and converge sooner. In other words, the “rare occurrence” should not be about training time.

By the way, thank you all for your thoughts :+1:

Hey @arvyzukai,
Indeed you are correct! In the markdown in the notebook, the word that has been used is “compute”, and that too for both “training” and “prediction”. Even after following this conversation, I am not sure myself what “rare occurrence” they are referring to in this context.

Cheers,
Elemento

I haven’t taken the NLP Specialization, so you guys know better than me whether to raise an issue or not. I am following this thread to learn.

1 Like

Hey Guys,
Well, it looks like all of us are in a fix as to what this “rare occurrence” might be. Let me create an issue, so that they can either modify the markdown or satisfy our curiosity :nerd_face:

Cheers,
Elemento

1 Like

Hey guys.

The truth has become evident… specifically when a Vanilla RNN encounters a black hole… just kidding… :slight_smile:

The true issue with the question lies not with RNNs themselves, but rather with the measurement of time (wall-clock time vs. actual computation time). It is likely that the sentence was added to address potential confusion arising from variations in cell execution times due to server load (the picture below).

However, as is evident, it only contributed to further confusion. As a result, the question was deleted :slight_smile:
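For anyone curious, the distinction looks roughly like this in plain Python (the computation inside is just a stand-in for a notebook cell’s work):

```python
import time

wall_start, cpu_start = time.perf_counter(), time.process_time()

result = sum(i * i for i in range(10_000_000))   # stand-in for the cell's actual work

wall = time.perf_counter() - wall_start          # wall-clock: includes waiting on a busy shared server
cpu = time.process_time() - cpu_start            # CPU time actually spent by this process
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
# On a loaded server the wall-clock number can be much larger than the CPU number,
# which is how a vanilla RNN cell can *appear* slower than a GRU cell on a given run.
```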

Thank you for your time :+1:

3 Likes

Hahaha :laughing::laughing::laughing:. Interesting…

1 Like

The original statement that, on rare occasions, a vanilla RNN can be slower than a GRU drew my attention. I thought it was worth some exploration, so I asked the question. It’s a surprise to me that we ended up with a whole discussion scrutinizing the original statement. Thank you guys for your help! :face_with_peeking_eye:

1 Like