Sub-word sequences

The last video in the course, “Diving into the code (part 2)”, mentions that we need a way to learn from sequences of sub-words, with emphasis on “sequences”, i.e. sub-words in a particular order. The topic will be addressed in the next course, about RNNs.

But haven't we been using ordered sequences all along? Where in the classification process was the order of the sub-words lost? I'm referring to the lab “Ungraded Lab: Subword Tokenization with the IMDB Reviews Dataset”.

Hi Dan,

Even if the input is given in order, layers like Dense and CNN do not look at the input as a sequence of steps; they look at all of it in one go. A Dense layer tries to figure out all the combinations of the inputs to make sense of them, while CNNs try to group words together based on the filter criteria you have set.

If the input is The Prestige is a good movie, the dense layer looks at those 6 words as one set. It's as good as feeding it tabular data where column one is The, column two is Prestige, and so on; everything ends up being processed independently. So if the statement had been The Prestige is a good movie, you should watch The Prestige, the dense layer is not designed by default to understand that the two occurrences of The Prestige are the same. It can learn this after many epochs, but it's not optimized for it.
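Here's a rough sketch of that tabular view in Keras (the vocab size, embedding size and sequence length are illustrative numbers, not the lab's actual settings). After flattening the embeddings, the Dense layer gets one weight for every (position, dimension) column, so the same phrase landing at different positions is weighted by completely different parameters:

```python
import tensorflow as tf

vocab_size, embed_dim, max_len = 1000, 16, 10   # illustrative values only

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Flatten(),                      # 10 positions x 16 dims -> 160 "columns"
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, max_len))

# One independent weight per column: "The Prestige" at positions 0-1
# is scored by different weights than "The Prestige" at positions 8-9.
print(model.layers[-1].weights[0].shape)            # (160, 1)
```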

CNNs, on the other hand, can learn the importance of groups of words together based on the filter settings, but they still don't understand the importance of the sequence within those groups.
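As a sketch of the CNN case (again with made-up sizes): a Conv1D filter of width 3 scores each window of 3 neighbouring words with the same weights, and global max pooling then keeps only the strongest match per filter, so the network can detect that a phrase occurred but not where it occurred or in what order the phrases appeared:

```python
import tensorflow as tf

vocab_size, embed_dim, max_len = 1000, 16, 10   # illustrative values only

cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # Each of the 32 filters scores every window of 3 neighbouring words.
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu"),
    # Global pooling keeps only the best match per filter, discarding position.
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
cnn.build(input_shape=(None, max_len))
cnn.summary()
```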

That's where RNNs excel: they actually understand the sequence. An RNN understands that if the current word is The, there's a good chance the next word will be Prestige, and so on.
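And a sketch of the recurrent version (same illustrative sizes): the LSTM reads the review one word at a time and carries a hidden state forward, which is what lets it condition its reading of the current word on everything it has seen so far:

```python
import tensorflow as tf

vocab_size, embed_dim, max_len = 1000, 16, 10   # illustrative values only

rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # Processes the sequence step by step, carrying state between words,
    # so seeing "The" changes how the next word is interpreted.
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
rnn.build(input_shape=(None, max_len))
rnn.summary()
```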

Hope this helps.


So it’s about the layers and not the input - thank you.

But even in a Dense layer, aren't the trained weights of each neuron at least influenced by the order of the words in the training sentences? Each neuron's output is a weighted sum, with the first weight applied to the first word, the second weight to the second word, and so on. If the order of the words changes then so does the output, so the order does seem to matter. Or not?

I'm ignoring tokenization, embedding and sub-wording here. And agreed that an RNN / LSTM / GRU is better optimized for sequences than a Dense layer.
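As a quick sanity check of that weighted-sum point (made-up numbers, a bare Dense layer with no embedding), swapping the inputs around does change the output, because each position has its own weight:

```python
import numpy as np
import tensorflow as tf

dense = tf.keras.layers.Dense(1, use_bias=False)

x          = np.array([[1.0, 2.0, 3.0, 4.0]], dtype="float32")  # "words" in order
x_reversed = np.array([[4.0, 3.0, 2.0, 1.0]], dtype="float32")  # same "words", reordered

# Same values, different positions -> (almost surely) different weighted sums.
print(dense(x).numpy(), dense(x_reversed).numpy())
```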

Hi Dan,

You are absolutely right that order matters in the input even in the case of Dense layers. The difference is that for a dense layer, the input sequence is seen as a row of a table and the input words are treated as independent columns; the dense layer tries to find relationships between these ‘columns’. So, going back to the previous example The Prestige is a good movie, you should watch The Prestige, the dense layer basically sees it as one row of input. That means The Prestige is a good movie, you should watch The Prestige is distinct from Is The Prestige a good movie, you should watch The Prestige, which in turn is distinct from A good movie worth watching is The Prestige.

The dense layer will treat these as three very different inputs and won't pick up on the importance of the order of the words within the sentences. A GRU or LSTM will be quick to spot that even though The Prestige occurs at 3 different locations in the sentences above, the two words go together: if the current word is The, it's likely that the word following it will be Prestige.

An LSTM/GRU tries to learn the importance of the relative positions of words in an input, while a Dense layer only understands absolute positions. For a Dense layer, The Prestige occurring at the start of the statement is very different from The Prestige occurring in the middle, which is very different from The Prestige occurring at the end of the sentence. It won't easily understand that the two words are related irrespective of their position in the input sequence.
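One way to see this in code (illustrative sizes again, and only a sketch): a Dense layer over the flattened sequence keeps a separate weight for every absolute position, while a GRU re-uses the same kernel at every timestep, which is what lets it recognise The followed by Prestige wherever the pair appears:

```python
import tensorflow as tf

embed_dim, max_len, units = 16, 10, 32   # illustrative values only

# Dense over a flattened sequence: a separate weight for every position.
dense = tf.keras.layers.Dense(units)
dense.build((None, max_len * embed_dim))
print(dense.kernel.shape)        # (160, 32): position-specific weights

# GRU: one kernel shared across all timesteps (relative, not absolute, position).
gru = tf.keras.layers.GRU(units)
gru.build((None, max_len, embed_dim))
print(gru.cell.kernel.shape)     # (16, 96): the same weights at every step
```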

Hope this clarifies.

Yes, it does clarify. Thank you!