Understanding the loss of this many-to-many architecture with an LSTM layer

I believe I have a problem understanding the basic structure of the architecture used in the week 3 graded assignment that tackles NER.

When one input (an array of shape (1, max_batch_length, embedding_dimensions)) is fed into the network, does every vector (corresponding to a word in the input) produce an output all the way through the dense layer and the LogSoftmax layer?

If so, shouldn’t the labels also be of shape (batch_size, max_batch_length, tag_map)? How is every output, which is a 17-dimensional vector, compared against its label?

Hi, @Kalana_Induwara_Wije.

This is a good question. Let me help you understand the shapes. Let’s say:
batch_size = 5
max_len = 30
embedding_dim = 50
hidden_dim = 50
Your input X1 is then of shape (5, 30): 5 sentences, each padded to a maximum of 30 words.

  1. You pass X1 to the embedding layer:
    you get emb_out of shape (5, 30, 50) # batch_size, seq_len, emb_dim
  2. You pass emb_out to the LSTM layer:
    you get lstm_out of shape (5, 30, 50) # batch_size, seq_len, hidden_dim
  3. You pass lstm_out to the dense layer:
    you get dense_out of shape (5, 30, 17) # batch_size, seq_len, num_tags
  4. You pass dense_out to LogSoftmax:
    you get logsoftmax_out of shape (5, 30, 17) # batch_size, seq_len, num_tags
    (a runnable sketch of these four steps follows right after this list)
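
Here is a minimal sketch of those four steps, written in PyTorch purely to make the shapes concrete; the assignment itself may use a different framework, and the layer names and vocab_size below are my own assumptions, not taken from the course:

```python
# Illustrative shape walkthrough (PyTorch); vocab_size is an assumed value.
import torch
import torch.nn as nn

batch_size, max_len = 5, 30
vocab_size, embedding_dim, hidden_dim, num_tags = 10000, 50, 50, 17

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
dense = nn.Linear(hidden_dim, num_tags)
log_softmax = nn.LogSoftmax(dim=-1)

X1 = torch.randint(0, vocab_size, (batch_size, max_len))  # (5, 30) padded token ids

emb_out = embedding(X1)                  # (5, 30, 50)  batch_size, seq_len, emb_dim
lstm_out, _ = lstm(emb_out)              # (5, 30, 50)  batch_size, seq_len, hidden_dim
dense_out = dense(lstm_out)              # (5, 30, 17)  batch_size, seq_len, num_tags
logsoftmax_out = log_softmax(dense_out)  # (5, 30, 17)  batch_size, seq_len, num_tags
print(logsoftmax_out.shape)              # torch.Size([5, 30, 17])
```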

LogSoftmax changes nothing about which tag gets predicted, but the loss is computed from its output, so I had to mention it.
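
A quick self-contained check of that claim (again PyTorch, as an illustration): log-softmax is a monotonic transform over the tag dimension, so the argmax is the same before and after it; only the loss needs the log-probabilities.

```python
import torch
import torch.nn as nn

scores = torch.randn(5, 30, 17)            # raw dense-layer outputs (random here)
log_probs = nn.LogSoftmax(dim=-1)(scores)  # log-probabilities per tag

# The highest raw score and the highest log-probability pick the same tag.
print(torch.equal(scores.argmax(dim=-1), log_probs.argmax(dim=-1)))  # True
```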

So now to your question: you are right that the model outputs shape batch_size x seq_len x num_tags, but when we make a prediction (the predict function) we take the argmax over the last dimension (the tags) and get predictions of shape (5, 30).
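
And on the labels: in the usual cross-entropy / negative-log-likelihood setup they stay as integer tag ids of shape (batch_size, max_len), and the loss simply looks up the log-probability of the true tag at every position, so you never need one-hot labels of shape (batch_size, max_len, num_tags). A hedged sketch in PyTorch (the assignment’s own loss and padding handling may differ; pad_tag below is a hypothetical id):

```python
import torch
import torch.nn.functional as F

batch_size, max_len, num_tags = 5, 30, 17
pad_tag = 0  # hypothetical tag id used for padded positions

log_probs = torch.randn(batch_size, max_len, num_tags).log_softmax(dim=-1)
labels = torch.randint(0, num_tags, (batch_size, max_len))  # integer tag ids, (5, 30)

# Loss: negative log-likelihood of the true tag at every position.
# nll_loss expects (N, C) inputs, so flatten the batch and sequence dimensions.
loss = F.nll_loss(log_probs.view(-1, num_tags), labels.view(-1),
                  ignore_index=pad_tag)   # padded positions are masked out

# Prediction: argmax over the tag dimension gives shape (5, 30).
predictions = log_probs.argmax(dim=-1)
print(loss.item(), predictions.shape)     # scalar loss, torch.Size([5, 30])
```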

Cheers!
