I believe I have a problem understanding the basic structure of the architecture used in the Week 3 graded assignment, which tackles NER.
When one input (an array of shape (1, max_batch_length, embedding_dimensions)) is fed into the network, does every vector (corresponding to a word in the input) produce an output all the way through the dense layer and the LogSoftmax layer?
If so, shouldn't the labels also be of shape (batch_size, max_batch_length, tag_map)? How is every output, which is a 17-dimensional vector, matched against a single label?
Hi, @Kalana_Induwara_Wije.
This is a good question; let me help you understand the shapes.
Let's say:
- batch_size = 5
- max_len = 30
- embedding_dim = 50
- hidden_dim = 50

Say your input X1 is of shape (5, 30): 5 sentences, each padded to a maximum of 30 words.

- You pass X1 to the embedding layer: you get emb_out of shape (5, 30, 50)  # batch_size, seq_len, emb_dim
- You pass emb_out to the LSTM layer: you get lstm_out of shape (5, 30, 50)  # batch_size, seq_len, hidden_dim
- You pass lstm_out to the dense layer: you get dense_out of shape (5, 30, 17)  # batch_size, seq_len, num_tags
- You pass dense_out to LogSoftmax: you get logsoftmax_out of shape (5, 30, 17)  # batch_size, seq_len, num_tags
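The walkthrough above can be traced with a plain NumPy sketch. This is my own stand-in, not the assignment code: random weight matrices replace the trained embedding, LSTM, and dense layers (a single matrix multiply fakes the LSTM, since only the shapes matter here), but the shape at each stage is exactly the one listed.

```python
import numpy as np

batch_size, max_len = 5, 30
emb_dim, hidden_dim, num_tags = 50, 50, 17
vocab_size = 1000  # hypothetical vocabulary size, just for the demo

rng = np.random.default_rng(0)

# Toy input: integer token IDs, padded to max_len -> shape (5, 30)
X1 = rng.integers(0, vocab_size, size=(batch_size, max_len))

# Embedding lookup: each token ID becomes an emb_dim vector -> (5, 30, 50)
emb_table = rng.normal(size=(vocab_size, emb_dim))
emb_out = emb_table[X1]

# Stand-in for the LSTM: any per-position map to hidden_dim -> (5, 30, 50)
lstm_out = emb_out @ rng.normal(size=(emb_dim, hidden_dim))

# Dense layer projects each position to num_tags scores -> (5, 30, 17)
dense_out = lstm_out @ rng.normal(size=(hidden_dim, num_tags))

# LogSoftmax over the tag axis (numerically stable form) -> (5, 30, 17)
m = dense_out.max(axis=-1, keepdims=True)
logsoftmax_out = dense_out - (m + np.log(np.exp(dense_out - m).sum(axis=-1, keepdims=True)))

print(emb_out.shape, lstm_out.shape, dense_out.shape, logsoftmax_out.shape)
```

Note that every layer acts per position: the batch and sequence axes pass through untouched, and only the last axis changes size.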
LogSoftmax doesn't change which tag scores highest (it's a monotonic transformation), but the loss is computed from its output, so it's worth mentioning.
So, to your question: you are right that the model outputs shape batch_size × seq_len × num_tags, but when we make a prediction (in the predict function), we take the argmax over the last dimension (the tags) and get a prediction of shape (5, 30). The labels themselves stay as shape (5, 30) integer tag IDs rather than one-hot vectors; the cross-entropy loss uses each label to pick out the corresponding entry of the 17-dimensional log-probability vector.
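The prediction step is a one-liner. A minimal sketch, assuming a (5, 30, 17) array of log-probabilities like the one above (random values here, since only the shape matters):

```python
import numpy as np

batch_size, max_len, num_tags = 5, 30, 17

# Stand-in for the model's log-softmax output
logsoftmax_out = np.random.default_rng(0).normal(size=(batch_size, max_len, num_tags))

# argmax over the tag axis picks one tag ID per word -> shape (5, 30)
pred = np.argmax(logsoftmax_out, axis=-1)
print(pred.shape)  # (5, 30)
```

The result has the same shape as the padded label array, so predictions and labels can be compared element-wise (masking out the pad positions).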
Cheers!