I am struggling a bit with the purpose of the dense layer in the week 3 assignment (LSTM for Named Entity Recognition). In fact, I am not even sure what output from the LSTM is being passed to the dense layer (is it the predictions, or the hidden states, or something else?). If anybody can offer an explanation or point me at some reading that would explain it, I would be grateful.
By the way, I have already passed the assignment. I don’t need help with that. I just want to understand…
I recently answered a similar question explaining the shapes of Tensors that pass through the model:
To answer your question directly:
The dense layer projects the LSTM output to as many outputs as you need - 17 in this case, one per tag - which softmax then turns into the probability of each tag.
Thank you for your reply. That did actually help, although I still have some uncertainty as to exactly how the dense layer is applied. Taking your example and reducing the batch size to 1 (I don’t think batch size should affect the intuition significantly, at least for prediction), we have an array with 30 rows (one for each word) and 50 columns (one for each embedding dimension). I now understand that the dense layer is reducing this to an array of 30 rows and 17 columns, where the columns represent tags. My remaining questions are: first, I assume that the rows of lstm_out come from the h in the LSTM architecture diagram (a in Prof Ng’s notation). Is that correct? Second, can I think of this dense layer as a separate dense layer for each row (word)? If yes, does each of those dense layers have the same weights?
The dense layer reduces the hidden dimension (50 in that example) to 17; it does not change the number of rows (30).
You can replace the batch size of 5 with 1 and the other dimensions in the example I mentioned would remain the same. (Note that the LSTM hidden dimension could be any number, e.g. 40, in which case the dense layer would reduce 40 to 17.)
Here is another example with batch_size=1:
X1 dimensions (1, 30) # 1 sentence, 30 words (padded/truncated)
after embedding - (1, 30, 50)
after LSTM - (1, 30, 40)
after Dense - (1, 30, 17)
apply softmax - (1, 30, 17)
Now you have 1 sentence with 30 words (padded/truncated) and 17 tag probabilities per word. The predict function later takes this output and picks the argmax over the last dimension - the tag with the biggest probability for each word - so as a result you get (1, 30): 1 sentence, 30 tags (in place of the padded/truncated words).
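If it helps to see those shapes concretely, here is a minimal sketch using tf.keras layers (the assignment itself may use a different framework, and the vocabulary size here is made up):

```python
import numpy as np
import tensorflow as tf

vocab_size = 35000  # hypothetical vocabulary size, just for illustration

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=50),  # (1, 30) -> (1, 30, 50)
    tf.keras.layers.LSTM(40, return_sequences=True),                 # (1, 30, 50) -> (1, 30, 40)
    tf.keras.layers.Dense(17),                                        # (1, 30, 40) -> (1, 30, 17)
])

x1 = np.random.randint(0, vocab_size, size=(1, 30))  # 1 sentence of 30 word ids (padded/truncated)
logits = model(x1)                                    # shape (1, 30, 17)
probs = tf.nn.softmax(logits, axis=-1)                # 17 probabilities per word
tags = tf.argmax(probs, axis=-1)                      # shape (1, 30): one predicted tag id per word
print(logits.shape, tags.shape)
```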
As you know, a dense layer is just W and b matrices with shapes that project the LSTM output to the shape we want (e.g. W.shape (40, 17) and b.shape (17,)). So if you apply np.matmul(lstm_out, W) + b you basically transform (1, 30, 40) into (1, 30, 17).
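And to your earlier question about whether each row (word) gets its own dense layer: it is the same W and b applied to every word position. A quick numpy check with random stand-in values:

```python
import numpy as np

lstm_out = np.random.randn(1, 30, 40)   # 1 sentence, 30 words, 40 LSTM units
W = np.random.randn(40, 17)             # one weight matrix, shared across all 30 positions
b = np.random.randn(17)

dense_out = np.matmul(lstm_out, W) + b  # (1, 30, 40) @ (40, 17) -> (1, 30, 17)
print(dense_out.shape)                  # (1, 30, 17)

# Projecting a single word's LSTM output with the same W and b gives the same
# result as that word's row in the batched matmul, i.e. the weights are shared.
row0 = lstm_out[0, 0] @ W + b
print(np.allclose(row0, dense_out[0, 0]))  # True
```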