I am struggling a bit with the purpose of the dense layer in the week 3 assignment (LSTM for Named Entity Recognition). In fact, I am not even sure what output from the LSTM is being passed to the dense layer (is it the predictions, or the hidden states, or something else?). If anybody can offer an explanation or point me at some reading that would explain it, I would be grateful.
By the way, I have already passed the assignment. I don’t need help with that. I just want to understand…
I recently answered a similar question explaining the shapes of Tensors that pass through the model:
To answer your question directly:
The dense layer projects the LSTM output to as many outputs as you need - 17 in this case, one per tag - which softmax then turns into the probability of each tag.
Thank you for your reply. That did actually help, although I still have some uncertainty as to exactly how the dense layer is applied. Taking your example and reducing the batch size to 1 (I don’t think batch size should affect the intuition significantly, at least for prediction), we have an array with 30 rows (one for each word) and 50 columns (one for each embedding dimension). I now understand that the dense layer is reducing this to an array of 30 rows and 17 columns, where the columns represent tags. My remaining questions are: first, I assume that the rows of lstm_out come from the h in the LSTM architecture diagram (a in Prof Ng’s notation). Is that correct? Second, can I think of this dense layer as a separate dense layer for each row (word)? If yes, does each of those dense layers have the same weights?
The dense layer reduces the hidden dimension (50 in that example) to 17; it does not change the number of rows (30).
You can replace the batch size of 5 with 1 and the other dimensions in the example I mentioned would remain the same. (Note that the LSTM hidden dimension could be any number, e.g. 40, in which case the dense layer would reduce 40 to 17.)
Here is another example with batch_size=1:
X1 dimensions (1, 30) # 1 sentence, 30 words (padded/truncated)
after embedding - (1, 30, 50)
after LSTM - (1, 30, 40)
after Dense - (1, 30, 17)
apply softmax - (1, 30, 17)
Now you have 1 sentence with 30 words (padded/truncated) and 17 tag probabilities per word. The predict function later takes this output and picks the argmax over the last dimension - the tag with the biggest probability for each word - so as a result you get (1, 30): 1 sentence, 30 tags (in place of the padded/truncated words).
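If it helps to see those shapes concretely, here is a minimal sketch using tf.keras layers (the assignment itself may use a different framework, and the vocabulary size here is made up):

```python
import numpy as np
import tensorflow as tf

vocab_size = 35000  # hypothetical vocabulary size, just for illustration

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=50),  # (1, 30) -> (1, 30, 50)
    tf.keras.layers.LSTM(40, return_sequences=True),                 # (1, 30, 50) -> (1, 30, 40)
    tf.keras.layers.Dense(17),                                        # (1, 30, 40) -> (1, 30, 17)
])

x1 = np.random.randint(0, vocab_size, size=(1, 30))  # 1 sentence of 30 word ids (padded/truncated)
logits = model(x1)                                    # shape (1, 30, 17)
probs = tf.nn.softmax(logits, axis=-1)                # 17 probabilities per word
tags = tf.argmax(probs, axis=-1)                      # shape (1, 30): one predicted tag id per word
print(logits.shape, tags.shape)
```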
As you know, a dense layer is just W and b matrices with shapes that project the LSTM output to the shape we want (e.g. W.shape (40, 17) and b.shape (17,)). So if you apply np.matmul(lstm_out, W) + b you basically transform (1, 30, 40) into (1, 30, 17).
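And to your earlier question about whether each row (word) gets its own dense layer: it is the same W and b applied to every word position. A quick numpy check with random stand-in values:

```python
import numpy as np

lstm_out = np.random.randn(1, 30, 40)   # 1 sentence, 30 words, 40 LSTM units
W = np.random.randn(40, 17)             # one weight matrix, shared across all 30 positions
b = np.random.randn(17)

dense_out = np.matmul(lstm_out, W) + b  # (1, 30, 40) @ (40, 17) -> (1, 30, 17)
print(dense_out.shape)                  # (1, 30, 17)

# Projecting a single word's LSTM output with the same W and b gives the same
# result as that word's row in the batched matmul, i.e. the weights are shared.
row0 = lstm_out[0, 0] @ W + b
print(np.allclose(row0, dense_out[0, 0]))  # True
```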