From the data generator, the training data is a set of batches, and each batch consists of multiple sentences.
However, when the training data is fed into the model, does the classifier below process all the words in one sentence at once and then take the mean over all the words before the final layer, since there is only one target value per sentence? i.e., for each sentence, is the output of the embedding layer of shape (seq_length, embed_dim)?
`tl.Mean` here takes axis=1, which seems incorrect to me, since we want the average of each attribute across the seq_length words, so I would expect axis=0 to be correct. However, axis=0 generates errors while axis=1 does not. Why?
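To make my confusion concrete, here is a minimal NumPy sketch of the shapes I have in mind. The shapes (batch of 2 sentences, 5 tokens, embed_dim 4) are made up for illustration; I am assuming the embedding output carries a leading batch dimension, in which case axis=1 would be the seq_length axis:

```python
import numpy as np

# Hypothetical shapes: a batch of 2 sentences, 5 tokens each, embed_dim 4.
batch_size, seq_length, embed_dim = 2, 5, 4
embeddings = np.random.rand(batch_size, seq_length, embed_dim)

# Averaging over axis=1 collapses the seq_length dimension,
# leaving one embed_dim-sized vector per sentence.
sentence_vectors = embeddings.mean(axis=1)
print(sentence_vectors.shape)  # (2, 4)

# Averaging over axis=0 would instead collapse the batch dimension,
# mixing different sentences together.
print(embeddings.mean(axis=0).shape)  # (5, 4)
```

So if the batch dimension is present, axis=1 averages across words within each sentence, whereas axis=0 would average across sentences in the batch. Is that why axis=0 errors out here?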