Hi @Peixi_Zhu
Yes, you understand that correctly. In addition, there is usually a batch_size dimension in front.
In other words, if the input is [n_sentences, n_tokens_padded] (n_sentences here is equivalent to batch_size), then the output of the embedding layer is [n_sentences, n_tokens_padded, embedding_size], for example (32, 64, 1024). A simple example follows.
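Here is a minimal sketch of that shape transformation with trax (the vocab size and all shapes are just the made-up example numbers from above, not anything specific to your model):

```python
import numpy as np
import trax
import trax.layers as tl

# Hypothetical sizes: a vocab of 10_000 token ids, 1024-dimensional embeddings
embed = tl.Embedding(vocab_size=10_000, d_feature=1024)

# A batch of 32 padded sentences, each 64 token ids long
x = np.zeros((32, 64), dtype=np.int32)
embed.init(trax.shapes.signature(x))

y = embed(x)
print(y.shape)  # (32, 64, 1024) -> [n_sentences, n_tokens_padded, embedding_size]
```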
I’m not sure I understand. In general, you are the one who tells trax what size you want each layer to be (and you are the one who has to make sure those sizes are reasonable).
Yes, absolutely. Under the hood it is very similar to a Dense (linear) layer, as you said in your first question: it takes the n-th token id (for example 54) and returns a vector (for example, a row of 1024 numbers) whose values are updated according to the loss during training.
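To illustrate the lookup idea only (this is a toy numpy sketch with made-up sizes, not trax's actual internals verbatim):

```python
import numpy as np

# An embedding is essentially a [vocab_size, embedding_size] weight matrix
# that is trained like Dense weights.
vocab_size, embedding_size = 100, 8          # small made-up numbers
weights = np.random.randn(vocab_size, embedding_size)

token_id = 54                 # the n-th token in the vocabulary
vector = weights[token_id]    # row lookup -> one embedding_size-long vector
print(vector.shape)           # (8,)
```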
Cheers