To get the actual weight matrices (and their shapes), you can play around with the model variable, like:

These are the embedding weights, of shape (33300, 1024), i.e. (vocab_size, dim_for_LSTM).

The Embedding layer uses vocab_size to initialize this weight matrix (each token is represented by a vector of 1024 numbers).

Other layers, like the LSTM, have more complicated weight matrices (as you know, LSTMs involve more complicated calculations). For example, model.sublayers[0].sublayers[1].sublayers[0].sublayers[1].weights[1][0][0].shape would result in (2048, 4096), which means that one weight matrix holds parameters for more than one component. You may find this post interesting (about embedding and LSTM calculations).
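To see how one matrix can hold parameters for several components: a common fused LSTM layout concatenates the input (1024) with the hidden state (1024) into a 2048-vector, and stacks the four gate matrices side by side into 4096 columns. A minimal numpy sketch; the dimensions are taken from the shape above, but the exact gate layout is an assumption about this particular implementation:

```python
import numpy as np

d_model, d_hidden = 1024, 1024                     # embedding size and LSTM hidden size
W = np.zeros((d_model + d_hidden, 4 * d_hidden))   # one fused weight matrix, (2048, 4096)

x = np.zeros(d_model)                              # current input vector
h = np.zeros(d_hidden)                             # previous hidden state
xh = np.concatenate([x, h])                        # shape (2048,)

gates = xh @ W                                     # shape (4096,)
# split into the four LSTM gate pre-activations, 1024 numbers each
i, f, g, o = np.split(gates, 4)
print(W.shape, gates.shape, i.shape)               # (2048, 4096) (4096,) (1024,)
```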

Please correct me if I am wrong. It seems to me model.sublayers[0].sublayers[1].sublayers[0].sublayers[0].weights.shape gives me the size of the embedding layer's weights. Say the input to the embedding layer has n tokens. Does the layer generate output of shape [n, 1024]? Basically, from the [33300, 1024] matrix it extracts the rows corresponding to the input token indices, correct?

In general, does TRAX tell you the input/output size of each layer?

Yes, you understand that correctly. In addition, there is usually a batch_size dimension in front.

In other words, if the input is [n_sentences, n_tokens_padded] (n_sentences here is equivalent to batch_size), then the output of the embedding layer is [n_sentences, n_tokens_padded, embedding_size] (for example, (32, 64, 1024)). A simple example.
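A minimal numpy sketch of that shape arithmetic; the sizes are the ones from the example above, and the zero matrix simply stands in for the trained weights:

```python
import numpy as np

vocab_size, embedding_size = 33300, 1024
emb = np.zeros((vocab_size, embedding_size))             # stands in for the weight matrix

batch = np.random.randint(0, vocab_size, size=(32, 64))  # [n_sentences, n_tokens_padded]
out = emb[batch]                                         # row lookup for every token id
print(out.shape)                                         # (32, 64, 1024)
```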

I’m not sure I understand. In general, you are the one who tells trax what size you want each layer to be (and you are the one who has to make sure those sizes are reasonable).

Yes, absolutely. Under the hood it is very similar to a Dense (linear) layer, as you said in the first question - it takes the n-th token (for example 54) and returns some vector (for example a 1024-long row of numbers) which is updated according to the loss during training.
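The "similar to a Dense layer" point can be made concrete: looking up row n of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix, which is what a bias-free Dense layer would compute. A small numpy sketch with toy sizes (not the real 33300 x 1024 matrix):

```python
import numpy as np

vocab_size, d_feature = 100, 8
emb = np.random.randn(vocab_size, d_feature)

token = 54
lookup = emb[token]                 # embedding: just take the 54th row

one_hot = np.zeros(vocab_size)
one_hot[token] = 1.0
dense = one_hot @ emb               # Dense-style matmul with a one-hot input

print(np.allclose(lookup, dense))   # True
```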

“I’m not sure I understand. In general, you are the one who tells trax what size you want each layer to be (and you are the one who has to make sure those sizes are reasonable).”

For this part, I am not asking about hyperparameters like the number of neurons, etc. I am asking whether, once those hyperparameters are fixed and the input data is given, there is a way to check the dimensions of the data (intermediate or output) as it passes through each layer. That would help me better understand the details of the model, e.g., the attention layer.
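One framework-agnostic way to do that check is to run the input through the layers one at a time and print each intermediate shape. A hedged sketch with plain numpy callables standing in for real layers; the layer sizes here are made up for illustration, but you can apply a model's sublayers in the same stepwise fashion:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 16                  # toy sizes, just for illustration
emb = rng.standard_normal((vocab_size, d_model))
proj = rng.standard_normal((d_model, 10))

# toy "layers": each is just a function from array to array
layers = [
    ("embedding", lambda x: emb[x]),            # token ids -> vectors
    ("mean_pool", lambda x: x.mean(axis=1)),    # average over the token axis
    ("dense",     lambda x: x @ proj),          # final projection
]

x = rng.integers(0, vocab_size, size=(32, 64))  # [batch, n_tokens_padded]
print("input", x.shape)                         # input (32, 64)
for name, layer in layers:
    x = layer(x)
    print(name, x.shape)                        # shape after each layer
```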