I noticed that the dense layer in the feed-forward block uses n_units=d_ff, while the dense layer at its output uses n_units=d_model. What is the difference between d_ff and d_model, and why is d_ff set to about four times the size of d_model?
Hi @PZ2004
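Both are hyperparameters of the model: d_model is the width of the embeddings and attention layers, and d_ff is the width of the hidden layer inside the feed-forward block. The first dense layer expands each position from d_model to d_ff, and the output dense layer projects it back to d_model so the result matches the rest of the model. These values are not learned during training; they are what data scientists (not the model) search over to get the best performance.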
In the original paper, Attention Is All You Need, the authors found that these values worked best for their task: the base model uses d_model = 512 and d_ff = 2048, which is where the factor of four comes from. Check Section 6.2, Model Variations, to see what else they tried and what the results were.
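If it helps to see the shapes, here is a minimal pure-Python sketch of the feed-forward block. It is not the course or library code: the helper names (`dense`, `init_dense`, `feed_forward`, `feed_forward_param_count`) are made up for illustration, the tiny sizes (4 and 16) are chosen only to keep it readable, and I am assuming a ReLU between the two dense layers as in the paper.

```python
import random


def init_dense(n_in, n_units, rng):
    """Create weights (n_in x n_units) and a zero bias for a dense layer."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(n_units)] for _ in range(n_in)]
    bias = [0.0] * n_units
    return weights, bias


def dense(x, weights, bias):
    """Compute y = x @ W + b for a single input vector x."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def feed_forward(x, expand, project):
    """Expand d_model -> d_ff, apply ReLU, then project d_ff -> d_model."""
    hidden = [max(0.0, h) for h in dense(x, *expand)]
    return hidden, dense(hidden, *project)


def feed_forward_param_count(d_model, d_ff):
    """Weights plus biases of the two dense layers."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


rng = random.Random(0)
d_model, d_ff = 4, 16  # toy sizes, keeping the 4x ratio

expand = init_dense(d_model, d_ff, rng)    # like Dense(n_units=d_ff)
project = init_dense(d_ff, d_model, rng)   # like Dense(n_units=d_model)

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # one position's vector
hidden, out = feed_forward(x, expand, project)

print(len(x), len(hidden), len(out))         # 4 16 4
print(feed_forward_param_count(512, 2048))   # 2099712
```

The output has to come back to d_model because it is added to the block's input through the residual connection, while d_ff is free to be anything. A larger d_ff gives the block more capacity at the cost of more parameters (about 2.1 million per feed-forward block at the paper's base sizes).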
Cheers