Can someone help me find the total number of parameters in the Transformer base and big models from the *Attention Is All You Need* paper?


Also, please describe how each parameter is used in the calculations.

Number of parameters in each multi-head attention layer:

`N_att = N(W_O) + (N(W_{Qi}) + N(W_{Ki}) + N(W_{Vi})) × h`

This is as far as I could get.

Where:

- `N(W_O)` is the number of parameters in the output weight matrix `W_O`.
- `N(W_{Qi})` is the number of parameters in the query weight matrix `W_{Qi}` of head `i`.
- `N(W_{Ki})` is the number of parameters in the key weight matrix `W_{Ki}` of head `i`.
- `N(W_{Vi})` is the number of parameters in the value weight matrix `W_{Vi}` of head `i`.
- `h` is the number of attention heads.
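To make the formula concrete, here is a sketch of the count for one multi-head attention layer of the base model, assuming the hyperparameters reported in the paper (`d_model = 512`, `h = 8`, `d_k = d_v = d_model / h = 64`) and assuming the projections have no bias terms:

```python
# Assumed base-model hyperparameters from the paper.
d_model = 512
h = 8
d_k = d_v = d_model // h  # 64

# Per-head projections: W_Qi and W_Ki are d_model x d_k, W_Vi is d_model x d_v.
per_head = d_model * d_k + d_model * d_k + d_model * d_v

# Output projection W_O maps the h concatenated heads (h * d_v) back to d_model.
n_wo = (h * d_v) * d_model

# N_att = N(W_O) + (N(W_Qi) + N(W_Ki) + N(W_Vi)) * h
n_att = n_wo + per_head * h
print(n_att)  # 1048576, i.e. 4 * d_model**2
```

Since `h * d_k = h * d_v = d_model`, the whole layer collapses to `4 * d_model**2` parameters, which is a handy sanity check when tallying the full model.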