Can anyone explain how they got the input for the BERT model? Is it an element-wise sum with the position embeddings?

Hi, Sindu!

As shown in the figure in your question, the dimensions of position_embedding, segment_embedding, and token_embedding are all (1, 4, 768), which means that each word in the sentence is represented by a 768-dimensional vector, and each position and each segment in the sentence is likewise represented by a 768-dimensional vector.

The following is a diagram of a position embedding with dimensions (2048, 512).

[figure: position embedding of shape (2048, 512)]

From the diagram, you can see that each position is represented by a 512-dimensional vector. I hope this helps you understand the position_embedding.
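If it helps, here is a minimal PyTorch sketch of how a (2048, 512) position-embedding table could be built using the sinusoidal scheme from the original Transformer. The shapes are chosen only to match the diagram above; BERT itself learns its position embeddings rather than using fixed sinusoids, and the variable names are my own.

```python
import torch

# Sinusoidal position-embedding table of shape (max_len, d_model) = (2048, 512).
max_len, d_model = 2048, 512

position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)    # (2048, 1)
div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)  # (256,)

pos_embedding = torch.zeros(max_len, d_model)
pos_embedding[:, 0::2] = torch.sin(position / div_term)  # even dimensions
pos_embedding[:, 1::2] = torch.cos(position / div_term)  # odd dimensions

print(pos_embedding.shape)  # torch.Size([2048, 512]) -- one 512-d vector per position
```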

Since the dimensions of position_embedding, segment_embedding, and token_embedding are the same, they can be added element-wise to get the final input.
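Here is a rough PyTorch sketch of that element-wise sum for a 4-token input. The vocabulary size, token ids, and layer names are illustrative assumptions on my part, not the exact values from the figure.

```python
import torch
from torch import nn

# Assumed sizes: 4-token sentence, hidden size 768, made-up vocab/length limits.
vocab_size, max_len, hidden = 30522, 512, 768

token_embedding = nn.Embedding(vocab_size, hidden)  # one vector per token id
segment_embedding = nn.Embedding(2, hidden)         # sentence A vs. sentence B
position_embedding = nn.Embedding(max_len, hidden)  # learned, one vector per position

token_ids = torch.tensor([[101, 7592, 2088, 102]])  # (1, 4): illustrative ids
segment_ids = torch.zeros_like(token_ids)           # (1, 4): all sentence A
position_ids = torch.arange(4).unsqueeze(0)         # (1, 4): positions 0, 1, 2, 3

tok = token_embedding(token_ids)        # (1, 4, 768)
seg = segment_embedding(segment_ids)    # (1, 4, 768)
pos = position_embedding(position_ids)  # (1, 4, 768)

bert_input = tok + seg + pos            # element-wise sum, still (1, 4, 768)
print(bert_input.shape)                 # torch.Size([1, 4, 768])
```

Because all three tensors share the shape (1, 4, 768), the `+` operator adds them element-wise, and the result is the input that is fed into the first encoder layer.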