I am unable to understand why we add two new axes to the padding mask, as shown below, changing its shape from [B, Seq_len] to [B, 1, 1, Seq_len] (where Seq_len is the number of words in the sequence):
seq[:, tf.newaxis, tf.newaxis, :]
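For context, here is a minimal sketch of just the shape change (B = 3 and Seq_len = 5 are example sizes; the values don't matter here):

```python
import tensorflow as tf

seq = tf.zeros([3, 5])                        # [B, Seq_len]
expanded = seq[:, tf.newaxis, tf.newaxis, :]  # [B, 1, 1, Seq_len]
print(expanded.shape)                         # (3, 1, 1, 5)
```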
Here is the input x:
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
The mask is:
[[[[1. 1. 0. 0. 1.]]]
 [[[1. 1. 1. 0. 0.]]]
 [[[0. 0. 0. 1. 1.]]]]
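I didn't show the mask construction above, so to be explicit: I am assuming the mask was built by marking the non-zero (non-padding) positions with 1 and then adding the two axes, roughly like this:

```python
import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

# Assumed reconstruction: 1 where x is non-zero (keep), 0 where padded
mask = tf.cast(tf.math.not_equal(x, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]
print(mask.shape)  # (3, 1, 1, 5), matching the printout above
```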
Here we want very large negative values in place of the zeros in x, so that after the softmax those positions contribute nothing. So I expected the result to be:
[[ 7.e+00,  6.e+00, -1.e+09, -1.e+09,  1.e+00],
 [ 1.e+00,  2.e+00,  3.e+00, -1.e+09, -1.e+09],
 [-1.e+09, -1.e+09, -1.e+09,  4.e+00,  5.e+00]]
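(For reference, this expected [3, 5] result is what applying the un-expanded 2-D mask directly would give; the masking op `x + (1 - mask) * -1e9` is my assumption about what is applied before the softmax:)

```python
import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

mask_2d = tf.cast(tf.math.not_equal(x, 0), tf.float32)  # [3, 5], no extra axes
print(x + (1. - mask_2d) * -1e9)                        # [3, 5], -1e9 at the zeros
```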
Instead, this is what we get:
[[[[ 7.e+00,  6.e+00, -1.e+09, -1.e+09,  1.e+00],
   [ 1.e+00,  2.e+00, -1.e+09, -1.e+09,  0.e+00],
   [ 0.e+00,  0.e+00, -1.e+09, -1.e+09,  5.e+00]]],

 [[[ 7.e+00,  6.e+00,  0.e+00, -1.e+09, -1.e+09],
   [ 1.e+00,  2.e+00,  3.e+00, -1.e+09, -1.e+09],
   [ 0.e+00,  0.e+00,  0.e+00, -1.e+09, -1.e+09]]],

 [[[-1.e+09, -1.e+09, -1.e+09,  0.e+00,  1.e+00],
   [-1.e+09, -1.e+09, -1.e+09,  0.e+00,  0.e+00],
   [-1.e+09, -1.e+09, -1.e+09,  4.e+00,  5.e+00]]]]
This is before applying the softmax.
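To make this reproducible, here is a self-contained sketch that produces exactly the output above (again assuming the masking op is `x + (1 - mask) * -1e9`):

```python
import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])  # [3, 5]

mask = tf.cast(tf.math.not_equal(x, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]  # [3, 1, 1, 5]

# Broadcasting aligns x as [1, 1, 3, 5] against the [3, 1, 1, 5] mask,
# so the result is [3, 1, 3, 5]: each batch element's mask row gets
# applied to every row of x, which is the output shown above.
print(x + (1. - mask) * -1e9)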