Data sizes for programming assignment of week 1

Hi,
Why do we have different sizes between the English and Portuguese data?
Tokenized english sentence:(14,)
Tokenized portuguese sentence (shifted to the right):(15,)
Tokenized portuguese sentence:(15,)
I can see having a temporary version as we adjust for SOS and EOS, but why do we stay with 15 for the Portuguese?
Thank you

Please ignore this question. The numbers I am using here may not be correct.

I dug into it and I am seeing this:
Tokenized english sentence:
[ 2 210 9 146 123 38 9 1672 4 3 0 0 0 0]

Tokenized portuguese sentence (shifted to the right):
[ 2 1085 7 128 11 389 37 2038 4 0 0 0 0 0 0]

Tokenized portuguese sentence:
[1085 7 128 11 389 37 2038 4 3 0 0 0 0 0 0]
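
If I read those arrays right, the shifted pair is the same tokenized sentence sliced before padding: the decoder input keeps [SOS] (id 2) and drops [EOS] (id 3), while the target drops [SOS] and keeps [EOS], and both are then padded to the same width. Here is a minimal sketch of that idea, using the ids as I read them off the printout (2 = [SOS], 3 = [EOS], 0 = padding; I have not verified this against the notebook):

import tensorflow as tf

# Tokenized Portuguese sentence from the printout above, before any padding.
# Assumed ids: 2 = [SOS], 3 = [EOS] (my reading of the output, not verified).
por = tf.ragged.constant([[2, 1085, 7, 128, 11, 389, 37, 2038, 4, 3]])

shifted = por[:, :-1]  # decoder input: keeps [SOS], drops [EOS]
target  = por[:, 1:]   # training target: drops [SOS], keeps [EOS]

# Padding to width 15 would then come from the longest Portuguese sentence
# in the batch when the ragged batch is converted to a dense tensor.
print(shifted.to_tensor().numpy())  # [[   2 1085    7  128   11  389   37 2038    4]]
print(target.to_tensor().numpy())   # [[1085    7  128   11  389   37 2038    4    3]]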

I do not know why the Portuguese is larger by one: 14 (eng) vs. 15 (por). Is this something the vectorizer decided based on the size of the Portuguese sentences, or something else?

It seems to be dictated by the size of the largest English and Portuguese sentences in the batch:

tf.Tensor(
[[ 2 6 186 7 124 15 72 24 133 6 18 251 4 3]
[ 2 5 59 373 33 479 96 9 61 6 67 114 4 3]], shape=(2, 14), dtype=int64)
tf.Tensor(
[[ 2 8 51 6 7 5 644 6 9 51 6 23 13 1941 4]], shape=(1, 15), dtype=int64)
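
That would match how tf.keras.layers.TextVectorization behaves when no output_sequence_length is set: each call pads to the longest tokenized sentence in that batch, and since English and Portuguese are vectorized separately, the two widths can differ (14 vs. 15 here). A quick sketch with made-up sentences, assuming the assignment's vectorizer is a TextVectorization layer with default settings (which I have not confirmed):

import tensorflow as tf

# Made-up sentences, not the assignment's data.
eng = ["[SOS] I like strong coffee [EOS]",
       "[SOS] I like coffee very very very much [EOS]"]
por = ["[SOS] eu gosto de cafe forte [EOS]",
       "[SOS] eu gosto muito muito muito muito de cafe [EOS]"]

def build_vectorizer(corpus):
    # standardize=None keeps the [SOS]/[EOS] markers as tokens
    vec = tf.keras.layers.TextVectorization(standardize=None)
    vec.adapt(corpus)
    return vec

eng_vec = build_vectorizer(eng)
por_vec = build_vectorizer(por)

# With no output_sequence_length, each batch is padded to its own longest sentence.
print(eng_vec(eng).shape)  # (2, 9)  -- longest English sentence has 9 tokens
print(por_vec(por).shape)  # (2, 10) -- longest Portuguese sentence has 10 tokens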
