Data sizes for programming assignment of week 1

Hi,
Why do we have different sizes between the English and Portuguese data?
Tokenized english sentence:(14,)
Tokenized portuguese sentence (shifted to the right):(15,)
Tokenized portuguese sentence:(15,)
I can see having a temporary version as we adjust for SOS and EOS, but why do we stay with 15 for the Portuguese?
Thank you

Please ignore this question. The numbers I am using here may not be correct.

I dug into it and I am seeing this:
Tokenized english sentence:
[ 2 210 9 146 123 38 9 1672 4 3 0 0 0 0]

Tokenized portuguese sentence (shifted to the right):
[ 2 1085 7 128 11 389 37 2038 4 0 0 0 0 0 0]

Tokenized portuguese sentence:
[1085 7 128 11 389 37 2038 4 3 0 0 0 0 0 0]
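
If I read those arrays right, the shifted pair is the same tokenized sentence sliced before padding: the decoder input keeps [SOS] (id 2) and drops [EOS] (id 3), while the target drops [SOS] and keeps [EOS], and both are then padded to the same width. Here is a minimal sketch of that idea, using the ids as I read them off the printout (2 = [SOS], 3 = [EOS], 0 = padding; I have not verified this against the notebook):

import tensorflow as tf

# Tokenized Portuguese sentence from the printout above, before any padding.
# Assumed ids: 2 = [SOS], 3 = [EOS] (my reading of the output, not verified).
por = tf.ragged.constant([[2, 1085, 7, 128, 11, 389, 37, 2038, 4, 3]])

shifted = por[:, :-1]  # decoder input: keeps [SOS], drops [EOS]
target  = por[:, 1:]   # training target: drops [SOS], keeps [EOS]

# Padding to width 15 would then come from the longest Portuguese sentence
# in the batch when the ragged batch is converted to a dense tensor.
print(shifted.to_tensor().numpy())  # [[   2 1085    7  128   11  389   37 2038    4]]
print(target.to_tensor().numpy())   # [[1085    7  128   11  389   37 2038    4    3]]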

I do not know why the Portuguese is larger by one: 14 (eng) vs. 15 (por). Is this something the vectorizer decided based on the size of the Portuguese sentences, or something else?

It seems to be dictated by the size of the largest English and Portuguese sentences in the batch:

tf.Tensor(
[[ 2 6 186 7 124 15 72 24 133 6 18 251 4 3]
[ 2 5 59 373 33 479 96 9 61 6 67 114 4 3]], shape=(2, 14), dtype=int64)
tf.Tensor(
[[ 2 8 51 6 7 5 644 6 9 51 6 23 13 1941 4]], shape=(1, 15), dtype=int64)
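
That would match how tf.keras.layers.TextVectorization behaves when no output_sequence_length is set: each call pads to the longest tokenized sentence in that batch, and since English and Portuguese are vectorized separately, the two widths can differ (14 vs. 15 here). A quick sketch with made-up sentences, assuming the assignment's vectorizer is a TextVectorization layer with default settings (which I have not confirmed):

import tensorflow as tf

# Made-up sentences, not the assignment's data.
eng = ["[SOS] I like strong coffee [EOS]",
       "[SOS] I like coffee very very very much [EOS]"]
por = ["[SOS] eu gosto de cafe forte [EOS]",
       "[SOS] eu gosto muito muito muito muito de cafe [EOS]"]

def build_vectorizer(corpus):
    # standardize=None keeps the [SOS]/[EOS] markers as tokens
    vec = tf.keras.layers.TextVectorization(standardize=None)
    vec.adapt(corpus)
    return vec

eng_vec = build_vectorizer(eng)
por_vec = build_vectorizer(por)

# With no output_sequence_length, each batch is padded to its own longest sentence.
print(eng_vec(eng).shape)  # (2, 9)  -- longest English sentence has 9 tokens
print(por_vec(por).shape)  # (2, 10) -- longest Portuguese sentence has 10 tokens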
