Hey,
In the C1W4 programming assignment we see the comment below:
In my opinion, the word "columns" here causes confusion (I tried transposing matrices X and Y), because the rows of the X and Y matrices are the embeddings corresponding to specific words. I would like to know whether "columns" is the right way to explain this.
Hi @EMIN_MAMMADZADA
I’m not sure. I would agree that the docstring could be better. X and Y have the shape (word_count x embedding_size), and stating that would be clearer.
Each column holds one embedding feature value (in this case 300 columns = 300 features, so each word is a row vector), and, as I understand it, the docstring misleads the reader into thinking that each column represents a word with 300 rows (a column vector).
But if you look at the instructions, they are clearer:
Returns:
- Matrix X and matrix Y, where each row in X is the word embedding for an English word, and the same row in Y is the word embedding for the French version of that English word.
Use the en_fr dictionary to ensure that the ith row in the X matrix corresponds to the ith row in the Y matrix.
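To make the row convention in those instructions concrete, here is a minimal sketch with toy data. The function name `build_matrices`, the toy dictionaries, and the 3-dimensional vectors are all made up for illustration; this is not the assignment's solution, and the real embeddings have 300 dimensions.

```python
import numpy as np

def build_matrices(en_fr, en_embeddings, fr_embeddings):
    """Hypothetical helper: stack paired embeddings as rows.

    Pairs where either word has no embedding are skipped, so the
    number of rows can be smaller than the dictionary size.
    """
    X_rows, Y_rows = [], []
    for en_word, fr_word in en_fr.items():
        if en_word in en_embeddings and fr_word in fr_embeddings:
            X_rows.append(en_embeddings[en_word])
            Y_rows.append(fr_embeddings[fr_word])
    # Each 1-D vector becomes one ROW of the resulting matrix.
    return np.vstack(X_rows), np.vstack(Y_rows)

# Toy data (assumed): 3 word pairs, 3-dimensional embeddings.
en_fr = {"cat": "chat", "dog": "chien", "house": "maison"}
en_embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
    "house": np.array([0.0, 0.0, 1.0]),
}
fr_embeddings = {
    "chat": np.array([0.9, 0.1, 0.0]),
    "chien": np.array([0.1, 0.9, 0.0]),
    # "maison" is deliberately missing, so that pair is dropped.
}

X, Y = build_matrices(en_fr, en_embeddings, fr_embeddings)
print(X.shape, Y.shape)  # (2, 3) (2, 3)
```

With these toy inputs, row 0 of X is the vector for "cat" and row 0 of Y is the vector for "chat", so the shape is (word_count, embedding_size).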
Cheers
P.S. Maybe native English speakers would clarify this?
I totally agree that the docstring is confusing and arguably just wrong. To confirm, I added print statements after the unit test cell and here’s what I get:
X_train.shape (4932, 300)
Y_train.shape (4932, 300)
So of the 5000 word pairs in the English-to-French dictionary, only 4932 have embeddings in both languages.
I think I have access to the git repo for NLP, so I will file a bug about this.
Update: actually there’s another problem with that docstring as well: it mentions R as a return value, but that is no longer part of this function.
Hi @EMIN_MAMMADZADA,
The description of X and Y is correct. A matrix (2D) in our case has rows that correspond to the En/Fr words (data points) and columns that correspond to the embeddings (features).
In ML, the rows of a matrix usually represent the data points and the columns represent their features.
It might help to think in terms of another use case, like classifying fruits based on features such as shape, colour, odor, etc. The rows will be fruit A, B… and the columns will be shape, colour, odor…
Similarly, in our case, the data point is a word (either En or Fr) and its features are its embedding values (there are 300 features). So a row is a word and the columns are its embedding.
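As a quick illustration of that rows-as-data-points convention, here is a small sketch using a random matrix as a stand-in for the real embeddings (the 5 words and the random values are assumptions; only the 300 columns match the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 300))   # 5 words (rows), 300 features (columns)

word_vec = X[2]          # embedding of the third word
feature_col = X[:, 7]    # one feature, taken across all words

print(word_vec.shape)     # (300,)
print(feature_col.shape)  # (5,)
print(X.T.shape)          # (300, 5): the transposed, column-per-word layout
```

Indexing a row gives a full 300-value embedding, while indexing a column gives a single feature across all words.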
@paulinpaloalto the ‘R’ should be removed though