C1_W4_Assignment UNQ_C16

The original code:

idx = np.argmax(cosine_similarity(document_vecs,tweet_embedding)) 
print(all_tweets[idx])

does not seem to return the closest tweet to the input tweet. The cosine_similarity function is supposed to take two vectors, but here it receives a matrix and a vector. If you print idx and the corresponding cosine similarity, they are 7213 and about 0.87.
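
For reference, the cosine_similarity implemented earlier in the course is (as far as I remember) roughly the following; variable names may differ from the notebook:

import numpy as np

def cosine_similarity(A, B):
    # intended for two 1-D vectors A and B
    dot = np.dot(A, B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma * normb)
    return cos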

However, there are other indices with higher cosine similarity, for example 5202 (it took me some time to find this one using the nearest_neighbor function and dealing with NaN values), so that tweet is actually closer. Look at the code below:

# UNQ_C16 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# this gives you a similar tweet as your input.
# this implementation is vectorized...
idx = np.argmax(cosine_similarity(document_vecs,tweet_embedding)) 
print(idx)
print(all_tweets[idx])
document_emb = get_document_embedding(all_tweets[idx], en_embeddings_subset)
print(cosine_similarity(document_emb, tweet_embedding))

print('\n')
print(all_tweets[5202])
document_emb = get_document_embedding(all_tweets[5202], en_embeddings_subset)
print(cosine_similarity(document_emb, tweet_embedding))

"""
7213
@zoeeylim sad sad sad kid :( it's ok I help you watch the match HAHAHAHAHA
0.8734228743257326


@hanbined sad pray for me :(((
1.0
"""

Can someone help check what's going on here? Thanks!


Hi Longyu_Zhao,

Good catch!

The problem is that cosine_similarity computes the norm of the whole matrix, whereas what is needed here is the norm of each row of the matrix.
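
To illustrate with a small standalone example (not from the notebook): np.linalg.norm on a 2-D array returns a single matrix (Frobenius) norm unless you pass axis=1 to get one norm per row:

import numpy as np

M = np.array([[3.0, 4.0],
              [6.0, 8.0]])

print(np.linalg.norm(M))          # 11.18..., one scalar for the whole matrix
print(np.linalg.norm(M, axis=1))  # [ 5. 10.], one norm per row (per document vector)

Dividing every dot product by the same matrix norm does not change which index wins the argmax, so the original line effectively takes an argmax over raw dot products rather than over true cosine similarities, which is why a different tweet comes out on top.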

I replaced cosine_similarity with an adjusted function cosine_similarity_matrix that takes the norm per row of the matrix, and this led to the correct result:
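A minimal sketch of such a function, assuming plain NumPy (the exact implementation may differ):

import numpy as np

def cosine_similarity_matrix(A, v):
    # cosine similarity between each row of A and the vector v
    dot = np.dot(A, v)                     # one dot product per row
    row_norms = np.linalg.norm(A, axis=1)  # one norm per document vector
    v_norm = np.linalg.norm(v)
    return dot / (row_norms * v_norm)      # rows with zero norm produce the NaN values mentioned above

# idx = np.argmax(cosine_similarity_matrix(document_vecs, tweet_embedding))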

I will report this issue to people working on the backend, suggesting either an adjustment to cosine_similarity or the inclusion of cosine_similarity_matrix.

Thanks!