I don’t understand Module 2 Slide 25
For each document, it looks like we are simply counting occurrences of the query words and assigning a single score, a single number, to the document.
My understanding was that:
- We build a dictionary of all possible words used
- We embed each document over the dictionary using TF (we haven’t covered TF-IDF yet)
- We embed the query over the dictionary using TF
- We compute the cosine similarity between each document and the query
And that’s where a single score for each document is obtained (see the sketch below).
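
To make the pipeline I have in mind concrete, here is a minimal sketch (the two document strings are my own placeholders, not the slide’s):

```python
import math
from collections import Counter

# Pipeline as I understand it: dictionary -> TF embeddings -> cosine similarity.
# The document texts are placeholders of my own; only the pipeline shape matters.
docs = [
    "bake the pizza in a hot oven",
    "no oven pizza cooked in a pan",
]
query = "how to build a pizza without a pizza oven"

# 1. Build a dictionary of all words seen in the documents.
dictionary = sorted({w for d in docs for w in d.split()})

def tf_embed(text):
    """2. TF embedding: raw count of each dictionary word in the text."""
    counts = Counter(text.split())
    return [counts[w] for w in dictionary]

def cosine(u, v):
    """3. Cosine similarity between two TF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = tf_embed(query)
for d in docs:
    # 4. One similarity score per document.
    print(round(cosine(tf_embed(d), q), 3), "-", d)
```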
So, our query is “how to build a pizza without a pizza oven”.
Given the dictionary [pizza, oven], the embedding of the query is [2, 1].
The embedding of the first document is [2, 1], so the dot product with the query is 2·2 + 1·1 = 5 (and the cosine similarity of two identical vectors is 1), so I don’t see where the 3 comes from.
The embedding of the second document is [1, 3], so the dot product is 2·1 + 1·3 = 5 (cosine similarity ≈ 0.71), so I don’t see where the 4 comes from.
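
Here is a minimal sketch of those computations on the toy vectors above. The `tf_sum` function is my guess at what the slide is doing: summing the raw counts of the query words in the document, which would reproduce the 3 and 4 shown on the slide:

```python
import math

# Toy TF vectors over the dictionary [pizza, oven], from the example above.
query = [2, 1]   # "pizza" appears twice in the query, "oven" once
doc1  = [2, 1]
doc2  = [1, 3]

def dot(u, v):
    """Plain dot product of two TF vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by both vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def tf_sum(doc):
    """My guess at the slide's 'TF score': total count of the query words
    in the document (here every dictionary word is a query word)."""
    return sum(doc)

for name, d in [("doc1", doc1), ("doc2", doc2)]:
    print(name, dot(query, d), round(cosine(query, d), 3), tf_sum(d))
# doc1 5 1.0 3
# doc2 5 0.707 4
```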
I don’t understand how, where, and why the “TF scoring” from the slides is used.
Also, neither the TF scoring nor the Normalized TF scoring seems to be used afterward.
Slide 32 seems to use simple binary scoring (i.e. 1 if the word appears at least once, and 0 otherwise; as long as a word appears at least once, the exact number of occurrences is irrelevant):
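
If that reading is right, a minimal sketch of the binary scheme on the same toy vectors would be the following; note it cannot distinguish the two documents, since both contain both query words:

```python
# Binary scoring: replace each TF count with 1 if the word appears, 0 otherwise.
def binarize(tf_vector):
    return [1 if count > 0 else 0 for count in tf_vector]

query = [2, 1]   # TF embedding of the query over [pizza, oven]
doc1  = [2, 1]
doc2  = [1, 3]

bq = binarize(query)                     # [1, 1]
for name, d in [("doc1", doc1), ("doc2", doc2)]:
    bd = binarize(d)
    # Dot product of the binary vectors = number of query words the doc shares.
    score = sum(a * b for a, b in zip(bq, bd))
    print(name, bd, score)               # both documents: [1, 1] -> score 2
```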

