I don't understand TF scoring

I don’t understand Module 2 Slide 25

For each document, it looks like we are counting the number of words and assigning a single score, a single number, to the document.

My understanding was that:

  1. We build a dictionary of all possible words used
  2. We embed each document as a TF vector over that dictionary (we haven't covered TF-IDF yet)
  3. We embed the query as a TF vector over the same dictionary
  4. We compute the cosine similarity between each document and the query

And that’s where a single score for each document is obtained.
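
In code, the pipeline I have in mind looks roughly like this (a sketch using scikit-learn's CountVectorizer and cosine_similarity; the two documents are made-up placeholders, not the ones from the slide):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents; the query is the one from the slide example
documents = [
    "bake a pizza on a steel plate instead of a pizza oven",
    "how to clean a pizza oven with oven cleaner",
]
query = "how to build a pizza without a pizza oven"

# Steps 1-2: build the dictionary from the documents and embed them as raw term counts (TF)
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Step 3: embed the query with the same dictionary
query_vector = vectorizer.transform([query])

# Step 4: one cosine-similarity score per document
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```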

So, our query is “how to build a pizza without a pizza oven”.

Given the dictionary [pizza, oven], the embedding of the query is [2, 1].

The embedding of the first document is [2, 1], so cosine similarity should be 5, not 3.

The embedding of the second document is [1, 3], so cosine similarity should be 7, not 4.
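
For reference, this is the step-4 calculation I have in mind for those vectors (a quick NumPy sketch):

```python
import numpy as np

# Dictionary: [pizza, oven]
query = np.array([2, 1])   # "how to build a pizza without a pizza oven"
doc1 = np.array([2, 1])
doc2 = np.array([1, 3])

def cosine(a, b):
    # Dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(query, doc1))
print(cosine(query, doc2))
```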

I don’t understand how, where, and why the “TF scoring” from the slides is used.


Also, neither the TF scoring nor the Normalized TF scoring seems to be used anywhere afterward.

Slide 32 seems to use the simple scoring (i.e., 1 if the word appears at least once and 0 otherwise; as long as a word appears at least once, the exact number of times it appears is irrelevant).
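
If I'm reading that right, the slide's score would be something like this (a rough sketch with naive whitespace tokenization):

```python
def simple_score(query, document):
    # One point for each distinct query word that appears in the document at all;
    # how many times it appears doesn't matter
    doc_words = set(document.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in doc_words)

print(simple_score("pizza oven", "a pizza baked in a pizza oven"))  # 2
```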

But the Wikipedia page for BM25 seems to sum the per-term scores for each document.
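
As I read that page, the document score is a sum of one contribution per query term, roughly like this (a sketch of the Wikipedia formula; k1 and b are the usual free parameters, and the corpus here is made up):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus is a list of tokenized documents; doc_terms is the tokenized document being scored
    avgdl = sum(len(d) for d in corpus) / len(corpus)     # average document length
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                        # raw count of the term in this document
        df = sum(1 for d in corpus if term in d)          # number of documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # One contribution per query term, summed into the document's score
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["pizza", "oven", "pizza"], ["oven", "cleaner", "for", "a", "dirty", "oven"]]
print(bm25_score(["pizza", "oven"], corpus[0], corpus))
```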

I guess we need an ungraded lab where we do retrieval with keyword search rather than with semantic similarity?

Is the actual retrieval algorithm different?

hi @billyboe

TF measures how frequently a term appears in a document. A common calculation is the number of times a term appears in a document divided by the total number of words in that document.

IDF measures how unique a term is across the entire corpus and is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. This helps to downweight common words (like “the”, “a”) that appear in many documents and highlight words that are more specific to a particular document.
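
A minimal sketch of those two definitions (plain Python, naive whitespace tokenization, made-up corpus):

```python
import math

def tf(term, document):
    # Term frequency: occurrences of the term divided by the total number of words in the document
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log of (total documents / documents containing the term);
    # assumes the term appears in at least one document
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)

corpus = [
    "a pizza baked in a pizza oven",
    "how to clean an oven",
    "the history of pizza",
]
print(tf("pizza", corpus[0]))                          # 2 / 7
print(idf("pizza", corpus))                            # log(3 / 2)
print(tf("pizza", corpus[0]) * idf("pizza", corpus))   # TF-IDF weight of "pizza" in document 1
```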

Check this link (https://stackoverflow.com/questions/60343826/how-to-manually-calculate-tf-idf-score-from-sklearns-tfidfvectorizer)

This is an example of how to compute the TF-IDF vector manually.

I would need an example of how to compute the similarity between the query and the document after they have both been embedded as TF-IDF vectors.
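
Something like this is what I'm picturing (a sketch with scikit-learn's TfidfVectorizer and cosine_similarity; the documents are placeholders). Is this the right idea?

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents; the query is the one from my example above
documents = [
    "bake a pizza on a steel plate instead of a pizza oven",
    "how to clean a pizza oven with oven cleaner",
]
query = "how to build a pizza without a pizza oven"

# Fit the TF-IDF vocabulary on the documents, then embed both the documents and the query with it
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# One cosine-similarity score per document; retrieval returns the highest-scoring documents
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```

(As far as I know, scikit-learn's default TfidfVectorizer uses a smoothed IDF and L2-normalizes each vector, so its numbers won't exactly match the log(N/df) formula above, but the idea of ranking documents by their score against the query should be the same.)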