I don't understand TF scoring

I don’t understand Module 2 Slide 25

For each document, it looks like we are counting the number of words and assigning a single score, a single number, to the document.

My understanding was that:

  1. We build a dictionary of all possible words used
  2. We embed each document as a TF vector over that dictionary (we haven't covered TF-IDF yet)
  3. We embed the query as a TF vector over the same dictionary
  4. We compute the cosine similarity between each document and the query

And that’s where a single score for each document is obtained.
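
In code, the pipeline I have in mind looks roughly like this (a sketch using scikit-learn's CountVectorizer and cosine_similarity; the two documents are made-up placeholders, not the ones from the slide):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents; the query is the one from the slide example
documents = [
    "bake a pizza on a steel plate instead of a pizza oven",
    "how to clean a pizza oven with oven cleaner",
]
query = "how to build a pizza without a pizza oven"

# Steps 1-2: build the dictionary from the documents and embed them as raw term counts (TF)
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Step 3: embed the query with the same dictionary
query_vector = vectorizer.transform([query])

# Step 4: one cosine-similarity score per document
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```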

So, our query is “how to build a pizza without a pizza oven”.

Given the dictionary [pizza, oven], the embedding of the query is [2, 1].

The embedding of the first document is [2, 1], so cosine similarity should be 5, not 3.

The embedding of the second document is [1, 3], so cosine similarity should be 7, not 4.
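
For reference, this is the step-4 calculation I have in mind for those vectors (a quick NumPy sketch):

```python
import numpy as np

# Dictionary: [pizza, oven]
query = np.array([2, 1])   # "how to build a pizza without a pizza oven"
doc1 = np.array([2, 1])
doc2 = np.array([1, 3])

def cosine(a, b):
    # Dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(query, doc1))
print(cosine(query, doc2))
```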

I don’t understand how, where, and why the “TF scoring” from the slides is used.


Also, neither the TF scoring nor the Normalized TF scoring seems to be used anywhere afterward.

Slide 32 seems to use the simple scoring (i.e., 1 if the word appears at least once and 0 otherwise; as long as a word appears at least once, the exact number of times it appears is irrelevant).
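
If I'm reading that right, the slide's score would be something like this (a rough sketch with naive whitespace tokenization):

```python
def simple_score(query, document):
    # One point for each distinct query word that appears in the document at all;
    # how many times it appears doesn't matter
    doc_words = set(document.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in doc_words)

print(simple_score("pizza oven", "a pizza baked in a pizza oven"))  # 2
```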

But the Wikipedia page for BM25 seems to sum the per-term scores for each document.
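
As I read that page, the document score is a sum of one contribution per query term, roughly like this (a sketch of the Wikipedia formula; k1 and b are the usual free parameters, and the corpus here is made up):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus is a list of tokenized documents; doc_terms is the tokenized document being scored
    avgdl = sum(len(d) for d in corpus) / len(corpus)     # average document length
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                        # raw count of the term in this document
        df = sum(1 for d in corpus if term in d)          # number of documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # One contribution per query term, summed into the document's score
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["pizza", "oven", "pizza"], ["oven", "cleaner", "for", "a", "dirty", "oven"]]
print(bm25_score(["pizza", "oven"], corpus[0], corpus))
```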

I guess we need an ungraded lab where we do retrieval with keyword search rather than with semantic similarity?

Is the actual retrieval algorithm different?

hi @billyboe

TF measures how frequently a term appears in a document. A common calculation is the number of times a term appears in a document divided by the total number of words in that document.

IDF measures how unique a term is across the entire corpus and is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. This helps to downweight common words (like “the”, “a”) that appear in many documents and highlight words that are more specific to a particular document.
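
A minimal sketch of those two definitions (plain Python, naive whitespace tokenization, made-up corpus):

```python
import math

def tf(term, document):
    # Term frequency: occurrences of the term divided by the total number of words in the document
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log of (total documents / documents containing the term);
    # assumes the term appears in at least one document
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)

corpus = [
    "a pizza baked in a pizza oven",
    "how to clean an oven",
    "the history of pizza",
]
print(tf("pizza", corpus[0]))                          # 2 / 7
print(idf("pizza", corpus))                            # log(3 / 2)
print(tf("pizza", corpus[0]) * idf("pizza", corpus))   # TF-IDF weight of "pizza" in document 1
```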

Check this link (https://stackoverflow.com/questions/60343826/how-to-manually-calculate-tf-idf-score-from-sklearns-tfidfvectorizer)

This is an example of how to compute the TF-IDF vector manually.

I would need an example of how to compute the similarity between the query and the document after they have both been embedded as TF-IDF vectors.
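
Something like this is what I'm picturing (a sketch with scikit-learn's TfidfVectorizer and cosine_similarity; the documents are placeholders). Is this the right idea?

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents; the query is the one from my example above
documents = [
    "bake a pizza on a steel plate instead of a pizza oven",
    "how to clean a pizza oven with oven cleaner",
]
query = "how to build a pizza without a pizza oven"

# Fit the TF-IDF vocabulary on the documents, then embed both the documents and the query with it
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# One cosine-similarity score per document; retrieval returns the highest-scoring documents
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```

(As far as I know, scikit-learn's default TfidfVectorizer uses a smoothed IDF and L2-normalizes each vector, so its numbers won't exactly match the log(N/df) formula above, but the idea of ranking documents by their score against the query should be the same.)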