In slide 71 of Module 3, why isn’t there a MaxSim for the first word?
hi @billyboe
In information retrieval models, MaxSim is the maximum similarity score between a single query token and all the tokens of a separate document.
The reason there is no MaxSim value for the first text (or token) in a given context is that MaxSim is an asymmetric, comparative metric that requires two distinct sets of tokens (a query and a document) to compare against each other.
MaxSim requires comparison - it takes a single token from the query text and finds its single most similar token in an entirely separate document text. This is repeated for every query token, and the per-token maximum similarity scores are summed into a final relevance score for the document.
A standalone first token has nothing to compare to - the first token of a text, taken on its own, has no other text serving as a separate document for comparison.
Lastly, MaxSim is a cross-text metric - it measures the relevance of one piece of text (a document) to another (a query); it is not a property of an individual token in isolation within its own document.
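To make the “max per query token, then sum” idea concrete, here is a rough sketch with tiny hand-made 2-d embeddings (toy numbers, not the course’s actual model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_emb, doc_emb):
    # For each query token embedding, keep only its single most similar
    # document token, then sum those per-token maxima into one score.
    return sum(max(cosine(q, d) for d in doc_emb) for q in query_emb)

# Toy embeddings: 2 query tokens, 3 document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(maxsim_score(query, doc))
```

Note the asymmetry: every query token gets a max, but document tokens that are nobody’s best match contribute nothing.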
Hope this clears your doubt.
Regards
DP
Sorry, it doesn’t, and honestly it sounds like an AI-generated answer.
The reason there is no MaxSim value for the first text (or token) in a given context is that MaxSim is an asymmetric, comparative metric that requires two distinct sets of tokens (a query and a document) to compare against each other.
What does this phrase mean? Is the reason just the definition of the function?
A standalone first token has nothing to compare to - the first token of a text, taken on its own, has no other text serving as a separate document for comparison.
Yes it does: the document token “The” can be compared with all the prompt tokens, just like the other document tokens have been.
It can be compared, but MaxSim pipelines typically ignore stop words like “the”, “a”, “is”, since they add noise to text analysis. The same idea shows up in techniques like TF-IDF, where the weighting scheme gives high scores to words that appear often in a specific document (high TF) but rarely across the entire collection (high IDF). Stop words occur in nearly every document, so they have low IDF and thus a low overall score.
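A toy IDF computation shows the low-IDF point directly (smoothing terms that real implementations add are omitted here for brevity):

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(term in doc for doc in docs)  # how many docs contain the term
    return math.log(len(docs) / df)

# Three tiny documents represented as sets of tokens.
docs = [{"the", "cat"}, {"the", "dog"}, {"the", "bird", "sings"}]
print(idf("the", docs))    # in every doc -> log(3/3) = 0.0
print(idf("sings", docs))  # in one doc   -> log(3/1) ≈ 1.10
```

Because “the” appears in all three documents, its IDF is exactly zero, so any TF-IDF weight built on it vanishes.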
In the preprocessing step, before vectorisation (converting text into numbers), a standard step is to filter out a predefined list of common stop words.
If “the” were included, its high frequency might skew the vector’s direction even when the core meaning (nouns, verbs) is similar, making the similarity score less accurate for topic relevance. By removing “the”, the vectors align on the meaningful terms, giving a more reliable MaxSim score for the content.
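That filtering step is a one-liner in practice. A minimal sketch (the stop-word list here is illustrative; real toolkits such as NLTK or spaCy ship much longer predefined lists):

```python
# Tiny illustrative stop-word list, not a production one.
STOP_WORDS = {"the", "a", "an", "is", "of", "to"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The cat sat on a mat"))
# -> ['cat', 'sat', 'on', 'mat']
```

After this pass, “The” never reaches the similarity computation, which is why it gets no MaxSim arrow on the slide.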
Now it makes more sense. Thanks!
