Hi @kietlac
This table shows how many times the words “lottery” and “win” occur in different text samples, along with each sample's true label (spam, or not spam, i.e. ham).
The quiz asks you to assign a score to each of the words “lottery” and “win”, and then find a threshold value. If the sum of the scores for these words in a message exceeds the threshold, the message should be classified as spam; otherwise, it’s considered ham.
Hope it helps! Feel free to ask if you need further assistance.
But how do they calculate it?
They only give the answer without explaining the method
Hi again @kietlac
You must apply what you have been taught in the course to solve this!
In this case, you can treat each message as a 2D vector where the components are the counts of the words “lottery” and “win.” Then find a weight vector (let’s say $[w_1, w_2]$) such that the dot product of the weight vector with each message vector gives a score. If this score passes a certain threshold, the message is classified as spam. So, it’s just computing the dot product between a weight vector and feature vectors, then comparing the result to a threshold.
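To make the dot-product idea concrete, here is a minimal Python sketch. The word counts, the weights `[1, 1]`, and the threshold `1.5` are all made-up assumptions for illustration; they are not the values from the quiz table.

```python
# Hypothetical toy data (NOT the quiz table): each entry is
# ((count of "lottery", count of "win"), is_spam).
messages = [
    ((2, 1), True),
    ((0, 0), False),
    ((1, 1), True),
    ((0, 1), False),
    ((1, 0), False),
]


def score(weights, counts):
    """Dot product of the weight vector with a message's word counts."""
    return sum(w * x for w, x in zip(weights, counts))


def is_spam(weights, threshold, counts):
    """Classify as spam when the score strictly exceeds the threshold."""
    return score(weights, counts) > threshold


weights = [1, 1]   # assumed scores for "lottery" and "win"
threshold = 1.5    # assumed threshold

predictions = [is_spam(weights, threshold, counts) for counts, _ in messages]
labels = [label for _, label in messages]
print(predictions == labels)

# Highest score among ham messages and lowest score among spam messages.
ham_max = max(score(weights, c) for c, spam in messages if not spam)
spam_min = min(score(weights, c) for c, spam in messages if spam)
print(ham_max, spam_min)
```

With these assumed weights, any threshold `t` with `ham_max <= t < spam_min` separates the toy data, which is why more than one threshold value can be a correct answer for the same scores.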
Hope it helps! Feel free to ask if you need further assistance.
Hello @Alireza!
Could you please demonstrate how we could use the methods we were taught to derive the threshold value? From the lecture, it looked like the solution simply picked the threshold by visually inspecting which aggregated scores corresponded to Spam=no vs. Spam=yes (e.g., from what was shown, a threshold of x > 1 would have worked just as well). Thanks in advance for clarifying!
The method for classifying examples (true/false) is called logistic regression. The M4ML specialization presents that topic (as the “perceptron”) in Course 2 - Week 3.
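As a rough illustration of how weights and a threshold can be learned instead of picked by eye, here is a sketch of the classic perceptron update rule in Python. It is not necessarily the exact procedure used in the course or the quiz. The data and the names `train_perceptron` and `predict` are hypothetical, and the rule is only guaranteed to stop when the data is linearly separable.

```python
# Hypothetical toy data (NOT the quiz table): each entry is
# ((count of "lottery", count of "win"), label), with 1 = spam and 0 = ham.
data = [
    ((2, 1), 1),
    ((0, 0), 0),
    ((1, 1), 1),
    ((0, 1), 0),
    ((1, 0), 0),
]


def predict(weights, bias, counts):
    """Return 1 (spam) if w . x + b > 0, else 0 (ham)."""
    total = sum(w * x for w, x in zip(weights, counts)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, max_epochs=1000):
    """Perceptron rule: nudge weights and bias after every mistake."""
    weights = [0, 0]
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for counts, label in samples:
            error = label - predict(weights, bias, counts)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, counts)]
                bias += error
        if mistakes == 0:  # every sample classified correctly
            break
    return weights, bias


weights, bias = train_perceptron(data)
threshold = -bias  # w . x + b > 0 is the same test as w . x > -b
print("weights:", weights, "threshold:", threshold)
```

The learned bias plays the role of the threshold: a message is spam when its weighted score exceeds `-bias`. Different starting values or sample orders can give different, equally valid weights and thresholds.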