C3W1 Assignment Error

I need some help with the C3W1 Naive Bayes assignment.

I implemented get_word_frequency as described, but both the test output and the unit tests fail.

{CODES REMOVED BY MODERATOR}-AGAINST COMMUNITY GUIDELINES, REFER CODE OF CONDUCT FAQ

How can 'river' be expected to have 2 spam and 3 ham in the test output when there are only 3 input emails?

In the unit tests, how can the count for the word 'processed' (for example) be expected to be 2 spam and 1 ham when it only appears once in the test-case emails?

I am losing my mind. This doesn’t make any sense. Either I’ve been at this too long and am missing something, or the expected outputs and tests are wrong.

week-1
Assignment: C3_W1 Assignment
Math for Machine Learning
Probability & Statistics for Machine Learning &...

(ASSIGNMENT LINK TO THE LAB REMOVED BY THE MODERATOR)-AGAINST COMMUNITY GUIDELINES, REFER CODE OF CONDUCT FAQ

These should both be 1’s, not 0’s.

Please don’t post your code on the forum. That’s not allowed by the Code of Conduct.

I see. If the count is actually 0, what is the reasoning for initializing it to 1?

The comment on the line above that bit of code says why.

I am also struggling with this part.

From my point of view, initializing it with 0 is correct. As we loop over the emails, each one can only be either spam or not spam.
Moreover, the next statement checks for ham and adds 1 to word_dict[word]['ham'] if true, and does the same for spam, adding 1 to word_dict[word]['spam'] if true.

The first test has only one spam mail ([1, 0, 0]), so it should not be possible for a word to appear more than once in the spam category. Also, the sum of the ham and spam counts per word cannot be larger than the number of mails.

##### Expected Output (the output order may vary, what is important is the value for each word).

{'going': {'spam': 2, 'ham': 1}, 'river': {'spam': 2, 'ham': 3}, 'like': {'spam': 2, 'ham': 1}, 'deep': {'spam': 1, 'ham': 2}, 'love': {'spam': 1, 'ham': 2}, 'hate': {'spam': 1, 'ham': 2}}

Sorry, but I do not have access to the notebook at this time, so I cannot check the details.

The reason for initializing the counts to 1 rather than 0 is explained in detail in the instructions for that section. We are computing the product of the probabilities of each word in a given email appearing in a spam (or ham, in the other case) email. But what if there is a single word in the email that literally appears in 0 spam emails in the particular training corpus that you are working with? Then the whole product gets killed by the zero probability for that one word. Starting the counts at 1 is a strategy to avoid that scenario.

This may seem like a bit of a “hack”, but the point is that the training corpus is limited, and there is (at least in principle) no single word that can guarantee an email is not spam in every possible scenario. Remember that we’re trying to train a model that will work well on arbitrary input data that it has not actually been trained on. What word can you think of such that it is literally impossible to construct a spam email that contains it?
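A toy illustration of the zero-probability problem described above (this is not the assignment's code; the function name, counts, and denominators here are all made up for the example):

```python
# Naive-Bayes-style score: multiply P(word | spam) over the email's words.
# One word with a zero count wipes out the whole product.
def spam_score(words, freqs, n_spam):
    score = 1.0
    for word in words:
        score *= freqs[word]["spam"] / n_spam
    return score

# With raw counts, "lottery" was never seen in a spam email, so its count is 0
raw = {"free": {"spam": 3}, "lottery": {"spam": 0}}
print(spam_score(["free", "lottery"], raw, 4))      # 0.0: the product is killed

# Starting every count at 1 keeps every factor positive
smoothed = {"free": {"spam": 4}, "lottery": {"spam": 1}}
print(spam_score(["free", "lottery"], smoothed, 4)) # 0.25: nonzero score survives
```

This is the same idea as Laplace (add-one) smoothing, though the assignment applies it simply by initializing the counts.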

Remember that we’re computing the total frequencies of each word appearing in spam and ham emails in the training corpus.

Maybe I’m not getting your point here, but you’re right that any given mail is one or the other. But remember that some words in the corpus will appear in both spam and ham emails. Some may appear only in one type, but will have a count of 1 for the missing type given the way the instructions told us to compute the counts.
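A minimal sketch of the counting logic being discussed, assuming an interface like the lab's (the function name and data shapes here are assumptions, not the actual assignment code):

```python
def get_word_frequency(emails, labels):
    """Count, per word, the number of spam and ham emails containing it.
    Both counts start at 1 (not 0), per the instructions, so no word ever
    ends up with a zero frequency in either category."""
    word_dict = {}
    for email, label in zip(emails, labels):
        for word in set(email):  # count each word once per email
            if word not in word_dict:
                word_dict[word] = {"spam": 1, "ham": 1}  # initial 1s
            if label == 1:
                word_dict[word]["spam"] += 1
            else:
                word_dict[word]["ham"] += 1
    return word_dict

emails = [["deep", "river"], ["love", "river"], ["river", "hate"]]
labels = [1, 0, 0]  # one spam mail, two ham mails
freqs = get_word_frequency(emails, labels)
# "river" appears in 1 spam and 2 ham emails; with the initial 1s its
# counts come out as spam=2, ham=3, even though there are only 3 emails
```

This is why a word's spam + ham counts can exceed the number of emails: each count carries a built-in offset of 1.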


Thank you for the explanation.