From my point of view, initializing it with 0 is correct. As we loop over the emails, each one can only be either spam or not spam.
Moreover, the next statement checks for ham and adds 1 to word_dict[word]['ham'] if true, and does the same for spam, adding 1 to word_dict[word]['spam'] if true.
The first test has only one spam mail, [1,0,0], so it is not possible for a word to appear more than once in the spam category. Also, the sum of ham and spam per word cannot be larger than the number of mails.
##### Expected Output (the output order may vary, what is important is the value for each word).
The reason for initializing the counts to 1 rather than 0 is explained in detail in the instructions for that section. We are computing the product of the probabilities of each word in a given email appearing in a spam (or, in the other case, ham) email. But what if a single word in the email appears in literally 0 spam emails in the particular training corpus that you are working with? Then the whole product is wiped out by the zero probability for that one word. Starting the counts at 1 is a strategy to avoid that scenario. This may seem like a bit of a “hack”, but the point is that the training corpus is limited and there is (at least in principle) no single word that can guarantee an email is not spam in every possible scenario. Remember that we’re trying to train a model that will work well on arbitrary input data that it has not actually been trained on. Can you think of a word such that it is literally impossible to construct a spam email that contains it?
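To make the zero-probability problem concrete, here is a minimal sketch. The words, the counts, and the helper `spam_likelihood` are all made up for illustration and are not taken from the assignment. It also keeps the denominator unchanged for simplicity, which is an assumption and may differ from how the assignment normalizes the counts.

```python
from math import prod

def spam_likelihood(words, spam_counts, n_spam):
    # Hypothetical helper: product of per-word estimates count / n_spam
    return prod(spam_counts[w] / n_spam for w in words)

# Toy numbers, invented for illustration only
email = ["free", "meeting"]
n_spam = 10

# Counts started at 0: "meeting" never showed up in a spam email
raw_counts = {"free": 8, "meeting": 0}

# Same corpus, but with every count started at 1 instead of 0
smoothed_counts = {w: c + 1 for w, c in raw_counts.items()}

zeroed = spam_likelihood(email, raw_counts, n_spam)      # one zero factor
kept = spam_likelihood(email, smoothed_counts, n_spam)   # no zero factor

print(zeroed)  # the single unseen word forces the product to 0.0
print(kept)    # small but nonzero, so the other words still matter
```

With the counts started at 0, the evidence from "free" is thrown away entirely. With the counts started at 1, the unseen word only makes the product smaller.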
Remember that we’re computing the total frequencies of each word appearing in spam and ham emails in the training corpus.
Maybe I’m not getting your point here, but you’re right that any given mail is one or the other. Remember, though, that some words in the corpus will appear in both spam and ham emails. Some may appear in only one type, but they will still have a count of 1 for the missing type, given the way the instructions told us to compute the counts.
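Here is a rough sketch of the counting as I understand it, on a made-up three-mail corpus with labels [1,0,0]. The function name `get_word_frequency`, the toy emails, and the choice to count each word once per email are assumptions for illustration, not the assignment's actual code or test data.

```python
def get_word_frequency(emails, labels):
    """Count, per word, how many emails of each class contain it.

    Assumptions: `emails` is a list of token lists, `labels` uses 1 for
    spam and 0 for ham, and every count starts at 1 rather than 0.
    """
    word_dict = {}
    for tokens, label in zip(emails, labels):
        for word in set(tokens):  # count each word once per email
            if word not in word_dict:
                # First time we see the word: both classes start at 1
                word_dict[word] = {"spam": 1, "ham": 1}
            if label == 1:
                word_dict[word]["spam"] += 1
            else:
                word_dict[word]["ham"] += 1
    return word_dict

# Toy corpus, invented for illustration: one spam mail, two ham mails
emails = [["going", "to", "marathon"], ["going", "to", "work"], ["hello", "work"]]
labels = [1, 0, 0]
freq = get_word_frequency(emails, labels)

print(freq["going"])     # {'spam': 2, 'ham': 2}
print(freq["work"])      # {'spam': 1, 'ham': 3}
print(freq["marathon"])  # {'spam': 2, 'ham': 1}
```

Note that "going" ends up with spam 2 even though there is only one spam mail, and its spam plus ham total is 4 with only 3 mails. That is just the effect of starting both counts at 1, not a sign that a mail was counted twice.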