We talk about the possibility of naively encoding the features for a tweet as a vector of length V, where V is the total size of our vocabulary, and then we suggest that a way to compress this is using a frequency dictionary. But no time is spent on why this solution was chosen over other possible ways to compress.
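To make the contrast concrete, here is a toy sketch of the naive length-V encoding (the vocabulary, function names, and tweet are mine, not the course's):

```python
# Toy vocabulary, so V = 4. A real vocabulary would have thousands of entries,
# which is what makes the length-V vector wasteful: almost every slot is zero.
vocab = ["happy", "sad", "i", "am"]

def one_hot_counts(tweet, vocab):
    """Naive length-V encoding: one count per vocabulary word."""
    words = tweet.lower().split()
    return [words.count(w) for w in vocab]

print(one_hot_counts("i am happy", vocab))  # [1, 0, 1, 1]
```

The frequency-dictionary approach replaces this sparse length-V vector with just a couple of summed counts per tweet, which is the compression step the lesson glosses over.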

The frequency dictionary forces each tweet to be represented by only two numbers: the total number of times its words show up in all positive tweets, and the total number of times they show up in all negative tweets. If we've gone to that step, then why isn't it valid to just add up these values for each word in the tweet and assume it's positive if the positive count is greater, or negative if the negative count is greater? What is even the point of all the ML steps?
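Here is a toy sketch of the heuristic I mean next to the feature vector the lesson actually builds (the `freqs` dictionary and its numbers are made up for illustration; the function names are mine):

```python
# freqs maps (word, class) -> how often the word appears in tweets of that
# class across the training corpus (1 = positive, 0 = negative). Invented data.
freqs = {("happy", 1): 40, ("happy", 0): 5,
         ("sad", 1): 3,    ("sad", 0): 30,
         ("i", 1): 100,    ("i", 0): 100,
         ("am", 1): 80,    ("am", 0): 85}

def class_sums(tweet):
    """Summed positive and negative corpus counts over the tweet's words."""
    words = set(tweet.lower().split())
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return pos, neg

def raw_vote(tweet):
    """The heuristic in question: positive iff the positive sum wins."""
    pos, neg = class_sums(tweet)
    return 1 if pos > neg else 0

def features(tweet):
    """What the model trains on instead: [bias, pos_sum, neg_sum].
    Logistic regression then *learns* weights for the two sums rather
    than comparing them one-for-one as raw_vote does."""
    pos, neg = class_sums(tweet)
    return [1.0, float(pos), float(neg)]

print(raw_vote("i am happy"))  # 1
print(features("i am happy"))  # [1.0, 220.0, 190.0]
```

`raw_vote` is effectively a fixed decision boundary of pos = neg; training the weights lets the boundary tilt and shift when, say, positive counts are systematically inflated by common neutral words.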

Also, why is it important to keep the count of both positive and negative? Couldn't you reduce the dimensionality by one by just storing the positive/negative difference? Is there some scenario where a 2-point difference means one thing with larger counts than it does with smaller counts? If so, then why not store a difference and a magnitude instead? Are there tradeoffs to consider?
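A tiny sketch of the alternative encoding I'm describing (my own function name and made-up counts):

```python
def diff_mag(pos, neg):
    """Alternative 2-D encoding: (difference, magnitude).
    It is invertible -- pos = (d + m) / 2, neg = (m - d) / 2 --
    so it carries exactly the same information as (pos, neg)."""
    return (pos - neg, pos + neg)

# The same +2 difference at very different magnitudes:
print(diff_mag(3, 1))      # (2, 4)
print(diff_mag(503, 501))  # (2, 1004)
```

Since (difference, magnitude) is just a linear change of basis of (pos, neg), a linear model like logistic regression can express exactly the same decision boundaries either way; only dropping the magnitude and keeping the difference alone would actually lose information.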

Also, why do we introduce a bias of 1 if our whole point in doing all this is to compress the information we're attempting to train against?

Also, no time is spent on the choice to avoid counting the same word twice. Why is that avoided? If someone tweets "I am so happy I can't tell you how happy", why would it be bad to count "happy" twice? Shouldn't the positive sentiment there be weighted just as much as if they had chosen some synonym, since we would count that word in that case, i.e. "I am so happy I can't tell you how overjoyed"?
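To pin down what the deduplication actually changes, here is a toy comparison (the `freqs` entry, function name, and `dedupe` flag are mine, invented for illustration):

```python
# Invented corpus count: "happy" appears 40 times in positive tweets.
freqs = {("happy", 1): 40}

def pos_sum(tweet, freqs, dedupe=True):
    """Positive-count sum for a tweet, with or without the
    count-each-word-once rule the lesson applies."""
    words = tweet.lower().split()
    if dedupe:
        words = set(words)  # each distinct word contributes once
    return sum(freqs.get((w, 1), 0) for w in words)

tweet = "i am so happy i cant tell you how happy"
print(pos_sum(tweet, freqs, dedupe=True))   # 40
print(pos_sum(tweet, freqs, dedupe=False))  # 80
```

So deduplication caps each word's contribution at one occurrence per tweet, whereas the synonym version ("happy ... overjoyed") would contribute both words' counts, which is exactly the asymmetry being questioned.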