Why are tweets clustered in LR training visualization?

In the third notebook from Course 1, week 1 (C1_W1_lecture_nb_03_logistic_regression_model), there is a scatter plot of the two logistic regression features for the training set:


In this visualization, most of the training examples seem to be clustered into small areas of the plane, and it looks to me as if these clusters appear at regular intervals along either the Negative or Positive axis. Does anyone know the source of this (possibly periodic) clustering?

I think this is a very well-balanced dataset, which is why you have the same number of examples for both sentiments and similar value ranges for the negative and positive features.

Figured it out. If you filter on tweets with the first (positive) feature > 8000, you’ll see that they have lots of smiley faces :). Similarly, for large values of the negative feature, you’ll see lots of frowny faces :(.
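For anyone who wants to repeat that filter, here is a minimal sketch. The `tweets` list and the feature matrix `X` (columns: bias, positive, negative) are made-up stand-ins, not the notebook's actual data, so treat the values as purely illustrative.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's data: a few raw tweets and
# their [bias, positive, negative] feature rows. The values are made up.
tweets = [
    "great day :) :) :)",
    "so tired today",
    "missed my train :( :(",
    "thanks everyone :)",
]
X = np.array([
    [1.0, 9100.0,   35.0],
    [1.0,  120.0,   95.0],
    [1.0,   40.0, 7400.0],
    [1.0, 3050.0,   20.0],
])

# Keep only tweets whose positive feature (column 1) exceeds the threshold.
mask = X[:, 1] > 8000
high_positive = [t for t, keep in zip(tweets, mask) if keep]

# Count the smiley faces in each tweet that passed the filter.
for t in high_positive:
    print(t.count(":)"), "smiley(s):", t)
```

Swapping the column index to 2 gives the same check for the negative feature and frowny faces.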

The smiley face counts are 2960:2 (positive:negative), while the frowny face counts are 3675:1 (negative:positive).

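So presumably the clusters near (0, 0) contain tweets with no smiley or frowny faces, while each successive cluster contains tweets with 1, 2, 3, etc. of them, since every extra emoticon adds roughly 2960 (or 3675) to the corresponding feature.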
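The arithmetic behind that guess can be sketched as follows. This assumes each feature is the sum of per-token counts over every token occurrence in the tweet (I have not checked how the notebook's features were actually built). The emoticon counts are the ones quoted above; the other words and their counts are invented, and `extract_features` here is a hypothetical helper rather than the course's function.

```python
# Toy frequency table keyed by (token, label), with label 1 = positive and
# 0 = negative. The emoticon counts come from the thread; the word counts
# are invented for illustration.
freqs = {
    (":)", 1): 2960, (":)", 0): 2,
    (":(", 1): 1,    (":(", 0): 3675,
    ("happi", 1): 160, ("happi", 0): 20,
    ("day", 1): 190,   ("day", 0): 150,
}

def extract_features(tokens, freqs):
    """Sum positive and negative counts over every token occurrence."""
    pos = sum(freqs.get((tok, 1), 0) for tok in tokens)
    neg = sum(freqs.get((tok, 0), 0) for tok in tokens)
    return pos, neg

# Same two words each time, with 0 to 3 smiley faces appended.
for n_smileys in range(4):
    tokens = ["happi", "day"] + [":)"] * n_smileys
    print(n_smileys, extract_features(tokens, freqs))
```

Under these assumptions the positive feature jumps by 2960 per smiley while the negative feature barely moves, which would produce evenly spaced clusters along the Positive axis. Three smileys are enough to push a tweet past 8000.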
