Why are tweets clustered in LR training visualization?

David_Fox · May 22, 2023, 8:18pm

In the third notebook from Course 1, week 1 (C1_W1_lecture_nb_03_logistic_regression_model), there is a scatter plot of the two logistic regression features for the training set:

training_tweets

In this visualization, most of the training tweets seem to be clustered into small areas of the plane, and it looks to me as if these clusters appear at regular intervals along either the Negative or Positive axis. Does anyone know the source of this (possibly periodic) clustering?

gent.spah · May 23, 2023, 7:29am

I think this is a very well balanced dataset, which is why you have the same number of inputs for both sentiments, with the same ranges of values for negative and positive.

David_Fox · May 24, 2023, 10:25pm

Figured it out. If you filter on tweets with first (positive) feature > 8000, you’ll see that they have lots of smiley faces :). Similarly, for large negative values, you’ll see lots of frowny faces :(.
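A minimal sketch of that filter, in case it helps anyone reproduce it. The `tweets` list and the feature array `X` are made-up stand-ins, not the notebook's actual variables or data; the only thing taken from above is the 8000 threshold on the positive feature.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's data: one row per tweet,
# columns are [positive feature, negative feature]. All values are invented.
tweets = ["great day :) :) :)", "so tired :(", "new phone arrived"]
X = np.array([[8950.0, 40.0], [30.0, 3700.0], [55.0, 48.0]])

# Keep only tweets whose positive feature exceeds the threshold.
mask = X[:, 0] > 8000
selected = [t for t, keep in zip(tweets, mask) if keep]
print(selected)
```

Swapping the column index to 1 would give the same kind of filter on the negative feature.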

The counts for the smiley face are 2960 positive to 2 negative, while the frowny face is 3675 negative to 1 positive.

So presumably the clusters near (0, 0) have no smiley or frowny faces, while each successive cluster has 1, 2, 3, etc. of them.
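Here is a rough sketch of why that would produce evenly spaced clusters. It assumes each feature is the sum of per-token counts over every token in the tweet, repeats included. The function name `extract_features`, the `freqs` layout, and the "movie" counts are illustrative assumptions; only the two emoticon counts come from the numbers above.

```python
# Toy frequency table: (token, label) -> count. Label 1.0 is positive, 0.0 negative.
# Only the emoticon counts come from the thread; the "movie" entries are invented.
freqs = {
    (":)", 1.0): 2960, (":)", 0.0): 2,
    (":(", 1.0): 1, (":(", 0.0): 3675,
    ("movie", 1.0): 12, ("movie", 0.0): 9,
}

def extract_features(tokens, freqs):
    """Return (positive_sum, negative_sum) over all tokens, repeats included."""
    pos = sum(freqs.get((t, 1.0), 0) for t in tokens)
    neg = sum(freqs.get((t, 0.0), 0) for t in tokens)
    return pos, neg

# Each extra ":)" shifts the positive feature by 2960 and barely moves the negative one.
for n in range(4):
    tokens = ["movie"] + [":)"] * n
    print(n, extract_features(tokens, freqs))
```

Under these assumptions the positive feature steps through 12, 2972, 5932, 8892 as smileys are added, so tweets would pile up at roughly regular intervals along the positive axis, and the same would hold for frowny faces along the negative axis.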
