C3_W2_Lab2_Ex1_indices for users?

[user id] [rating count] [rating ave] Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy Horror Mystery Romance Sci-Fi Thriller
1 0 -1.0 -0.8 -0.7 0.1 -0.0 -1.2 -0.4 0.6 -0.5 -0.5 -0.1 -0.6 -0.6 -0.7 -0.7
0 1 -0.7 -0.5 -0.7 -0.1 -0.2 -0.6 -0.2 0.7 -0.5 -0.8 0.1 -0.0 -0.6 -0.5 -0.4
-1 -1 -0.2 0.3 -0.4 0.4 0.5 1.0 0.6 -1.2 -0.3 -0.6 -2.3 -0.1 0.0 0.4 -0.0
0 -1 0.6 0.5 0.5 0.2 0.6 -0.1 0.5 -1.2 0.9 1.2 -2.3 -0.1 0.0 0.2 0.3
-1 0 0.7 0.6 0.5 0.3 0.5 0.4 0.6 1.0 0.6 0.3 0.8 0.8 0.4 0.7 0.7

Have means been subtracted from user id and rating count? Why, and how are 0’s and -1’s interpreted?

Hi. Yes, all the data is preprocessed with sklearn's StandardScaler, which standardizes each feature by removing the mean and scaling to unit variance. You can check the documentation here if you want more details.

The user_id is also included among the scaled features. The scaled value tells you how "old" a user is relative to the mean user id. For instance, let's say we have 5 users with ids 1 to 5; the mean would be:

1 + 2 + 3 + 4 + 5 = 15

15/5 = 3

mean = 3
sd = 1.4

So in this case the scaler expresses how far a user id is from the average, in units of standard deviation:

(1 - 3)/1.4 = -1.4

The value for the first user would be -1.4. The printed numbers are rounded, but the arrays still contain the actual values.
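The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the same calculation, not the lab's code: the five user ids are the made-up example from this post, and the population standard deviation (`pstdev`) is used because that is what StandardScaler divides by.

```python
from statistics import fmean, pstdev

# Hypothetical user ids from the example above (not the lab's data).
user_ids = [1, 2, 3, 4, 5]

mean_id = fmean(user_ids)   # 3.0
std_id = pstdev(user_ids)   # population standard deviation, about 1.4

# Standardize: subtract the mean, then divide by the standard deviation.
scaled_ids = [(uid - mean_id) / std_id for uid in user_ids]

for uid, z in zip(user_ids, scaled_ids):
    print(f"user id {uid}: scaled value {z:+.2f}")
```

The first user comes out at roughly -1.4 and the middle user at exactly 0, which is why scaled ids cluster around 0 and can be negative.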

Let me know if this helps!

Hello @Richard_Rasiej,

When you run the code in order, you will first see the unnormalized data:

However, after you run the normalization code and then go back to print the data, you will see the “normalized” version:

Here the first and the second columns are set to print as integers. You can see the original, decimal numbers with print(user_train).

Then after you also run the train-test-splitting code, the samples are shuffled and sampled, and at this time if you, again, go back to print the data, you get what you have posted:

So, yes, means have been subtracted from user id and rating count, as instructed by the code, and those 0s, 1s, and -1s are the scaled values rounded to integers for printing.
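To see how scaled values end up looking like 0, 1, and -1, here is a small sketch in plain Python. The lab's print helper is not shown in this thread, so the zero-decimal format (`.0f`) is an assumption about how the first two columns are displayed, and the ids are the made-up 1 to 5 example.

```python
from statistics import fmean, pstdev

# Hypothetical ids; the lab's real columns are much longer.
user_ids = [1, 2, 3, 4, 5]
mean_id = fmean(user_ids)
std_id = pstdev(user_ids)
scaled = [(uid - mean_id) / std_id for uid in user_ids]

# Assumed display format: zero decimals, as if the column were integer-valued.
as_integers = [f"{z:.0f}" for z in scaled]
# The underlying values are unchanged; only the display was rounded.
as_decimals = [f"{z:.3f}" for z in scaled]

print("printed as integers:", as_integers)
print("actual values:      ", as_decimals)
```

A value such as -0.707 is displayed as -1, so the table only appears to hold integers.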

I think the more interesting question is: would you use user id and movie id as part of the training features? What do you think?

Cheers,
Raymond

I would think that the ids themselves have nothing to do with the training; only the features of whatever is associated with the ids matter. You’re not going to be predicting a movie id or a user id.

Exactly, unless, for example, the movie ids are assigned in chronological order, so they may carry some release-time information that I cannot get from other features. However, it is unlikely I would include any of them.
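As a rough illustration of leaving an id out of the training features, the sketch below slices the id column off each row before the rows would be passed to a model. The rows, the column layout, and the name `NUM_ID_COLUMNS` are all hypothetical; this is not how the lab itself is written.

```python
# Hypothetical rows: [user id, rating count, rating ave, genre score, genre score]
user_rows = [
    [1, 12, 3.5, 4.0, 2.5],
    [2, 5, 4.1, 3.0, 4.5],
    [3, 30, 2.9, 3.5, 3.0],
]

NUM_ID_COLUMNS = 1  # assumed: only the first column is an identifier

# Keep the id for bookkeeping, but feed only the remaining columns to training.
ids = [row[0] for row in user_rows]
features = [row[NUM_ID_COLUMNS:] for row in user_rows]

print("ids kept aside:", ids)
print("training features:", features)
```

The ids stay available for looking up which user a prediction belongs to, without influencing what the model learns.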

:ok_hand:

I’m glad that all these ideas are sinking in.
