C3_W2 Understanding y_train


I do not understand how the values of y_train were obtained. In addition, the assignment says user 2 rates movie 6874 a 3.9, but based on the user data, user 2 rates action movies a 4.0. Can somebody explain this? Thanks.

Hello Jeremy, it’s been a while!

Are you asking about this description?

Above, we can see that movie 6874 is an Action/Crime/Thriller movie released in 2003. User 2 rates action movies as 3.9 on average. MovieLens users gave the movie an average rating of 4. ‘y’ is 4 indicating user 2 rated movie 6874 as a 4 as well. A single training example consists of a row from both the user and item arrays and a rating from y_train.

I think 3.9 and 4 are referring to two different things? 3.9 is an average score?


Hey Raymond,

Yes, I was asking about that description. I have been trying to figure out how the course obtains the value 3.9 (the average rating of action movies) when the average rating for the action genre is 4.0 for user 2.

Hello Jeremy,

It doesn’t sound unrealistic to me, because user 2 may like the action genre more than the general public does. Suppose there are only 3 users and only one action movie, and John, Mary and Tom rated it 3.8, 3.9 and 4.0 respectively. Then the average rating of action movies is 3.9, but the average rating from Tom alone is 4.0.
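As a quick sketch of that toy example (the three names and ratings are the hypothetical ones above, not values from the assignment data):

```python
# Hypothetical ratings of the single action movie in the toy example.
ratings = {"John": 3.8, "Mary": 3.9, "Tom": 4.0}

# Average over every user who rated the action movie.
overall_average = sum(ratings.values()) / len(ratings)

# Average over Tom's action ratings only (he rated just this one movie).
tom_average = ratings["Tom"]

print(round(overall_average, 2), tom_average)
```

The two numbers answer different questions: one is an average over all users, the other is one user's own rating.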

To verify,

import pandas as pd
# Wrap the training arrays in DataFrames so columns can be selected by name.
df1 = pd.DataFrame(item_train, columns=item_features)
df2 = pd.DataFrame(user_train, columns=user_features)

I think the purpose of the following code is pretty self-explanatory.

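# Mean of user 2's ratings over the action movies in the training set
y_train[(df2['user id']==2) & (df1['Action']==1)].mean()
# User 2's rating of movie 6874
y_train[(df2['user id']==2) & (df1['movie id']==6874)]
# Mean rating of movie 6874 over all users in the training set
y_train[df1['movie id']==6874].mean()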

Hi Raymond,

I checked the results for the phrase (User 2 rates action movies as 3.9 on average) using the code below:

y_train[(df2['user id']==2) & (df1['Action']==1)].mean()

However, the output is only 3.55. This seems far from the value stated in the assignment.


Besides the value being far off, are there any other issues?

Do you have any other questions about understanding y_train?

Hi Raymond,

No, I don’t have any other questions. I just find it weird that the average score of action movies by user 2 differs quite a bit. Anyway, I restarted the kernel and ran the code again. I got the following values for when user id == 2 & genre == action. However, the value is still not 3.9. Is this an issue in the programming assignment?


Hey Jeremy,

I am going to share this with the course team. Maybe they came up with 3.9 in a different way, and in that case I will give you an update. For now, I don’t see the difference (3.9 vs 3.78) causing any trouble in understanding the rest of the assignment, but you as a learner may have a different perspective, and that’s why I asked whether you had other questions.

Please let me know any time if you think of other things going through the rest of the assignment. I plan to share your finding after today.


Hey Raymond,

Thanks for helping me with this matter. I understand that this will not affect progress in the course. However, I think it’s always good practice to understand how the data is derived and to check the data for yourself. :smiley:

A successful machine learning model is built upon good data, so I think it is good practice anyway.

Hey Jeremy,

I totally agree with you!! :wink:


Note that max_count is 5. If you increase it, you will see that the average goes to around 3.9 :slightly_smiling_face:

Hi! I am completely new here and trying to figure out and replicate the results on similar datasets. I don’t quite understand the logic behind y_train. Could you please elaborate on it? Also, does this architecture mean that this is a type of supervised ML, since we train the model? Just so I understand the concepts…

Hello @SiarheiThor,

If you have gone through the discussion in this thread and the description in the assignment, and still have questions about it, would you mind sharing with us how you understand y_train? I think it is better to start from there.


It states that it is an average rating, but I am a bit confused: an average of what, and why? The item and user features are heavily preprocessed and somewhat hard to follow. I know it is not the objective of this course, but it would be gold to see a more detailed explanation of how all the arrays are constructed.

Ah, I see. y_train is NOT an averaged rating. Let’s read this part of the notebook more carefully:


ave rating is in item_train as one of the item’s features. y_train is not ave rating. y_train is the rating a user gave to a movie.

I do not have the source code that preprocessed the data, but they very likely put ave rating in as an engineered feature for the movies because it is a useful feature to add. I think taking the average should not be difficult to do, right? You only need to group the ratings by movie id, and then calculate the mean in each group.
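A minimal sketch of that group-then-average step, assuming a raw ratings table with made-up column names and values (this is not the course's actual preprocessing code, which I have not seen):

```python
import pandas as pd

# Hypothetical raw ratings table; column names and values are invented
# for illustration only.
ratings = pd.DataFrame({
    "user id":  [1, 2, 3, 1, 2],
    "movie id": [6874, 6874, 6874, 8798, 8798],
    "rating":   [3.5, 4.0, 4.5, 3.0, 4.0],
})

# Group the ratings by movie id, take the mean in each group, and
# broadcast that mean back onto every row as an engineered feature.
ratings["ave rating"] = ratings.groupby("movie id")["rating"].transform("mean")

print(ratings)
```

Every row for the same movie carries the same ave rating, while the rating column (what y_train would hold) still differs per user.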


Jeeeezz, a simple thing, but it took me a while. Just to be sure I get it:
user_train - a vector with the user's average rating per genre, based on that user's ratings (the same in every row for a movie rated by that user),
item_train - a vector containing the categorical values indicating the genres of the movie (ID) rated by the user,
y_train - the user's rating for that movie (ID)
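
A toy illustration of that layout, using made-up and heavily simplified arrays (the feature lists, movie id 100 and all values are assumptions for the sketch, not the notebook's data). Row i of all three arrays describes the same rating event, which is why a mask built from the user and item rows can index y_train:

```python
import numpy as np
import pandas as pd

# Simplified, hypothetical feature names (the notebook has many more).
user_features = ["user id", "Action", "Comedy"]   # user's per-genre averages
item_features = ["movie id", "ave rating", "Action", "Comedy"]

user_train = np.array([
    [2, 4.0, 3.0],
    [2, 4.0, 3.0],       # same user vector repeated for user 2's second movie
    [5, 3.5, 0.0],
])
item_train = np.array([
    [6874, 3.75, 1, 0],  # action movie
    [100,  3.0,  0, 1],  # comedy movie
    [6874, 3.75, 1, 0],
])
y_train = np.array([4.0, 3.0, 3.5])  # one individual rating per row

df_user = pd.DataFrame(user_train, columns=user_features)
df_item = pd.DataFrame(item_train, columns=item_features)

# Rows where user 2 rated an action movie.
mask = ((df_user["user id"] == 2) & (df_item["Action"] == 1)).to_numpy()
user2_action_mean = y_train[mask].mean()
print(user2_action_mean)
```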