C3_W2 Understanding y_train

jeremyunderscore · October 24, 2022, 4:05am

Hi,

I do not understand how the values of y_train were obtained. In addition, the assignment says user 2 rates movie 6874 as 3.9, but based on the user data, user 2 rates action movies as 4.0. Can somebody explain this? Thanks

rmwkwok · October 24, 2022, 7:29am

Hello Jeremy, it’s been a while!

Are you asking about this description?

Above, we can see that movie 6874 is an Action/Crime/Thriller movie released in 2003. User 2 rates action movies as 3.9 on average. MovieLens users gave the movie an average rating of 4. ‘y’ is 4 indicating user 2 rated movie 6874 as a 4 as well. A single training example consists of a row from both the user and item arrays and a rating from y_train.

I think 3.9 and 4 refer to two different things? 3.9 is an average score?

Raymond

jeremyunderscore · October 24, 2022, 9:38am

Hey Raymond,

Yes, I was asking about that description. I have been trying to figure out how the course obtains the value 3.9 (the average rating of action movies) when the average rating for the action genre is 4.0 for user 2.

rmwkwok · October 24, 2022, 11:16am

Hello Jeremy,

It doesn’t sound unrealistic to me, because user 2 may like the action genre more than the general public. If there are only 3 users in total and only one movie, which is an action movie, and John, Mary and Tom rated it 3.8, 3.9 and 4.0 respectively, then the average rating of action movies is 3.9 but the average rating from Tom alone is 4.0.
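
The arithmetic in this toy example can be checked in a few lines. This is only a sketch: the names and ratings are the hypothetical ones from the example, not data from the assignment.

```python
# Hypothetical ratings of a single action movie, taken from the toy example
ratings = {"John": 3.8, "Mary": 3.9, "Tom": 4.0}

# Average over all three users
overall_avg = sum(ratings.values()) / len(ratings)

# Tom rated only this one movie, so his personal average is his single rating
tom_avg = ratings["Tom"]

print(round(overall_avg, 2), tom_avg)
```

The overall average and one user's own average are computed over different sets of ratings, so they do not have to agree.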

To verify,

import pandas as pd
# Wrap the training arrays in DataFrames so columns can be selected by name
df1 = pd.DataFrame(item_train, columns=item_features)
df2 = pd.DataFrame(user_train, columns=user_features)

I think the purpose of each of the following lines is pretty self-explanatory.

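# User 2's average rating over the action movies they rated
y_train[(df2['user id']==2) & (df1['Action']==1)].mean()

# User 2's rating of movie 6874
y_train[(df2['user id']==2) & (df1['movie id']==6874)]

# Average rating of movie 6874 over all users in the training set
y_train[df1['movie id']==6874].mean()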
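
For anyone who wants to try the same kind of check outside the notebook, here is a minimal self-contained sketch. All arrays and values are made up for illustration, and the real user_train and item_train have many more columns. The only thing it assumes from the assignment is that row i of user_train, item_train and y_train all describe the same (user, movie) pair.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the assignment's arrays (hypothetical values)
user_features = ["user id"]
item_features = ["movie id", "Action"]        # "Action" is a 0/1 genre flag here
user_train = np.array([[2], [2], [2], [3]])
item_train = np.array([[6874, 1], [10, 1], [20, 0], [6874, 1]])
y_train = np.array([4.0, 3.5, 5.0, 4.5])      # one rating per (user, movie) row

df1 = pd.DataFrame(item_train, columns=item_features)
df2 = pd.DataFrame(user_train, columns=user_features)

# Boolean masks built from the DataFrames pick out matching rows of y_train
user2_action = ((df2["user id"] == 2) & (df1["Action"] == 1)).to_numpy()
user2_movie = ((df2["user id"] == 2) & (df1["movie id"] == 6874)).to_numpy()
movie_all = (df1["movie id"] == 6874).to_numpy()

user2_action_mean = y_train[user2_action].mean()  # user 2's mean over action movies
user2_movie_rating = y_train[user2_movie]         # user 2's rating of movie 6874
movie_mean = y_train[movie_all].mean()            # movie 6874's mean over all users

print(user2_action_mean, user2_movie_rating, movie_mean)
```

The masks work because the three arrays are row-aligned, so a condition on the user columns and a condition on the item columns can be combined with `&`.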

jeremyunderscore · October 24, 2022, 3:06pm

Hi Raymond,

I checked the results for the phrase (User 2 rates action movies as 3.9 on average) using the code below:

y_train[(df2['user id']==2) & (df1['Action']==1)].mean()

However, the output is only 3.55. This seems far from the value stated in the assignment.

rmwkwok · October 24, 2022, 5:13pm

Besides the value being far off, are there any other issues?

Are there any other questions understanding y_train?

jeremyunderscore · October 25, 2022, 1:15am

Hi Raymond,

No, I don’t have any other questions. I just find it weird that the average score of action movies by user 2 differs quite a bit. Anyway, I restarted the kernel and ran the code again. I got the following values for when user id == 2 & genre == action. However, the value is still not 3.9. Is this an issue in the programming assignment?

rmwkwok · October 25, 2022, 4:24am

Hey Jeremy,

I am going to share this with the course team. They may have come up with 3.9 in a different way, in which case I will give you an update. For now, I don’t see that the difference (3.9 vs 3.78) causes any trouble in understanding the rest of the assignment, but you as a learner may have a different perspective, and that’s why I asked whether you had other questions.

Please let me know any time if you think of other things going through the rest of the assignment. I plan to share your finding after today.

Raymond

jeremyunderscore · October 25, 2022, 7:01am

Hey Raymond,

Thanks for helping me on this matter. I understand that this will not affect the course’s progress. However, I think it’s always a good practice to understand how the data is derived and always check the data for yourself.

A successful machine learning model is built upon good data, so I think it is good practice anyway.

rmwkwok · October 25, 2022, 9:58am

Hey Jeremy,

I totally agree with you!!

Raymond

Shashank_Garg · November 14, 2022, 3:16am

Note that max_count is 5. If you increase it, you will see that the average comes out around 3.9.

SiarheiThor · January 4, 2023, 3:03pm

Hi! I’m completely new here and trying to figure out and replicate the results on similar datasets. I don’t quite understand the logic behind y_train. Could you please elaborate on it? Also, does this architecture mean that this is a type of supervised ML, since we train the model? Just so I understand the concepts…

rmwkwok · January 5, 2023, 6:14am

Hello @SiarheiThor,

If you have gone through the discussion in this thread and the description in the assignment and still have questions about it, would you mind sharing with us how you understand y_train? I think it is better to start from there.

Cheers,
Raymond

SiarheiThor · January 5, 2023, 8:36am

It states that it is an average rating, but I am a bit confused: an average of what, and why? The item and user features are heavily preprocessed and somewhat hard to follow. I know it is not an objective of this course, but it would be gold to see a more detailed explanation of how all the arrays are constructed.

rmwkwok · January 5, 2023, 8:46am

Ah, I see. y_train is NOT averaged rating. Let’s read this part of the notebook more carefully:

ave rating is in item_train as one of the item’s features. y_train is not ave rating. y_train is the rating a user gave to a movie.

I do not have the source code that preprocessed the data, but they very likely put ave rating in as an engineered feature for the movies because it is a useful feature to add. I think taking the average should not be difficult to do, right? You only need to group the ratings by movie id, and then calculate the mean in each group.
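
A minimal sketch of that group-and-average step, assuming a raw ratings table with one row per (user, movie) rating. The table, its column names and its values are hypothetical, and this is not the course's actual preprocessing code.

```python
import pandas as pd

# Hypothetical raw ratings: one row per (user, movie) rating
ratings = pd.DataFrame({
    "user id":  [1, 2, 3, 1, 2],
    "movie id": [6874, 6874, 6874, 10, 10],
    "rating":   [3.5, 4.0, 4.5, 2.0, 3.0],
})

# Group the ratings by movie id, then take the mean within each group
ave_rating = ratings.groupby("movie id")["rating"].mean()

print(ave_rating)
```

The result is a Series indexed by movie id, which could then be joined back onto the movie table as an extra feature column.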

Cheers,
Raymond

SiarheiThor · January 5, 2023, 9:30am

Jeeeezz, simple thing, but it took me a while. Just to be sure I get it:
user_train - a vector with the user’s average rating per genre, based on that user’s ratings (the same in the array for every movie rated by the user),
item_train - a vector with the categorical values indicating the genres of the movie (ID) rated by the user,
y_train - the user’s rating for that movie (ID)
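
To make that concrete, here is a rough sketch in plain Python of how one user's rows might be put together. The movies, genres and ratings are made up, and the layout is simplified: the real item_train also carries extra columns such as the movie's ave rating, and the course's actual preprocessing may differ.

```python
# Hypothetical ratings by one user: (movie id, genres of the movie, rating)
user_ratings = [
    (6874, {"Action", "Crime", "Thriller"}, 4.0),
    (10, {"Action"}, 3.0),
    (20, {"Comedy"}, 5.0),
]
genres = ["Action", "Comedy", "Crime", "Thriller"]

# User vector: the average rating this user gave within each genre
user_vec = []
for g in genres:
    rs = [r for _, gs, r in user_ratings if g in gs]
    user_vec.append(sum(rs) / len(rs) if rs else 0.0)

# One training example per rated movie: the same user vector,
# that movie's 0/1 genre flags, and the rating as the target y
examples = []
for movie_id, gs, r in user_ratings:
    item_vec = [1 if g in gs else 0 for g in genres]
    examples.append((user_vec, item_vec, r))

print(user_vec)
print(examples[0])
```

Note how the user vector repeats unchanged across all of that user's examples, while the item vector and the target change with each movie.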

rmwkwok · January 5, 2023, 11:28am

Yes!

Cheers,
Raymond
