Content-based filtering: If I don't use one of the features in the training data, will the accuracy be wrecked?

Annisa_Qurrota_A_yun · November 28, 2023, 8:00am

Hi everyone, I’m currently reviewing my lesson on recommender systems, especially the content-based filtering part.

In the Programming Assignment: Deep Learning for Content-Based Filtering (from week 2 of Unsupervised learning), the course uses the MovieLens dataset. It includes the average rating given by each user (in the user data, which consists of the user's average rating and their rating per genre) and the average rating of each movie (in the movie data, which consists of the movie's average rating and a binary value for each potential genre).

I’m just wondering: is it possible to get rid of the average rating in both the user data and the movie data (so the user data is just the rating per genre, and the movie data is just the binary value for each potential genre)? Or would that wreck the accuracy?

Thank you for reading this question. I really hope you understand what I’m trying to say. I would appreciate any reply.

rmwkwok · November 28, 2023, 8:33am

Hi @Annisa_Qurrota_A_yun,

It is possible to get rid of it, but before we can guess whether the model will perform worse, we need to have some measurements or observations in hand. What is your motivation for considering it?

Raymond

Annisa_Qurrota_A_yun · November 28, 2023, 9:17am

So I’m doing my project on a scholarship recommendation system. The system will recommend scholarships that suit the user based on their profile. The problem, actually, is the dataset.

For the user data, I plan to use awards, number of volunteer experiences, etc., and give each feature a ‘rate’ on a scale of 0-5 (like a movie rating). On the other hand, for the scholarship data, I plan to use the same features as in the user data, but with binary values (for example, if the requirements say the applicant needs awards and volunteer experience, those features get the value 1).

It’s easy to calculate the average ‘rating’ for the user data, but I don’t think the same thing can be applied to the scholarship data, and I have never found a ‘rating’ for scholarships. Or is it possible to just generate my own rating for each scholarship (for example, based on acceptance rate, etc.)? Is there any chance of it being biased if the rating is not based on the features listed?

Thank you!

rmwkwok · November 28, 2023, 11:05am

Hey @Annisa_Qurrota_A_yun,

Thanks for the info! Very helpful!

The lab was not designed to demonstrate feature engineering, which is our focus right now. That process involves brainstorming ideas (like those you suggested) and validating them (training a model and evaluating it). If we have 5 sets of ideas, we train 5 models, compare their validation scores, and see which additions help. This is the basic idea, and whatever we discuss next won’t stray from it.
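To make the "one model per idea" loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the data is synthetic, the feature names are made up, and a plain least-squares model stands in for the neural network used in the lab. The only point is the shape of the loop, where each feature set gets its own model and its own validation score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the lab's MovieLens data).
n = 400
genre_feats = rng.normal(size=(n, 3))   # e.g. per-genre ratings
avg_rating = rng.normal(size=(n, 1))    # e.g. the average-rating column
y = (
    genre_feats @ np.array([1.0, -0.5, 0.8])
    + 0.7 * avg_rating[:, 0]
    + rng.normal(scale=0.1, size=n)
)

# One entry per idea: each candidate feature set gets its own model.
feature_sets = {
    "genres_only": genre_feats,
    "genres_plus_avg": np.hstack([genre_feats, avg_rating]),
}


def validation_mse(X, y, n_train=300):
    """Fit least squares on the first n_train rows, score on the rest."""
    X = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
    X_tr, X_va = X[:n_train], X[n_train:]
    y_tr, y_va = y[:n_train], y[n_train:]
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return float(np.mean((X_va @ w - y_va) ** 2))


scores = {name: validation_mse(X, y) for name, X in feature_sets.items()}
for name, mse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: validation MSE = {mse:.4f}")
```

In this toy setup the target was built to depend on the average-rating column, so dropping it hurts. On real data the answer could go either way, which is exactly why we measure instead of guessing.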

Great ideas. Just to generalize and extend a bit:

If you were the one giving the awards, how would you judge?
If you were the one receiving the awards, how would you judge? Is there any case where a student can only receive one award? Any case where a student would NOT be given more than one award?

Talking to some teachers is definitely going to help your brainstorming.

Well, again, generalize a bit and then re-focus: consider the rating as an indicator of interaction between movie and user. Then, in the student-scholarship case, what about application rates? Success rates? Inquiry rates?

Generating your own rating is possible as long as it won’t take you too much time. It is better if you first write down a list of conditions or an equation for how you come up with the number, so that your decisions won’t shift from student to student. Shifting by itself is a problem.

Your concern about bias is not a problem, because your model can take care of it. I would be more worried about shifting: humans are not robots. It can be difficult to keep the same approach to judgement over a long period of time while new information keeps coming to mind.

Here is my suggestion. Build a model without your “rating” and evaluate it. See where you are, and see if you are satisfied. If not, then make your “rating”, but don’t make just one number. If the generation process involves sticking a few pieces of info together to get the final rating, then don’t stick them together; write them all down as separate features. Chances are you will find that only 2 of them improve the model, and keeping them separate is how we avoid a total loss if the combined version turns out to be unhelpful.
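A minimal sketch of what "write the rule down and keep the pieces separate" could look like. The field names (`inquiries`, `applications`, `awards_granted`) and the two ratios are hypothetical, chosen only to echo the application, success, and inquiry rates mentioned above. Use whatever your data actually contains.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Scholarship:
    # All field names are hypothetical placeholders.
    name: str
    inquiries: int
    applications: int
    awards_granted: int


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def rating_components(s: Scholarship) -> Dict[str, float]:
    """Apply one fixed rule to every scholarship; return each signal separately."""
    return {
        "application_rate": safe_ratio(s.applications, s.inquiries),
        "success_rate": safe_ratio(s.awards_granted, s.applications),
    }


def combined_rating(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Optional single number; try it only after testing the separate pieces."""
    return sum(weights[key] * value for key, value in components.items())


example = Scholarship(name="Example Fund", inquiries=400, applications=200, awards_granted=50)
parts = rating_components(example)
print(parts)
print(combined_rating(parts, {"application_rate": 0.5, "success_rate": 0.5}))
```

Because the rule is a function, every scholarship is rated the same way no matter when you rate it, which addresses the shifting problem. Feeding `parts` to the model as separate columns lets you test each signal on its own before deciding whether a combined number is worth having.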

With the evaluation results, you can also decide whether your rating, or any feature you add afterwards, is introducing problems, and what kind of problems they are. In course 2 Andrew talked about how to read the learning curve, right? Now is the time to try it and tell yourself what problem you could be facing with any additional features.
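If it helps, here is a minimal sketch of computing a learning curve by hand. It rests on assumptions: the data is synthetic, and a least-squares model stands in for whatever model you end up training. The idea is to fit on growing slices of the training set and record both the training error and the validation error at each size.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (an assumption for illustration only).
n_total, n_train_max = 500, 400
X = rng.normal(size=(n_total, 5))
y = X @ np.array([1.0, 0.5, -0.3, 0.0, 0.2]) + rng.normal(scale=0.3, size=n_total)
X = np.hstack([X, np.ones((n_total, 1))])  # add a bias column

# Hold out the last 100 rows as a fixed validation set.
X_va, y_va = X[n_train_max:], y[n_train_max:]


def mse(X_part, y_part, w):
    """Mean squared error of the linear model w on the given rows."""
    return float(np.mean((X_part @ w - y_part) ** 2))


curve = []  # rows of (training size, training MSE, validation MSE)
for m in (10, 25, 50, 100, 200, 400):
    w, *_ = np.linalg.lstsq(X[:m], y[:m], rcond=None)
    curve.append((m, mse(X[:m], y[:m], w), mse(X_va, y_va, w)))

for m, train_err, val_err in curve:
    print(f"m={m:4d}  train MSE={train_err:.4f}  validation MSE={val_err:.4f}")
```

Reading it the way course 2 describes: if both errors stay high and close together, suspect high bias (the features or model are too weak); if training error is low but validation error stays well above it, suspect high variance (too many features for the data you have). Rerunning this after adding a feature shows which direction the change pushed you.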

To conclude, it’s really an iterative process, so try all the quick features first, and then the unavoidables.

Cheers,
Raymond

Annisa_Qurrota_A_yun · November 29, 2023, 2:04am

Thank you so much for your help, Raymond! I’ll try your suggestions. Have a good day!

rmwkwok · November 29, 2023, 2:06am

You too, Annisa. Cheers!

Raymond
