Hi everyone, I’m currently reviewing my lesson on recommender systems, especially the content-based filtering part.
In the Programming Assignment: Deep Learning for Content-Based Filtering (from week 2 of Unsupervised learning), the course uses the MovieLens dataset. The user data contains the average rating given by each user plus that user’s average rating per genre, and the movie data contains the average rating of each movie plus a binary value for each potential genre.
I’m just wondering whether I can drop the average rating from both the user data and the movie data (leaving only the rating per genre in the user data and the binary genre values in the movie data). Or would that wreck the accuracy?
Thank you for reading this question. I really hope you understand what I’m trying to say. I would appreciate any reply.
It is possible to get rid of it, but before we can guess whether the model will perform worse, we need some measurements or observations in hand. What is your motivation for considering it?
So I’m doing a project on a scholarship recommendation system. The system will recommend scholarships that suit the user based on their profile. The problem is actually with the dataset.
For the user data, I plan to use awards, number of volunteer experiences, etc., and give each feature a ‘rate’ on a scale of 0-5 (like a movie rating). On the other hand, for the scholarship data, I plan to use the same features as in the user data, but with binary values (for example, if the requirements call for awards and volunteer experience, those features get the value 1).
It’s easy to calculate the average ‘rating’ for the user data, but I don’t think the same can be done for the scholarship data, and I have never found a ‘rating’ for scholarships. Or could I just generate my own rating for each scholarship (for example, based on acceptance rate, etc.)? Is there any chance of it being biased if the rating is not based on the features listed?
The lab was not designed to demo the process of feature engineering, which is our focus right now. The process involves brainstorming ideas (like those you suggested) and validation (train a model and evaluate it). If we have 5 sets of ideas, we train 5 models, compare their validation scores, and see which additions help. This is the basic idea, and whatever we discuss next will not depart from it.
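The train-and-compare loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lab's code: the data is synthetic, a least-squares linear model stands in for the neural network, and the names `validation_mse`, `compare_feature_sets`, and the `idea_*` labels are hypothetical.

```python
import numpy as np


def validation_mse(X_train, y_train, X_val, y_val):
    """Fit a least-squares linear model (with bias) and return validation MSE."""
    A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
    A_val = np.hstack([X_val, np.ones((len(X_val), 1))])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ w - y_val) ** 2))


def compare_feature_sets(X, y, feature_sets, n_train):
    """Train one model per candidate feature set, all on the same split."""
    scores = {}
    for name, cols in feature_sets.items():
        Xs = X[:, cols]
        scores[name] = validation_mse(
            Xs[:n_train], y[:n_train], Xs[n_train:], y[n_train:]
        )
    return scores


# Synthetic stand-in data: the target depends on columns 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Each "idea" is a list of column indices to feed the model.
feature_sets = {"idea_A": [0], "idea_B": [0, 1], "idea_C": [0, 1, 2]}
scores = compare_feature_sets(X, y, feature_sets, n_train=150)
best = min(scores, key=scores.get)

for name, mse in scores.items():
    print(f"{name}: validation MSE = {mse:.4f}")
print("lowest validation MSE:", best)
```

Every candidate is scored on the same validation rows, so the differences in MSE come from the features and not from the split.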
Great ideas. Just to generalize and extend a bit –
If you were the one giving the awards, how would you judge?
If you were the one receiving the awards, how would you judge? Is there any case where a student can only receive one award? Any case where a student would NOT be given more than one award?
Talking to some teachers is definitely going to help your brainstorming.
Well, again, generalize a bit and then re-focus: if we consider a rating as an indicator of the interaction between a movie and a user, then in the student-scholarship case, what about application rates? Success rates? Inquiry rates?
It’s possible as long as it won’t take you too much time, and it is better if you first write down a list of conditions or an equation for how you come up with the number, so that your decisions won’t shift from student to student. Shifting by itself is a problem.
Your concern about bias is not a problem, because your model can take care of it. I would be more worried about shifting: humans are not robots. It can be difficult to maintain the same approach to judgement over a long period of time when new information keeps coming into our minds.
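One way to keep a hand-made rating from shifting is to encode the written-down rule as a function, so the same inputs always give the same number. The sketch below is only an illustration: the inputs `acceptance_rate` and `application_rate`, the weights, and the mapping onto a 0-5 scale are all hypothetical choices, not a recommended formula.

```python
# Hypothetical weights, written down once and applied to every scholarship.
WEIGHTS = {"acceptance_rate": 0.5, "application_rate": 0.5}


def scholarship_rating(rates):
    """Map rates in [0, 1] to a 0-5 rating using a fixed weighted average."""
    for name in WEIGHTS:
        value = rates[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    weighted = sum(WEIGHTS[name] * rates[name] for name in WEIGHTS)
    return round(5.0 * weighted / sum(WEIGHTS.values()), 2)


# Hypothetical example values for one scholarship.
example = {"acceptance_rate": 0.2, "application_rate": 0.6}
print(scholarship_rating(example))
```

Because the rule lives in code, changing it later means changing it for every scholarship at once, which is the consistency the written list of conditions is meant to provide.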
Here is my suggestion. Build a model without your “rating” and evaluate it. See where you are, and see if you are satisfied. If not, then make your “rating”, but don’t make just one number. If the generation process involves combining a few pieces of info to get the final rating, then don’t combine them; write them all down separately. Chances are you may find that only 2 of them improve the model, and this is how we avoid a total loss if the combined version turns out to be unhelpful.
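The suggestion above (baseline first, then each rating ingredient as its own column) can be sketched like this. Everything here is an assumption for illustration: the data is synthetic, a least-squares linear model stands in for the real model, and the component names `acceptance_rate`, `application_rate`, and `inquiry_rate` are hypothetical.

```python
import numpy as np


def validation_mse(X_train, y_train, X_val, y_val):
    """Fit a least-squares linear model (with bias) and return validation MSE."""
    A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
    A_val = np.hstack([X_val, np.ones((len(X_val), 1))])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ w - y_val) ** 2))


rng = np.random.default_rng(1)
n, n_train = 300, 200
base = rng.normal(size=(n, 2))  # features available without any "rating"
components = {                  # hypothetical rating ingredients, kept separate
    "acceptance_rate": rng.uniform(size=n),
    "application_rate": rng.uniform(size=n),
    "inquiry_rate": rng.uniform(size=n),
}
# Synthetic target: only acceptance_rate carries extra signal.
y = base[:, 0] + 3.0 * components["acceptance_rate"] + 0.1 * rng.normal(size=n)


def score(X):
    """Validation MSE for feature matrix X on a fixed train/validation split."""
    return validation_mse(X[:n_train], y[:n_train], X[n_train:], y[n_train:])


baseline = score(base)  # step 1: model without the rating
gains = {}
for name, column in components.items():
    X_plus = np.column_stack([base, column])  # step 2: add one ingredient
    gains[name] = baseline - score(X_plus)    # positive means it helped

print(f"baseline validation MSE = {baseline:.4f}")
for name, gain in gains.items():
    print(f"{name}: MSE reduction vs. baseline = {gain:+.4f}")
```

In this synthetic setup only one ingredient shows a clear gain, which would be hidden if all three had been merged into a single number before training.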
With the evaluation results, you can also decide whether your rating or any of the features added afterwards are introducing problems, and what kind of problems they are. In course 2 Andrew talked about how to read the learning curve, right? Now is the time to try and tell yourself what problem you could be facing with any additional features.
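As a reminder of how those readings turn into a diagnosis, here is a small sketch of the bias/variance rule of thumb: a training error well above the baseline suggests high bias, and a validation error well above the training error suggests high variance. The function name `diagnose`, the `tol` threshold, and the example numbers are hypothetical.

```python
def diagnose(baseline_err, train_err, val_err, tol=0.05):
    """Return a rough label from three error values (lower is better)."""
    high_bias = (train_err - baseline_err) > tol      # model underfits
    high_variance = (val_err - train_err) > tol       # model overfits
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "looks fine"


# Hypothetical errors measured after adding a feature.
print(diagnose(baseline_err=0.10, train_err=0.11, val_err=0.30))  # high variance
print(diagnose(baseline_err=0.10, train_err=0.25, val_err=0.27))  # high bias
```

The threshold is a judgment call; what matters is applying the same one before and after each feature is added.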
To conclude, it’s really an iterative process, so try all the quick features first, and then the ones that take more effort but cannot be avoided.