Adding features

Hello,

I just got to the "Adding features" video and I have a question about the following guidance from Andrew:
"For structure data problems, usually you have a fixed set of users or a fixed set of restaurants or fixed set of products, making it hard to use data augmentation or collect new data from new users that you don’t have yet on restaurants that may or may not exist. Instead, adding features, can be a more fruitful way to improve the performance of the algorithm to fix problems like this one, identify through error analysis. Additional features like these, can be hand coded or they could in turn be generated by some learning algorithm, such as having a learning average home, try to read the menu and classify meals as vegetarian or not, or having people code this manually could also work depending on your application. "

How can we fill in the added features in a real-world situation? I’m thinking about the example “is a person vegetarian”, expressed as a probability. Should we first create a model that fills in this added feature based on that person’s historical data, and then use the new feature in the model that does the recommendation?
And what about “having people code this manually”? What does that mean?
Thank you for your answer.

Hi @Naoufal_Rahali ,

Thanks for reaching out. I have not gone through this course material, but I think I can help clarify things here.

So basically, in feature engineering you can create new features on your own from the given dataset.

Let me give you an example: Suppose you have an ecommerce dataset and you are building a model to predict the number of visitors. You have a dataset with multiple columns, and one of the columns is “Date”. To build the model, you can use this “Date” column to create several other columns yourself (these columns are not provided to you, but you can engineer them on your own - feature “engineering”), such as “Hour”, “Day of Week”, “Day of Month”, “Month”, etc.

Why do we need these columns? Because “Date” carries multiple pieces of information which can be broken down into the columns above. If you look at the columns we generated, they all make sense: the number of visitors will depend on “Hour” (more visitors in the evening and at night, when everyone is relaxing after the day’s work), on “Day of Week” (more visitors on weekends), and so on.
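To make this concrete, here is a minimal sketch of how such columns could be derived with pandas. Only the “Date” column comes from the example above; the sample data and the new column names are made up for illustration.

```python
import pandas as pd

# Hypothetical e-commerce visits data with a "Date" column, as in the example above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-01-06 18:30", "2023-01-07 21:15", "2023-01-09 09:00",
    ]),
    "Visitors": [120, 340, 80],
})

# Engineer new columns from "Date" by hand.
df["Hour"] = df["Date"].dt.hour
df["DayOfWeek"] = df["Date"].dt.dayofweek   # 0 = Monday, 6 = Sunday
df["DayOfMonth"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["IsWeekend"] = df["DayOfWeek"] >= 5

print(df)
```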

This is basically what it means for people to code features manually: by understanding the business problem we are working on, we can engineer new features from the current dataset.

We can also use techniques like imputation to fill in missing values.
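For simple imputation, one common option is scikit-learn’s SimpleImputer; here is a minimal sketch (the toy data is just a placeholder):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with a missing value (np.nan) in the second column.
X = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Replace missing entries with the column mean
# (other strategies: "median", "most_frequent", "constant").
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```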

There are several articles which you can go through over the internet for the same, and here is one of them: https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10

Hope this answers your question. Let me know if you want to know anything else. Happy to help! :slight_smile:

Hi @c.godawat ,
Thanks for your answer. The date transformation example you mentioned is clear because it can be handled with a direct transformation, but what if the new feature is a kind of probability or inference, i.e. a piece of information that does not obviously exist in the data and has to be inferred? The case of features generated by some learning algorithm is mainly what I’m trying to understand: how does it work in practice?
Regards

Hi @Naoufal_Rahali

Such techniques are generally used when we expect new values in a particular feature once the model runs in production (for example, the column has a certain set of categories in the training dataset, but after the model goes live we may see values that were never observed during training). In such cases, one option is to predict the values of this feature column from the other features.

A second use case is filling missing values by creating a model. Suppose we have three features f1, f2, f3, and f3 has missing values. We can train a model that predicts f3 from f1 and f2, and use its predictions to fill the gaps.
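A minimal sketch of that idea, assuming a pandas DataFrame with hypothetical columns f1, f2, f3 where some f3 values are missing (the data and model choice are arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: f3 is missing for some rows.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [10.0, 9.0, 8.0, 7.0, 6.0, 5.0],
    "f3": [2.1, 4.0, np.nan, 8.1, np.nan, 12.2],
})

known = df["f3"].notna()

# Train a model to predict f3 from f1 and f2 on the rows where f3 is known...
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(df.loc[known, ["f1", "f2"]], df.loc[known, "f3"])

# ...and use its predictions to fill in the missing f3 values.
df.loc[~known, "f3"] = model.predict(df.loc[~known, ["f1", "f2"]])
print(df)
```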

One more example is using unsupervised learning algorithms. Suppose we have a dataset with latitude and longitude points (geospatial data). The raw latitude and longitude values may not be that useful on their own (it depends on the problem being solved), so we can apply a clustering algorithm that groups these coordinates into clusters. In other words, we use an unsupervised learning model to generate more useful features.
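Here is a minimal sketch of that clustering idea with scikit-learn’s KMeans; the coordinates and the number of clusters are arbitrary placeholders (for real geospatial data you might project the coordinates first, since Euclidean distance on raw lat/lon is only a rough approximation):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy geospatial data: latitude/longitude points.
df = pd.DataFrame({
    "lat": [40.71, 40.73, 34.05, 34.10, 51.50, 51.52],
    "lon": [-74.00, -73.99, -118.24, -118.20, -0.12, -0.10],
})

# Cluster the coordinates and use the cluster id as a new categorical feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["geo_cluster"] = kmeans.fit_predict(df[["lat", "lon"]])
print(df)
```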

These are some examples where we use another model to predict features, and then build a final model that solves our target problem.

Also, note that predicted features carry their own uncertainty, so proper risk analysis and testing should be done when creating such models.

Thanks for all your answers :slightly_smiling_face: