I need some help. I’ve developed a model trained on a housing dataset using logistic regression. However, the accuracy I’m achieving is only 4.96%. Can you help me figure out why it’s not doing better?
Having categorical features does not require you to use logistic regression. Categorical features are usually converted to one-hot true (1) / false (0) values.
The key difference between linear and logistic regression is what is being predicted.
If the output is a real value, then it’s linear regression.
If the output is true/false or a classification, then it’s logistic regression.
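To make the point above concrete, here is a minimal sketch of one-hot encoding a categorical column with pandas. The column names mirror the ones mentioned in this thread, but the values are invented for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Tiny invented frame; the values are made up for illustration only.
df = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
    "median_house_value": [452600.0, 98000.0, 341300.0, 450000.0],
})

# One-hot encode the categorical column: one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=int)

# The target is a real number, so this is still a regression problem,
# even though one of the inputs is categorical.
target_is_real_valued = pd.api.types.is_float_dtype(df["median_house_value"])

print(encoded.columns.tolist())
print(target_is_real_valued)
```

The encoded frame keeps the real-valued target untouched, so a linear regressor is still the natural choice here.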
@TMosh In that picture, I separated the feature variables by removing the target variable from the dataset. I assigned these features to the ‘x’ variable, and I assigned the target variable ‘median_house_value’ to the ‘y’ variable.
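For readers following along, the split described above would look roughly like the sketch below. The small frame is invented and only stands in for the real dataset; the actual notebook may differ.

```python
import pandas as pd

# Invented rows standing in for the real housing data.
df = pd.DataFrame({
    "latitude": [37.88, 37.86, 34.05],
    "total_rooms": [880.0, 7099.0, 1467.0],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

# Features: every column except the target. drop() returns a new frame
# and leaves df itself unchanged.
x = df.drop(columns=["median_house_value"])

# Target: the single column to be predicted.
y = df["median_house_value"]

print(x.columns.tolist())
print(y.name)
```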
Can I ask how the housing price changes with the median house value variable?
Also, when you say you removed the target variable from the dataset, do you mean you are extracting only that particular categorical value from the dataset?
If the answer to the above question is yes, then you do not need to use df.drop; rather, call df.head, select the defined column (make sure you have removed any null values), and then check how that column relates to the housing price.
I checked your notebook and dataset. Can you explain what kind of model you are trying to create? Nowhere in the post do you explain what kind of correlation you are modelling.
You have used latitude (which is a negative variable) and total rooms to get a median housing value. This is done incorrectly, as I cannot see you creating any relation between these variables other than a graph, which shows that latitude would not be the right variable for getting the median housing value.
Next, as @TMosh mentioned, your data seemed to call for logistic regression, but you have created a linear regression, which is creating all the issues.
So, kindly first tell us briefly what kind of model you are trying to create, based on which features, and what you are trying to analyse.
If you are creating a regression analysis between median housing value and total rooms, then try to find what the relation between the two is.
My suggestion would be to relate median housing age and total rooms to median housing value. (This suggestion is made without knowing what you are basically looking for in your model.)
Looking at the data set, it appears to me that it depicts the median house price within geographic areas that are identified by a central latitude/longitude point.
The other columns create the X training features as shown below.
So the goal is to predict the median house price as a function of the location and ocean proximity.
@TMosh I did try normalization, but unfortunately, it didn’t improve the accuracy of the model. I also experimented with different regressor models, but the results remained unsatisfactory.
@Deepti_Prasad, This assignment involves using a provided dataset to make predictions using regressor models. The target variable for prediction is the median house value.
So based on this statement, can I take it that only the target variable matters, and how many feature variables you use is not prescribed, right?
I think if I were you, I would first go with the simplest linear regression model (for example, to find the correlation of median house age to median house value), then switch to what Tom is telling you: apply the latitude or longitude feature to the target variable.
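As a rough sketch of that "simplest model first" idea, the snippet below checks the correlation between one feature and the target and fits a straight line with NumPy. The numbers are invented and deliberately lie exactly on a line, so they say nothing about the real dataset.

```python
import numpy as np

# Invented values: value = 130 + 4 * age, chosen so the answer is known.
median_house_age = np.array([5.0, 10.0, 20.0, 30.0, 40.0])
median_house_value = np.array([150.0, 170.0, 210.0, 250.0, 290.0])

# Pearson correlation between the single feature and the target.
r = np.corrcoef(median_house_age, median_house_value)[0, 1]

# Least-squares straight line; polyfit returns the highest degree first.
slope, intercept = np.polyfit(median_house_age, median_house_value, deg=1)

print(r, slope, intercept)
```

On real data the correlation will be lower than on this toy example, and a weak value for one feature is a hint to try the others (such as latitude and longitude) next.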
I saw your updated assignment. The normalization of data is only getting applied to the ocean proximity column, which could be one of the issues with your model.
Why do you not scale your features based on feature parameters that define a fixed value, such as the maximum value of median house age, so that for a set of x features, values above that scale are marked as 1 and values below as 0?
There are various normalisation techniques you could use: apply the max or min parameter (if you have extreme outliers), take the log of the x features if that puts the x variables on a common scale, or use a z-score if you do not have any extreme outliers.
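The three techniques named above can be sketched with plain NumPy as follows. The array is invented, and each transform is applied to every numeric column rather than to a single one. Which technique suits the real dataset is left open here.

```python
import numpy as np

# Invented feature matrix: two columns on very different scales.
x = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 1000.0],
])

# Min-max scaling: each column is mapped onto the range [0, 1].
minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Z-score: each column ends up with mean 0 and standard deviation 1.
zscore = (x - x.mean(axis=0)) / x.std(axis=0)

# Log transform: log1p computes log(1 + x), which compresses large values.
logged = np.log1p(x)

print(minmax)
print(zscore)
print(logged)
```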
I’m struggling to figure out how to use math with a dataset. So far, I’ve only used built-in functions. Could you explain to me how to apply math to a dataset?
What is your reason for believing the accuracy of the model can be improved?
Maybe your results are as good as possible for that dataset. Perhaps housing prices can’t be accurately predicted using a linear combination of these features.
Are there some reference results from other people who have worked on this data set?
I recommend you test your code with a simple invented dataset that has a known simple solution (see below). Then you can say whether there is an issue in your code for the model, or whether this is just a difficult dataset to use.
= = = = =
Testing your model code using an invented dataset that has more predictable results:
You could add some code that uses your normalized data: assign some arbitrary weight values, and compute a set of ‘y’ values you can use for training.
Your model should be able to perfectly re-create these ‘y’ values, and give the same weights as you assigned. The “r2_score” in this case should be very close to 1.0.
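A minimal sketch of that test is shown below. It makes several assumptions: the features are random numbers standing in for your normalized data, the weights and bias are arbitrary, the fit uses NumPy least squares as a stand-in for your model code, and R² is computed from its standard formula rather than by calling a library's r2_score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the normalized features (200 examples, 3 features).
X = rng.normal(size=(200, 3))

# Arbitrary "known" weights and bias, used to invent noise-free targets.
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y = X @ true_w + true_b

# Fit a linear model by least squares; the column of ones learns the bias.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

# R^2 = 1 - SS_res / SS_tot.
y_pred = A @ coef
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(w_hat, b_hat, r2)
```

If your own model code recovers the assigned weights on data like this but still scores poorly on the housing data, the difficulty most likely lies in the dataset rather than in the code.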