My first logistic regression model gives bad accuracy. Please check

Hey everyone,

I need some help. I’ve developed a model trained on a housing dataset using logistic regression. However, the accuracy I’m achieving is only 4.96%. Can you help me figure out why it’s not doing better?

Here is the code:
Assignments (1).ipynb (364.0 KB)

Dataset:
housing[1].csv (1.1 MB)

Is this supposed to be a linear regression problem instead of a classification problem? Just my casual guess. I think I have seen this problem somewhere before.

From looking at your notebook, you’re trying to predict the “median house value”. That’s a real number. So logistic regression is the wrong method.

Hello @Amarta_Waghani

Based on the pictures @TMosh shared, you haven’t shown any relationship between the features and your target variable.

Using axis=1 with df.drop removes the entire named column from the DataFrame it returns.

Regards
DP

@chuaal, In this dataset, there’s a column containing categorical values, which is why logistic regression is being used instead of linear regression.

Having categorical features does not require you to use logistic regression. Categorical features are usually converted to one-hot true (1) / false (0) values.
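
As a rough illustration of that one-hot conversion, here is a minimal sketch using pandas. The rows are made up for the example, and the category values are assumptions rather than values read from housing.csv; only the column names come from this thread.

```python
import pandas as pd

# Made-up sample rows; the category values are assumed, not taken from housing.csv.
df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
    "median_house_value": [120000.0, 340000.0, 95000.0, 410000.0],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=int)

print(encoded.columns.tolist())
```

The encoded columns are ordinary numeric features, so they can be fed to linear regression like any other column.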

The key difference between linear and logistic regression is what is being predicted.

  • If the output is a real value, then it’s linear regression.
  • If the output is true/false or a classification, then it’s logistic regression.
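
To make the two cases above concrete, here is a small sketch with scikit-learn on synthetic data. The data, the coefficients, and the "above the median" label are all invented for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # synthetic features, purely illustrative

# Real-valued target (like a price): use linear regression.
y_real = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_real)
price_pred = reg.predict(X[:3])   # continuous numbers

# True/false target (here: "is the value above the median?"): use logistic regression.
y_class = (y_real > np.median(y_real)).astype(int)
clf = LogisticRegression().fit(X, y_class)
label_pred = clf.predict(X[:3])   # class labels, 0 or 1
```

The same features are used in both fits; only the type of target decides which model applies.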

Hi @Deepti_Prasad

In that @TMosh picture, I separated the feature variables by removing the target variable from the dataset. I assigned these features to the ‘x’ variable. On the other hand, I assigned the target variable ‘median_house_value’ to the ‘y’ variable.
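
For readers who cannot open the notebook, the split described here probably looks something like the sketch below. The rows are made up, and the exact notebook code is not shown in this thread, so treat this as an assumption about its shape.

```python
import pandas as pd

# Made-up rows; only the column names are taken from this thread.
df = pd.DataFrame({
    "latitude": [37.88, 37.86, 34.05],
    "total_rooms": [880, 7099, 1467],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

# axis=1 means "drop a column"; drop returns a new DataFrame and leaves df unchanged.
x = df.drop("median_house_value", axis=1)   # feature columns
y = df["median_house_value"]                # target column
```

Because df.drop does not modify df in place by default, the target column is still available when y is assigned.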

@Amarta_Waghani

Can you tell me how the housing price changes with the median house value variable?

Also, by "removing the target variable from the dataset", do you mean you are extracting only one particular categorical column from the dataset?
If the answer to that question is yes, then you do not need df.drop. Instead, inspect the data with df.head, select the column you need (make sure you have removed any null values), and then check how that column relates to the housing price.

Regards
DP

@Deepti_Prasad,

I believe you might not have seen the entire code. Could you please review the code and let me know what mistakes I might have made?

Thanks!

I looked at your notebook again, and the key error is here:
[image: screenshot from the notebook]

Since you are trying to predict the median_house_value, which is a real floating-point number, you need to use linear regression.

Not logistic regression - that is used for predicting classifications.
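
A quick sketch of why an accuracy score is not a useful measure for a real-valued target: accuracy counts exact matches only, so even predictions that are very close to the truth score zero. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up true prices and predictions that are each within about 1% of the truth.
y_true = np.array([452600.0, 358500.0, 352100.0, 341300.0])
y_pred = np.array([450000.0, 360000.0, 350000.0, 343000.0])

# Classification accuracy counts only exact matches, so it is 0 here.
accuracy = np.mean(y_pred == y_true)

# Regression metrics measure how far off the predictions are instead.
mae = np.mean(np.abs(y_pred - y_true))            # mean absolute error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination

print(accuracy, mae, r2)
```

With a regression model, metrics such as the mean absolute error or R² describe the fit far better than accuracy does.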

Hello @Amarta_Waghani,

I checked your notebook and dataset. Can you explain what kind of model you are trying to create? Nowhere in the post have you explained what kind of correlation you are modelling.

You have used latitude (which is a negative variable) and total rooms to get a median housing value. This is done incorrectly: I cannot see you creating any relationship between these variables, other than a graph showing that latitude would not be the right variable for predicting median housing value.

Next, as @TMosh mentioned, your data calls for linear regression, but you have used logistic regression, which is causing the issue.

So, kindly first tell us what kind of model you are trying to create, based on which features, or what you are trying to analyse.

If you are doing a regression analysis between median housing value and total rooms, then try to find the relationship between the two.

My suggestion would be to relate median housing age and total rooms to median housing value. (This suggestion is made without knowing what you are actually looking for in your model.)

Regards
DP

Hey there,

@Deepti_Prasad @TMosh,

I’ve made some updates to the code. Here’s the description:

  • Preprocess the data:
    • Select the target variable as median_house_value.
    • Remove unnecessary columns.
    • Perform one-hot encoding on ocean_proximity.
    • Implement IQR to remove outliers.
    • Implement linear regression.
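
The IQR step in the list above could be sketched roughly as follows. The notebook's actual implementation is not shown in this thread, so the column choice, the made-up values, and the conventional 1.5 multiplier are all assumptions.

```python
import pandas as pd

# Made-up target values with one obvious outlier.
df = pd.DataFrame(
    {"median_house_value": [100.0, 110.0, 120.0, 130.0, 140.0, 1000.0]}
)

q1 = df["median_house_value"].quantile(0.25)
q3 = df["median_house_value"].quantile(0.75)
iqr = q3 - q1

# Keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; 1.5 is the conventional multiplier.
mask = df["median_house_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = df[mask]

print(len(df), len(filtered))
```

It is worth checking how many rows the filter removes, since an aggressive filter changes the distribution the model is trained on.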

However, it still gives a root mean square error (RMSE) of 0.25.
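
A value like 0.25 is hard to judge on its own, because a root mean square error is expressed in the units of the target. One way to put it in context is to compare it with a baseline that always predicts the mean. The sketch below uses made-up numbers and a hypothetical helper named rmse; it is not the notebook's code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets (on some scaled unit) and model predictions.
y_true = np.array([0.2, 0.5, 0.9, 1.4])
y_pred = np.array([0.3, 0.4, 1.0, 1.2])

model_rmse = rmse(y_true, y_pred)
# Baseline: always predict the mean of the targets.
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print(model_rmse, baseline_rmse)
```

If the model's error is not clearly below the baseline's, the features are adding little beyond the average value.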

Assignments (2).ipynb (261.7 KB)

Hello @Amarta_Waghani

Can you please let me know what kind of model analysis you are trying to build? Why does median house value hold importance in your analysis?

As far as I can see, you have now changed your relative analysis from median house value to ocean proximity, so I am unsure about your analysis perspective.

Regards
DP

Looking at the data set, it appears to me that it depicts the median house price within geographic areas that are identified by a central latitude/longitude point.

The other columns create the X training features as shown below.

So the goal is to predict the median house price as a function of the location and ocean proximity.

This is modeled using linear regression.

That all seems fine.

What seems to be missing is normalizing the data set before training.
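
One common way to add that normalization step is to standardize the features inside a scikit-learn pipeline. This is a minimal sketch on synthetic data; the feature ranges and coefficients are invented, and StandardScaler is only one of several reasonable scalers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales (e.g. room counts vs. latitude).
X = np.column_stack([
    rng.uniform(500, 10000, size=100),
    rng.uniform(32, 42, size=100),
])
y = 20.0 * X[:, 0] + 5000.0 * X[:, 1] + rng.normal(scale=1000.0, size=100)

# The pipeline fits the scaler on the data passed to fit, then fits the regressor.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

# After scaling, each feature has mean 0 and standard deviation 1.
X_scaled = model.named_steps["standardscaler"].transform(X)
```

Scaling matters most when the model is trained by gradient descent or uses regularization; keeping it in a pipeline also ensures the same transformation is applied to any test data.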
