My first logistic regression model gives bad accuracy. Please check

Hey everyone,

I need some help. I’ve developed a model trained on a housing dataset using logistic regression. However, the accuracy I’m achieving is only 4.96%. Can you help me figure out why it’s not doing better?

Here is the code:
Assignments (1).ipynb (364.0 KB)

Dataset:
housing[1].csv (1.1 MB)

1 Like

Is this supposed to be a linear regression problem instead of a classification problem? Just my casual guess. I think I've seen this problem somewhere before.

1 Like

From looking at your notebook, you’re trying to predict the “median house value”. That’s a real number. So logistic regression is the wrong method.

2 Likes

Hello @Amarta_Waghani

Based on the screenshots @TMosh shared, you haven't established any relationship between the features and your target variable.

Also note that calling df.drop with axis=1 deletes the entire column.
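For example (a toy frame, not the assignment's data):

```python
import pandas as pd

df = pd.DataFrame({"total_rooms": [880, 1200], "median_house_value": [150000, 210000]})

# axis=1 drops a column; axis=0 (the default) would drop rows by label
X = df.drop("median_house_value", axis=1)
print(X.columns.tolist())  # ['total_rooms']
```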

Regards
DP

2 Likes

@chuaal, In this dataset, there’s a column containing categorical values, which is why logistic regression is being used instead of linear regression.

1 Like

Having categorical features does not require you to use logistic regression. Categorical features are usually converted to one-hot true (1) / false (0) values.

The key difference between linear and logistic regression is what is being predicted.

  • If the output is a real value, then it’s linear regression.
  • If the output is true/false or a classification, then it’s logistic regression.
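As a quick illustration of the difference (a toy sketch with scikit-learn, not the assignment's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Real-valued target -> linear regression predicts a number
y_real = np.array([1.5, 3.1, 4.4, 6.2])
reg = LinearRegression().fit(X, y_real)
print(reg.predict([[5.0]]))   # a real value

# Binary target -> logistic regression predicts a class
y_class = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[5.0]]))   # 0 or 1
```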
2 Likes

Hi @Deepti_Prasad

In the picture @TMosh shared, I separated the feature variables by removing the target variable from the dataset. I assigned these features to the ‘x’ variable, and the target variable ‘median_house_value’ to the ‘y’ variable.

1 Like

@Amarta_Waghani

Can you explain how the housing prices change with the median house value variable?

Also, when you say you removed the target variable from the dataset, are you extracting only that particular categorical value from the dataset?
If so, you do not need df.drop; instead, use df.head, select the defined column (making sure you have removed any null values), and then check how that column relates to the housing price.

Regards
DP

1 Like

@Deepti_Prasad,

I believe you might not have seen the entire code. Could you please review the code and let me know what mistakes I might have made?

Thanks!

1 Like

I looked at your notebook again, and the key error is here:
[image]

Since you are trying to predict the median_house_value, which is a real floating-point number, you need to use linear regression.

Not logistic regression - that is used for predicting classifications.
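A minimal sketch of what the switch might look like (toy data and assumed column names, not your notebook):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy stand-in for the housing data (column names assumed from this thread)
df = pd.DataFrame({
    "median_income":      [2.5, 3.8, 5.1, 1.9, 4.4, 6.0],
    "total_rooms":        [880, 1200, 1500, 700, 1300, 1700],
    "median_house_value": [150000, 210000, 290000, 120000, 250000, 340000],
})

X = df.drop("median_house_value", axis=1)   # features
y = df["median_house_value"]                # real-valued target

model = LinearRegression().fit(X, y)
print(r2_score(y, model.predict(X)))        # regression metric, not "accuracy"
```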

3 Likes

Hello @Amarta_Waghani,

I checked your notebook and dataset. Can you explain what kind of model you are trying to create? Nowhere in the post have you explained what kind of correlation you are modeling.

You have used latitude (which is a negative-valued variable) and total rooms to predict the median house value. This is done incorrectly: I cannot see you creating any relationship between these variables, and other than the graph, nothing suggests latitude would be the right variable for predicting median house value.

Next, as @TMosh mentioned, your data calls for linear regression, but you have created a logistic regression model, which is creating all the issues.

So, kindly first tell us what kind of model you are trying to create, based on which features, or what you are trying to analyse.

In case you are doing a regression analysis between median house value and total rooms, first try to find the relationship between the two.

My suggestion would be to relate median house age and total rooms to median house value. (This suggestion is made without knowing what you are actually looking for in your model.)

Regards
DP

1 Like

Hey there,

@Deepti_Prasad @TMosh,

I’ve made some updates to the code. Here’s the description:

  • Preprocess the data:
    • Select the target variable as median_house_value.
    • Remove unnecessary columns.
    • Perform one-hot encoding on ocean_proximity.
    • Implement IQR to remove outliers.
    • Implement linear regression.

However, it still gives a root mean squared error (RMSE) of 0.25.

Assignments (2).ipynb (261.7 KB)
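In case it helps others reading along, the preprocessing steps I described might look roughly like this (a toy sketch with assumed column names, not the actual notebook):

```python
import pandas as pd

# Tiny stand-in frame (column names taken from this thread)
df = pd.DataFrame({
    "total_rooms":        [880, 1200, 1500, 700, 50000, 1300],   # 50000 is an outlier
    "ocean_proximity":    ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND", "NEAR BAY"],
    "median_house_value": [150000, 210000, 290000, 120000, 500000, 250000],
})

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["ocean_proximity"])

# IQR rule: keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["total_rooms"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["total_rooms"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]
print(len(df))  # the outlier row is dropped
```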

Hello @Amarta_Waghani

Can you please let me know what kind of model analysis you are trying to build? Why does the median house value hold importance in your analysis?

Since you have now shifted your relative analysis from median house value to ocean proximity, I am unsure about your analysis perspective.

Regards
DP

Looking at the data set, it appears to me that it depicts the median house price within geographic areas that are identified by a central latitude/longitude point.

The other columns form the X training features.

So the goal is to predict the median house price as a function of the location and ocean proximity.

This is modeled using linear regression.

That all seems fine.

What seems to be missing is normalizing the data set before training.
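For example, standardizing every feature column before fitting (illustrative sketch only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features on very different scales (e.g. latitude vs. total_rooms)
X = np.array([[34.2,  880.0],
              [36.7, 1200.0],
              [33.9, 1500.0],
              [37.8,  700.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column: mean 0, std 1

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```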

1 Like

Hi

@TMosh I did try normalization, but unfortunately, it didn’t improve the accuracy of the model. I also experimented with different regressor models, but the results remained unsatisfactory.

@Deepti_Prasad, This assignment involves using a provided dataset to make predictions using regressor models. The target variable for prediction is the median house value.

Here is the code:
Assignments (3).ipynb (262.9 KB)

Hello @Amarta_Waghani

So based on this statement, I take it that only the target variable is important, and the number of feature variables you use is not fixed, right?

If I were you, I would first go with the simplest linear regression model (for example, finding the correlation of median house age to median house value), then, as Tom is suggesting, apply the feature variables latitude or longitude to the target variable.

I saw your updated assignment. The normalization is only being applied to the ocean proximity column, which could be one of the issues with your model.

Why not scale your features against a fixed reference value? For example, take the maximum value of median house age as a threshold and mark values above that scale as 1 and below as 0.

There are various normalisation techniques you could use: apply min-max scaling, take the log of the x features to bring them onto a common scale (helpful if you have extreme outliers), or use a z-score if you do not have any extreme outliers.

If you work on this, you will probably see some better model results.
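A toy sketch of those three options on a single column (illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([700.0, 880.0, 1200.0, 1500.0, 50000.0])  # note the extreme outlier

# Min-max scaling: squashes everything into [0, 1]
minmax = (s - s.min()) / (s.max() - s.min())

# Log transform: pulls extreme values onto a common scale
logged = np.log(s)

# Z-score: centers on the mean, unit standard deviation
zscore = (s - s.mean()) / s.std()

print(minmax.min(), minmax.max())
```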

Regards
DP

1 Like

Hi @Deepti_Prasad,

I am new to this, so I am making some mistakes.

I’m struggling to figure out how to use math with a dataset. So far, I’ve only used built-in functions. Could you explain to me how to apply math to a dataset?

Thanks!

Hi

@Deepti_Prasad, @TMosh

Here is the code link that I found on GitHub, where they work on this dataset.

To test your code, I need the data file.

I presume the “housing[1].csv” file that your notebook is importing came from the repo as “/datasets/housing/housing.csv”?

What is your reason for believing the accuracy of the model can be improved?

Maybe your results are as good as possible for that dataset. Perhaps housing prices can’t be accurately predicted using a linear combination of these features.

Are there some reference results from other people who have worked on this data set?

I recommend you test your code with a simple invented dataset that has a known simple solution (see below). Then you can say whether there is an issue in your code for the model, or whether this is just a difficult dataset to use.

= = = = =

Testing your model code using an invented dataset that has more predictable results:

You could add some code that takes your normalized data, assigns some arbitrary weight values, and computes a set of ‘y’ values you can use for training.

Your model should be able to perfectly re-create these ‘y’ values, and give the same weights as you assigned. The “r2_score” in this case should be very close to 1.0.
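A sketch of that test, assuming scikit-learn (invented data, arbitrary weights):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Stand-in for your normalized features, plus arbitrary "true" weights
X = rng.standard_normal((100, 3))
true_w = np.array([2.0, -1.5, 0.5])
true_b = 4.0
y = X @ true_w + true_b      # noiseless targets built from known weights

model = LinearRegression().fit(X, y)
print(r2_score(y, model.predict(X)))   # should be very close to 1.0
print(model.coef_)                     # should match true_w
```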