My first logistic regression model gives bad accuracy. Please check

Amarta_Waghani · January 31, 2024, 1:37pm

Hey everyone,

I need some help. I’ve developed a model trained on a housing dataset using logistic regression. However, the accuracy I’m achieving is only 4.96%. Can you help me figure out why it’s not doing better?

Here is the code:
Assignments (1).ipynb (364.0 KB)

Dataset:
housing[1].csv (1.1 MB)

chuaal · January 31, 2024, 2:25pm

Is this supposed to be a linear regression problem instead of a classification problem? Just my casual guess. I think I have seen this problem somewhere before.

TMosh · January 31, 2024, 4:43pm

From looking at your notebook, you’re trying to predict the “median house value”. That’s a real number. So logistic regression is the wrong method.

Deepti_Prasad · January 31, 2024, 4:51pm

Hello @Amarta_Waghani

Based on the pictures @TMosh shared, you didn’t show any relation between the features and your target variable.

Choosing axis=1 with df.drop basically deletes the whole column.

Regards
DP

Amarta_Waghani · February 1, 2024, 6:07pm

@chuaal, In this dataset, there’s a column containing categorical values, which is why logistic regression is being used instead of linear regression.

TMosh · February 1, 2024, 6:22pm

Having categorical features does not require you to use logistic regression. Categorical features are usually converted to one-hot true (1) / false (0) values.

The key difference between linear and logistic regression is what is being predicted.

If the output is a real value, then it’s linear regression.
If the output is true/false or a classification, then it’s logistic regression.
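A minimal sketch of both points, using invented data rather than the actual notebook: the prices and the `ocean_proximity` values below are hypothetical stand-ins. It shows that a continuous target has almost as many distinct values as samples, which is why a classifier scores so poorly on it, and that a categorical feature can be one-hot encoded without changing the type of problem.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for median_house_value: 1,000 continuous prices,
# rounded to the nearest hundred.
prices = rng.uniform(50_000, 500_000, size=1_000).round(-2)

# A classifier would treat every distinct price as its own class.
n_classes = np.unique(prices).size
print(f"{n_classes} distinct 'classes' for {prices.size} samples")

# A categorical feature becomes 0/1 columns; the target stays continuous,
# so the problem is still regression.
df = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]}
)
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=int)
print(encoded)
```

With that many classes and only a handful of samples per class, a very low classification accuracy is expected, independent of any bug in the code.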

Amarta_Waghani · February 6, 2024, 2:29pm

Hi @Deepti_Prasad

In that @TMosh picture, I separated the feature variables by removing the target variable from the dataset. I assigned these features to the ‘x’ variable. On the other hand, I assigned the target variable ‘median_house_value’ to the ‘y’ variable.
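For readers following along, the split described above could look roughly like this. The three rows are made up for illustration, and only the column names are taken from the housing dataset.

```python
import pandas as pd

# Tiny hypothetical sample; the real notebook loads the full CSV instead.
df = pd.DataFrame({
    "housing_median_age": [41, 21, 52],
    "total_rooms": [880, 7099, 1467],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

# axis=1 tells drop to remove a column rather than rows.
# drop returns a new DataFrame, so df itself keeps the target column.
x = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

print(x.columns.tolist())  # feature columns only
print(y.dtype)             # float64, i.e. a continuous target
```

The float dtype of `y` is the quickest hint that this is a regression target and not a set of class labels.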

Deepti_Prasad · February 6, 2024, 4:25pm

@Amarta_Waghani

Can I know how the housing prices change with the median house value variable?

Also, by “removing the target variable from the dataset”, do you mean you are extracting only that particular categorical value from the dataset?
If the answer to the above question is yes, then you do not need to use df.drop. Instead, inspect the data with df.head and select the defined column (make sure you have removed any null values), then check the relation of that column to the housing price.

Regards
DP

Amarta_Waghani · February 7, 2024, 12:20pm

@Deepti_Prasad,

I believe you might not have seen the entire code. Could you please review the code and let me know what mistakes I might have made?

Thanks!

TMosh · February 7, 2024, 11:13pm

I looked at your notebook again, and the key error is here:

Since you are trying to predict the median_house_value, which is a real floating-point number, you need to use linear regression.

Not logistic regression - that is used for predicting classifications.

Deepti_Prasad · February 9, 2024, 7:51am

Hello @Amarta_Waghani,

I checked your notebook and dataset. Can you explain what kind of model you are trying to create? Nowhere in the post did you explain what kind of correlation you are trying to establish with your model.

You have used latitude (which is a negative variable) and total rooms to get a median housing value. This was done incorrectly, as I cannot see you creating any relation between these variables, other than a graph showing that latitude would not be the right variable to get median housing value.

Next, as @TMosh mentioned, your target calls for linear regression but you have built a logistic regression model, which is creating all the issues.

So, kindly first brief us on what kind of model you are trying to create, based on which features, or what you are trying to analyse.

In case you are doing a regression analysis between median housing value and total rooms, then try to find what the relation between the two is.

My suggestion would be to relate median housing age and total rooms to median housing value. (This suggestion is made without knowing what you are basically looking for in your model.)

Regards
DP

Amarta_Waghani · February 14, 2024, 11:53am

Hey there,

@Deepti_Prasad @TMosh,

I’ve made some updates to the code. Here’s the description:

Preprocess the data:
- Select the target variable as median_house_value.
- Remove unnecessary columns.
- Perform one-hot encoding on ocean_proximity.
- Implement IQR to remove outliers.
- Implement linear regression.
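The IQR step in the list above is commonly done with the 1.5 × IQR rule. Here is a small sketch on made-up numbers; the values and the 1.5 multiplier are assumptions, not taken from the notebook.

```python
import numpy as np

# Hypothetical values with one obvious outlier at the end.
values = np.array([120.0, 135.0, 150.0, 160.0, 175.0, 180.0, 195.0, 900.0])

# Interquartile range: distance between the 25th and 75th percentiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Keep only points within 1.5 * IQR of the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
mask = (values >= lower) & (values <= upper)
filtered = values[mask]

print(f"kept {filtered.size} of {values.size} values")
```

When this is applied to a DataFrame, the same boolean mask has to filter the features and the target together so the rows stay aligned.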

However, it still gives a root mean square prediction of 0.25.

Assignments (2).ipynb (261.7 KB)

Deepti_Prasad · February 14, 2024, 2:34pm

Hello @Amarta_Waghani

Can you please let me know what kind of model analysis you are trying to build? Why does median house value hold importance in your analysis?

As I can see, you have now changed your analysis from median house value to ocean proximity, so I am unsure about your analysis perspective.

Regards
DP

TMosh · February 14, 2024, 10:47pm

Looking at the data set, it appears to me that it depicts the median house price within geographic areas that are identified by a central latitude/longitude point.

The other columns create the X training features as shown below.

So the goal is to predict the median house price as a function of the location and ocean proximity.

This is modeled using linear regression.

That all seems fine.

What seems to be missing is normalizing the data set before training.

Amarta_Waghani · February 24, 2024, 5:26pm

Hi

@TMosh I did try normalization, but unfortunately, it didn’t improve the accuracy of the model. I also experimented with different regressor models, but the results remained unsatisfactory.

@Deepti_Prasad, This assignment involves using a provided dataset to make predictions using regressor models. The target variable for prediction is the median house value.

Here is the code:
Assignments (3).ipynb (262.9 KB)

Deepti_Prasad · February 25, 2024, 8:48am

Hello @Amarta_Waghani

So based on this statement, can I assume that only the target variable matters, and that the number of feature variables you use is not mandated?

If I were you, I would first go with the simplest linear regression model (for example, finding the correlation of median house age to median house value), then move on, as Tom suggests, to applying the latitude and longitude features to the target variable.

I saw your updated assignment. The normalization is only being applied to the ocean proximity column, which could be one of the issues with your model.

Why not scale your features against a fixed reference value, for example the maximum value of median house age, so that values above that threshold are marked as 1 and values below it as 0?

There are various normalisation techniques you could use: min-max scaling (if you have extreme outliers), a log transform of the x features to bring them onto a common scale, or a z-score if you do not have any extreme outliers.
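The three techniques can be sketched with plain NumPy as below. The feature matrix is invented for illustration (two numeric columns on very different scales), and each transform is applied to every numeric column, not just one.

```python
import numpy as np

# Hypothetical numeric features: column 0 is an age, column 1 a room count.
X = np.array([
    [41.0, 880.0],
    [21.0, 7099.0],
    [52.0, 1467.0],
    [30.0, 3000.0],
])

# Min-max scaling: each column is mapped onto the range [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Z-score: each column ends up with mean 0 and standard deviation 1.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# Log transform: compresses large values; log1p also handles zeros.
X_log = np.log1p(X)

print(X_minmax.round(2))
print(X_z.round(2))
print(X_log.round(2))
```

Two hedged notes: min-max scaling is itself sensitive to extreme values, so it usually works best after outliers have been handled, and the scaling statistics should be computed on the training split only and then reused on the test split.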

If you work on this, you will probably see some better model results.

Regards
DP

Amarta_Waghani · February 26, 2024, 12:19pm

Hi @Deepti_Prasad,

I am new to this, so I am making some mistakes.

I’m struggling to figure out how to use math with a dataset. So far, I’ve only used built-in functions. Could you explain to me how to apply math to a dataset?

Thanks!

Amarta_Waghani · February 26, 2024, 3:13pm

Hi

@Deepti_Prasad, @TMosh

Here is the code link that I found on GitHub, where they work on this dataset.

github.com

ellieflgr/CaliforniaHousingPricesML/blob/master/.ipynb_checkpoints/California Housing Prices-checkpoint.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>About this project</h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an educational project workthrough as presented in chapter 2 of Aurelien Gérons' book *Hands On Machine Learning with Scikit-Learn, Keras and Tensor Flow*. \n",
    "\n",
    "It is based on the often-used *California Housing Prices* dataset, which contains California census data from 1990. The scenario from the book is that a company is unhappy with manually estimating housing prices - in order to avoid this effort in the future, they hire a team that is supposed to solve this problem via Machine Learning, and \"we\" are part of that team.\n",
    "\n",
    "I decided to use this opportunity of thoroughly working through a whole project, so instead of just reading it or copying and pasting I typed it down myself to get as much out of it as possible. \n",
    "\n",
    "Thus, it contains plenty of comments, explanations, small excursions and reflections for personal documentation and clarification. Towards the end of the project, you can find my very own experiments with the data starting in section [Experiments](#Experiments).\n",

This file has been truncated.

TMosh · February 26, 2024, 6:17pm

To test your code, I need the data file.

I presume the “housing[1].csv” file that your notebook is importing came from the repo as “/datasets/housing/housing.csv”?

TMosh · February 26, 2024, 6:56pm

What is your reason for believing the accuracy of the model can be improved?

Maybe your results are as good as possible for that dataset. Perhaps housing prices can’t be accurately predicted using a linear combination of these features.

Are there some reference results from other people who have worked on this data set?

I recommend you test your code with a simple invented dataset that has a known simple solution (see below). Then you can say whether there is an issue in your code for the model, or whether this is just a difficult dataset to use.

= = = = =

Testing your model code using an invented dataset that has more predictable results:

You could add some code that takes your normalized data, assigns some arbitrary weight values, and computes a set of ‘y’ values you can use for training.

Your model should be able to perfectly re-create these ‘y’ values, and give the same weights as you assigned. The “r2_score” in this case should be very close to 1.0.
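A rough sketch of that sanity check, under a few assumptions: the features are random numbers standing in for the normalized data, the weights and bias are arbitrary, and `np.linalg.lstsq` stands in for whatever model code is being tested. R² is computed by hand here; scikit-learn's `r2_score` should give the same number.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the normalized feature matrix.
X = rng.normal(size=(200, 3))

# Arbitrary "known" parameters used to invent a noise-free target.
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y = X @ true_w + true_b

# Fit a linear model by least squares; the column of ones is the intercept.
X_aug = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

# R^2 = 1 - (residual sum of squares / total sum of squares).
y_pred = X_aug @ coef
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("recovered weights:", w_hat.round(4), "bias:", round(float(b_hat), 4))
print("R^2:", r2)
```

If the model under test cannot recover the assigned weights on data like this, the problem is in the code. If it can, the remaining error on the real dataset is more likely a limit of fitting a linear model to these features.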
