# Optional Lab: Linear Regression using Scikit-Learn

Hi all,

Here is a code segment from the lab that I am having trouble understanding.
How does scikit-learn know which columns to assign to X_train and y_train?
Why is X_features defined in the code? I do not see any place where it is used later.
I also ran the code without the X_features variable, and it worked and produced the same results.

X_train, y_train = load_house_data()

X_features = ['size(sqft)','bedrooms','floors','age']

scaler = StandardScaler()

X_norm = scaler.fit_transform(X_train)

sgdr = SGDRegressor(max_iter=1000)

sgdr.fit(X_norm, y_train)

b_norm = sgdr.intercept_

w_norm = sgdr.coef_

y_pred_sgd = sgdr.predict(X_norm)

make a prediction using w,b.

y_pred = np.dot(X_norm, w_norm) + b_norm

Hi @mehmet_baki_deniz

scikit-learn does not decide this. The utility function load_house_data() returns two outputs: the first is assigned to X_train and the second to y_train. You can look at the source code of load_house_data() by clicking File -> Open -> lab_utils_multi.py. The function is located near the bottom of the file.
You can see that lab_utils_multi.py is imported in the import statements at the top of this notebook.
X_features is a variable used for the visual display in the Plot results section; it is not used for training the model. You can see how it is used at the end of the notebook, where the predictions and targets are plotted against the original features.
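
To illustrate the point about the two return values, here is a minimal sketch. Note that `load_toy_house_data()` and its numbers are made up for illustration; this is not the actual code in lab_utils_multi.py, only an assumption about the general pattern (split an array into feature columns and a target column, then return both).

```python
import numpy as np

def load_toy_house_data():
    """Hypothetical stand-in for load_house_data(); not the lab's real code."""
    # Made-up rows: size(sqft), bedrooms, floors, age, price
    data = np.array([
        [952.0, 2.0, 1.0, 65.0, 271.5],
        [1244.0, 3.0, 1.0, 64.0, 300.0],
        [1947.0, 3.0, 2.0, 17.0, 509.8],
    ])
    X = data[:, :4]  # every column except the last one: the features
    y = data[:, 4]   # the last column: the target
    return X, y      # returned together as a tuple (features, targets)

# Python unpacks the returned tuple by position:
# the first element goes to X_train, the second to y_train.
X_train, y_train = load_toy_house_data()
print(X_train.shape, y_train.shape)
```

So the assignment is decided by the order of the values in the `return` statement, not by scikit-learn.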


Thank you very much for the response.

In the code segment you provided, load_house_data() is a function that loads the training data for the house price prediction problem. The function returns a tuple containing two arrays, which Python unpacks by position into X_train and y_train. X_train is a 2D array of features (one row per house, one column per feature) and y_train is a 1D array of target values.

The X_features variable is a list of strings holding the feature names. It is not used anywhere in the snippet you posted, so it has no effect on training or prediction, which is why removing it gives the same results. As mentioned above, the lab only needs it later to label the plots.

The StandardScaler class from scikit-learn normalizes the training data by subtracting the mean and dividing by the standard deviation of each feature. This does not change the shape of each feature's distribution; it puts all features on a comparable scale, which helps gradient descent converge faster. The fit_transform() method fits the scaler to the training data (computes each column's mean and standard deviation) and then transforms the data, so that each feature has a mean of 0 and a standard deviation of 1. The transformed data is stored in the X_norm variable.
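
If it helps, here is a small sketch that compares StandardScaler with the same calculation done by hand. The array `X` holds made-up values, not the lab's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: size(sqft), bedrooms, floors, age
X = np.array([
    [952.0, 2.0, 1.0, 65.0],
    [1244.0, 3.0, 1.0, 64.0],
    [1947.0, 3.0, 2.0, 17.0],
    [1725.0, 4.0, 2.0, 42.0],
])

scaler = StandardScaler()
X_norm = scaler.fit_transform(X)  # learn per-column mean/std, then scale

# The same z-score normalization written out manually, column by column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_norm, X_manual))          # do both approaches agree?
print(np.allclose(X_norm.mean(axis=0), 0.0))  # is each column mean ~0?
print(np.allclose(X_norm.std(axis=0), 1.0))   # is each column std ~1?
```

The per-column statistics that were learned are kept in `scaler.mean_` and `scaler.scale_`, so the same scaling can be applied to new data with `scaler.transform()`.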

The SGDRegressor class from scikit-learn fits a linear regression model using stochastic gradient descent (SGD) on the normalized training data. The max_iter parameter sets the maximum number of passes over the training data (epochs); training can stop earlier if the stopping criterion is met. The fit() method trains the model, and the intercept_ and coef_ attributes hold the learned intercept and coefficients, respectively. The predict() method makes predictions, here on the same normalized training data, and the predictions are stored in the y_pred_sgd variable.
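
As a rough sketch of these attributes, the example below trains on synthetic data (randomly generated for illustration, not the house data). `random_state=0` is added here only so that repeated runs behave the same; the lab snippet does not set it.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 examples, 3 features, noise-free linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 10.0

X_norm = StandardScaler().fit_transform(X)

sgdr = SGDRegressor(max_iter=1000, random_state=0)
sgdr.fit(X_norm, y)

print("epochs actually run:", sgdr.n_iter_)      # at most max_iter
print("coef_ shape:", sgdr.coef_.shape)          # one weight per feature
print("intercept_ shape:", sgdr.intercept_.shape)
```

Note that the learned weights refer to the normalized features, which is why the lab names them w_norm and b_norm.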

Finally, the dot product of the normalized training data and the model coefficients is computed, and the intercept is added to the result. This is the same calculation that predict() performs for a linear model, so the result stored in y_pred should match y_pred_sgd.
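
You can check this equivalence yourself. The sketch below again uses synthetic data (an assumption for illustration) but keeps the variable names from your snippet.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 100, size=(100, 4))
y_train = X_train @ np.array([1.5, -2.0, 0.3, 4.0]) + 5.0

X_norm = StandardScaler().fit_transform(X_train)
sgdr = SGDRegressor(max_iter=1000, random_state=1)
sgdr.fit(X_norm, y_train)

b_norm = sgdr.intercept_
w_norm = sgdr.coef_

# Prediction through the scikit-learn API
y_pred_sgd = sgdr.predict(X_norm)

# The same prediction computed by hand: X . w + b
y_pred = np.dot(X_norm, w_norm) + b_norm

print(np.allclose(y_pred, y_pred_sgd))  # do the two predictions agree?
```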


Thank you very much for your detailed response.