# Train error vs validation error

I have several datasets for which the error on the training data is higher than the error on the validation data when the ratio of validation set size to training set size is 0.5.
Is this a sign of a mistake somewhere, or can it legitimately happen?

Also, how do the training and validation errors generally vary as the ratio of validation set size to training set size changes?
I would be grateful for an answer,
Vasyl.

Can you share the size of the dataset you are describing? You haven't given any details about the data you are using or how you divided it between training and validation sets to arrive at the 0.5 you mention.

3. Your ratio of data between the training and validation sets
4. What you are trying to build (details about your model algorithm)

Regards
DP

Thank you very much for the answer!
Here is the dataset I am working with:
carsmall1.csv (1.8 KB)
I am practicing multivariate linear regression and want to check whether I am implementing it properly.

In this particular case I am analyzing the prediction of "Horsepower" from "MPG":
y = np.array(data["Horsepower"])
x = np.column_stack((np.array(data["MPG"]),))  # trailing comma makes a one-element tuple, so x has shape (n_samples, 1)

Here is how I split the data into train and test sets:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=0,test_size=.5)

What I get is:
w=[[-4.89933215]], b=[[224.46535001]]
RMS train: 29.19290645065467
RMS test: 24.890185673909624

RMS test remains lower than RMS train even if I add more features.
The situation may change, though, when I use a different random seed for the train/test split.
Please let me know whether I have described the case sufficiently.

Can I know why you are using multivariate regression analysis, given that you state you are predicting Horsepower from MPG?

So basically there is one dependent variable (Horsepower) and one independent variable (MPG)?

Also, when I asked about the split, I wanted to know how the data was split between training and validation sets, not the train/test split you described.

Also, can I know how you have correlated Horsepower with MPG?

As I understand it, horsepower is a major factor in a vehicle's fuel consumption: more power generally means higher fuel consumption. So what computation have you applied for this analysis?

Thank you again for your attention to my question.

Here is the code I use:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: `data` is loaded from the attached file; the original post did not show this step.
data = pd.read_csv("carsmall1.csv")

y = np.array(data["Horsepower"])
x = np.array(data["MPG"]).reshape(-1, 1)  # single feature as an (n_samples, 1) column

# test_size=.5 puts half of the rows in each set
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123, test_size=.5)

model = LinearRegression()
model.fit(x_train, y_train)
y_train1 = model.predict(x_train)
y_test1 = model.predict(x_test)
print(f"RMS train: {np.sqrt(np.mean((y_train - y_train1) ** 2))}")
print(f"RMS test : {np.sqrt(np.mean((y_test - y_test1) ** 2))}")

The question you have to answer is, is this difference statistically significant?

Hello,

From what I can understand from your code, you have taken the mean of the columns where you reshaped the MPG (you need to tell me the reason). Then you labelled them and divided them into training and validation sets, which you selected randomly (what is that test_size=0.5?).

You said you did a multivariate linear regression analysis, but based on your code it looks like simple linear regression, where one independent variable x is used to predict one dependent variable y.

After reviewing your carsmall1.csv, I noticed you have four columns: acceleration, horsepower, MPG, and weight.

Can I know what this weight is? The weight of the car?

Are you using the variables acceleration, weight, and MPG to model horsepower? If yes, then it becomes multiple regression analysis, not multivariate regression analysis. How are you trying to relate these variables? As far as I can tell from the data given, x_train is not defined correctly in your study.

Multivariate regression means you are using multiple independent variables to predict multiple dependent variables, which doesn't fit your analysis.
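
To make the terminology concrete, here is a minimal sketch of the two setups with scikit-learn's `LinearRegression`. The data is random and purely illustrative (it is not carsmall1.csv), and the names `y_single` and `Y_multi` are made up for this example. The only point is the shape of the fitted coefficients: one row of weights for a single target, one row per target when there are several.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative random data (assumption): 50 samples, 3 independent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# One dependent variable -> multiple regression.
y_single = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Two dependent variables stacked as columns -> multivariate regression.
y_second = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(scale=0.1, size=50)
Y_multi = np.column_stack([y_single, y_second])

multiple = LinearRegression().fit(X, y_single)
multivariate = LinearRegression().fit(X, Y_multi)

print(multiple.coef_.shape)      # one weight per feature
print(multivariate.coef_.shape)  # one row of weights per target
```

Predicting Horsepower alone from MPG, weight, and acceleration corresponds to the first case.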

Regards
DP

Yes, in fact I am using multiple (not multivariate) linear regression.

Yes, the weight is the weight of the car.
test_size=0.5 means that half of the data is used for training and the other half for testing.
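
A quick way to check this is to look at the sizes that `train_test_split` returns. The sketch below uses made-up placeholder arrays of 90 rows (an assumption, not the real file). Note that `test_size` is a fraction of the whole dataset, so 0.5 gives a test/train size ratio of 1, while a test/train ratio of 0.5 would correspond to `test_size` of about 1/3.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 90 rows with a single feature.
x = np.arange(90).reshape(-1, 1)
y = np.arange(90)

def split_sizes(test_size):
    """Return (n_train, n_test) for a given test_size fraction."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0
    )
    return len(x_train), len(x_test)

half = split_sizes(0.5)     # equal halves: test/train ratio of 1
third = split_sizes(1 / 3)  # test/train ratio of 0.5
print(half, third)
```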

My best regards,
Vasyl.

Yes, I see. There is not enough data here for the experiment to be statistically significant.

Actually, Vasyl, what you could do is run one analysis between only MPG and horsepower.

Then run an analysis between horsepower and MPG that also includes the weight of the car.

Then run an analysis between horsepower and MPG that includes both acceleration and the weight of the car.

See how the results differ across the three analyses, then do a regression analysis using the best-fitting set of factors. In your case I do feel acceleration matters when it comes to horsepower and MPG (I am not sure about the weight of the car, as I know very little about automobiles).
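
One possible way to run the three analyses side by side is sketched below. It uses 5-fold cross-validation rather than a single split, which is my own choice here and not something taken from the thread, and the data is synthetic: the column names match the CSV, but the values and the coefficients used to generate Horsepower are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): NOT the values from carsmall1.csv.
rng = np.random.default_rng(1)
n = 100
cols = {
    "MPG": rng.uniform(10, 45, n),
    "Weight": rng.uniform(1800, 4800, n),
    "Acceleration": rng.uniform(8, 22, n),
}
# Invented relationship, used only so the example has something to fit.
horsepower = (150 - 2.0 * cols["MPG"] + 0.02 * cols["Weight"]
              - 4.0 * cols["Acceleration"] + rng.normal(0, 15, n))

feature_sets = [
    ["MPG"],
    ["MPG", "Weight"],
    ["MPG", "Weight", "Acceleration"],
]

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for names in feature_sets:
    X = np.column_stack([cols[name] for name in names])
    # The scorer returns negative RMSE, so flip the sign.
    scores = cross_val_score(LinearRegression(), X, horsepower,
                             cv=cv, scoring="neg_root_mean_squared_error")
    cv_rmse[tuple(names)] = float(-scores.mean())
    print(f"{' + '.join(names):32s} CV RMSE: {cv_rmse[tuple(names)]:.2f}")
```

With the real file, the `cols` dictionary would be replaced by the corresponding columns of the loaded data.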

You can try this in your model algorithm to check how your model is doing (I just came across this module)

Another issue: do not split the data 50-50; rather, go for 80-20 or 70-30 between training and validation.
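
To see how the split ratio affects both errors, one can average the train and test RMSE over many random splits for each `test_size`. The sketch below does this on synthetic data; the slope, intercept, and noise level are only loosely modelled on the numbers posted earlier and are assumptions, as is the choice of 100 repetitions. With less training data one would generally expect the average test error to rise and the average train error to fall, but the exact numbers depend on the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data (assumption): a noisy straight line, not carsmall1.csv.
rng = np.random.default_rng(42)
n = 100
x = rng.uniform(10, 45, size=(n, 1))
y = 225 - 4.9 * x[:, 0] + rng.normal(0, 28, size=n)

results = {}
for test_size in (0.2, 0.3, 0.5, 0.7, 0.8):
    train_err, test_err = [], []
    for seed in range(100):  # average over many random splits
        x_tr, x_te, y_tr, y_te = train_test_split(
            x, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(x_tr, y_tr)
        train_err.append(rmse(y_tr, model.predict(x_tr)))
        test_err.append(rmse(y_te, model.predict(x_te)))
    results[test_size] = (float(np.mean(train_err)), float(np.mean(test_err)))
    ratio = test_size / (1 - test_size)  # validation size / train size
    print(f"test_size={test_size:.1f} (val/train={ratio:.2f}): "
          f"train RMSE={results[test_size][0]:.2f}, "
          f"test RMSE={results[test_size][1]:.2f}")
```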

Also, as Tom mentioned, there should be enough data for statistical analysis. You can always try to get more data from the same source you used for the current data.

Regards
DP

That's not what I meant. I suggested that maybe the difference between those two cost values wasn't significant.

Yes; however, I was a little disturbed that the training error was higher than the test error.

Small differences don't matter. If you shuffled the data set and split it into different training and test sets, you'd get different results every time.

You only learn whether a "small difference" is an "important difference" as you gain experience.
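
A rough way to see this without waiting for experience is to repeat the 50/50 split with many different seeds and look at how the gap between test and train RMSE is distributed. The sketch below does that on synthetic data whose slope, intercept, and noise level are loosely based on the numbers posted above; those values, the 100-row size, and the 200 repetitions are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in data (assumption): NOT the real carsmall1.csv values.
rng = np.random.default_rng(0)
n = 100
mpg = rng.uniform(10, 45, size=n)
horsepower = 225 - 4.9 * mpg + rng.normal(0, 28, size=n)
x = mpg.reshape(-1, 1)

gaps = []
for seed in range(200):
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, horsepower, test_size=0.5, random_state=seed
    )
    model = LinearRegression().fit(x_tr, y_tr)
    # Positive gap: test error above train error; negative: the reverse.
    gaps.append(rmse(y_te, model.predict(x_te)) - rmse(y_tr, model.predict(x_tr)))

gaps = np.array(gaps)
print(f"mean gap (test - train): {gaps.mean():.2f}")
print(f"std of gap across splits: {gaps.std():.2f}")
print(f"share of splits with test RMSE below train RMSE: {(gaps < 0).mean():.2f}")
```

If a sizeable share of the splits shows test RMSE below train RMSE, a single split where that happens is probably not a sign of a bug.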

Thank you!

Thank you for the attention to my question!