Could you please give me your advice?
I have several datasets for which the error on the training data is higher than the error on the validation data when the ratio of validation data size to training data size is 0.5.
Could this indicate a mistake, or can this legitimately happen?
Also, how do the training and validation errors generally vary as the ratio of validation data size to training data size changes?
I would be grateful for an answer,
Vasyl.
Can you share the size of the dataset you are mentioning? You haven't given any detail about the data you are using or how you divided it between training and validation data to get that error of 0.5.
Please give details about:
what your analysis is about
your dataset
your ratio of data between the training and validation sets
what you are trying to build (details about your model algorithm)
how your model is fitted, e.g. which optimizer is used
Thank you very much for the answer!
Here is the dataset I am working with: carsmall1.csv (1.8 KB)
I am practicing multivariate linear regression and want to check whether I am implementing it properly.
In this particular case I am predicting 'Horsepower' from 'MPG':
y = np.array(data['Horsepower'])
x = np.column_stack((np.array(data['MPG']),))  # 1-tuple, so x has shape (n, 1)
Here is the partition to train/test data:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=0,test_size=.5)
What I get is:
w=[[-4.89933215]], b=[[224.46535001]]
RMS train: 29.19290645065467
RMS test: 24.890185673909624
RMS test remains lower than RMS train even if I add additional features, although the situation may change when I use another random seed for the train/test splitter.
Please kindly let me know if I have described the case sufficiently.
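For readers following along, here is a minimal sketch of how figures like the ones above can be produced. The data is a synthetic stand-in (carsmall1.csv is not reproduced here), and the `rmse` helper and the coefficients used to generate the data are assumptions made purely for illustration, so the printed numbers will not match the ones in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for carsmall1.csv (NOT the real data): a noisy,
# roughly linear relation between MPG and Horsepower.
rng = np.random.default_rng(0)
mpg = rng.uniform(10, 45, size=100)
horsepower = 220 - 5.0 * mpg + rng.normal(0, 25, size=100)

x = mpg.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = horsepower

# Same split settings as in the post: half the rows go to the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=0, test_size=0.5
)

model = LinearRegression().fit(x_train, y_train)


def rmse(fitted_model, features, targets):
    """Root-mean-square error of the model's predictions (hypothetical helper)."""
    residuals = targets - fitted_model.predict(features)
    return float(np.sqrt(np.mean(residuals ** 2)))


rms_train = rmse(model, x_train, y_train)
rms_test = rmse(model, x_test, y_test)

print(f"w={model.coef_}, b={model.intercept_}")
print(f"RMS train: {rms_train}")
print(f"RMS test: {rms_test}")
```

Changing `random_state` in this sketch is a quick way to see whether the ordering of the two RMS values is stable or depends on the particular split.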
May I ask why you are using multivariate regression analysis, given that you say you are predicting Horsepower from MPG?
So basically there is one dependent variable (Horsepower) and one independent variable (MPG)?
Also, when I asked about the split, I wanted to know how the data is split between training and validation data, not the train/test split you described.
So I am a little confused by this incomplete information.
Also, may I know how you have related Horsepower to MPG?
As I understand it, horsepower is a major factor in a vehicle's fuel consumption: more power generally means higher fuel consumption. So what computation have you applied for this analysis?
Thank you again for your attention to my question.
Here is the code I use:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv('carsmall1.csv')
y = np.array(data['Horsepower'])
x = np.array(data['MPG']).reshape(-1, 1)
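As a side note on the last line and on the test_size=0.5 used in the earlier split, here is a minimal sketch of what those two calls do. The four MPG and Horsepower values are made up for illustration and are not taken from carsmall1.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up values for illustration only (not from carsmall1.csv).
mpg = np.array([18.0, 15.0, 36.0, 26.0])
horsepower = np.array([130.0, 165.0, 58.0, 90.0])

# reshape(-1, 1) turns a 1-D array of n values into an (n, 1) column.
# The values themselves are unchanged; no averaging takes place.
x = mpg.reshape(-1, 1)
print(mpg.shape)  # (4,)
print(x.shape)    # (4, 1)
values_unchanged = np.array_equal(x.ravel(), mpg)
print(values_unchanged)  # True

# test_size=0.5 asks for half of the rows to be held out; the rows
# are shuffled first, and random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    x, horsepower, random_state=0, test_size=0.5
)
print(len(x_train), len(x_test))  # 2 2
```

The column shape is needed because scikit-learn estimators expect the features as a 2-D array with one row per sample and one column per feature.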
From what I can understand from your code, you have taken the mean of the columns where you reshaped the MPG (you need to tell me the reason). Then you labelled them and divided them into training and validation sets, which you selected randomly (what is that test_size=0.5?).
You said you did a multivariate linear regression analysis, but based on your code it looks like simple linear regression, where one independent variable x is used to predict one dependent variable y.
After reviewing your carsmall1.csv, I noticed you have 4 columns: acceleration, horsepower, MPG and weight.
May I know what this weight is? The weight of the car?
Are you using the variables acceleration, weight and MPG to define a correlation with horsepower? If yes, then it becomes multiple regression analysis, not multivariate regression analysis. How are you trying to relate these variables? That x_train is not defined correctly in your study, given the data.
Multivariate regression means you are using multiple independent variables to predict multiple dependent variables, which doesn't fit your analysis.
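To make the terminology concrete, here is a small sketch contrasting the two cases with scikit-learn's LinearRegression. The data is randomly generated, and the variable names (`y_single`, `y_multi`, and so on) are placeholders chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Randomly generated data for illustration only.
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 3))  # three independent variables

# Multiple regression: several predictors, ONE dependent variable.
y_single = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=n)
multiple = LinearRegression().fit(X, y_single)
print(multiple.coef_.shape)  # (3,)

# Multivariate regression: several predictors, SEVERAL dependent variables.
y_second = X @ np.array([0.0, 1.0, 1.0]) + rng.normal(0, 0.1, size=n)
y_multi = np.column_stack([y_single, y_second])  # shape (n, 2)
multivariate = LinearRegression().fit(X, y_multi)
print(multivariate.coef_.shape)  # (2, 3)
```

Predicting Horsepower alone from MPG, weight and acceleration corresponds to the first case.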
Actually, Vasyl, what you could do is first create an analysis between only MPG and horsepower.
Then create an analysis between horsepower and MPG that also includes the weight of the car.
Then create an analysis between horsepower and MPG that includes both acceleration and the weight of the car.
See how the three analyses differ, then do a regression analysis using the best-fitting factors. In your case I feel acceleration does matter when it comes to horsepower and MPG (I am not sure about the weight of the car, as I know very little about automobiles).
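The three analyses suggested above can be sketched as follows. The DataFrame is synthetic: only the column names mirror carsmall1.csv, while the values and the relationships between the columns are invented for illustration. The 70-30 split and the `rmse` helper are likewise choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with the same column names as carsmall1.csv (values invented).
rng = np.random.default_rng(1)
n = 100
weight = rng.uniform(1800, 4800, size=n)
acceleration = rng.uniform(8, 22, size=n)
horsepower = 40 + 0.035 * weight - 3.0 * acceleration + rng.normal(0, 12, size=n)
mpg = 50 - 0.006 * weight - 0.05 * horsepower + rng.normal(0, 2, size=n)
data = pd.DataFrame({
    "Acceleration": acceleration, "Horsepower": horsepower,
    "MPG": mpg, "Weight": weight,
})

feature_sets = {
    "MPG": ["MPG"],
    "MPG + Weight": ["MPG", "Weight"],
    "MPG + Weight + Acceleration": ["MPG", "Weight", "Acceleration"],
}


def rmse(y_true, y_pred):
    """Root-mean-square error (hypothetical helper)."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - y_pred) ** 2)))


# One fixed split so that the three models are compared on the same rows.
train, test = train_test_split(data, test_size=0.3, random_state=0)

results = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(train[cols], train["Horsepower"])
    results[name] = (
        rmse(train["Horsepower"], model.predict(train[cols])),
        rmse(test["Horsepower"], model.predict(test[cols])),
    )
    print(f"{name}: train RMSE={results[name][0]:.2f}, test RMSE={results[name][1]:.2f}")
```

With ordinary least squares the training RMSE cannot increase when a feature is added to a nested model, so the held-out RMSE is the more informative column when deciding which features help.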
You can try this in your model algorithm to check how your model is doing (I just came across this module)
Another issue: do not split the data 50-50; rather, go for 80-20 or 70-30 between training and validation.
Also, as Tom mentioned, the dataset should be large enough for statistical analysis. You can always try to get more data from the same source your current data came from.
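To see how the split ratio affects the two errors, one option is to repeat the split many times for each ratio and look at the average and the spread. This is a rough sketch on synthetic data (not carsmall1.csv); the helper names `rmse` and `summarize`, the number of repeats and the data-generating coefficients are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT carsmall1.csv).
rng = np.random.default_rng(3)
n = 100
x = rng.uniform(10, 45, size=(n, 1))
y = 220 - 5.0 * x[:, 0] + rng.normal(0, 25, size=n)


def rmse(y_true, y_pred):
    """Root-mean-square error (hypothetical helper)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def summarize(test_size, n_repeats=200):
    """Mean and std of train/validation RMSE over repeated random splits."""
    train_errs, val_errs = [], []
    for seed in range(n_repeats):
        x_tr, x_val, y_tr, y_val = train_test_split(
            x, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(x_tr, y_tr)
        train_errs.append(rmse(y_tr, model.predict(x_tr)))
        val_errs.append(rmse(y_val, model.predict(x_val)))
    return {
        "train_mean": float(np.mean(train_errs)),
        "train_std": float(np.std(train_errs)),
        "val_mean": float(np.mean(val_errs)),
        "val_std": float(np.std(val_errs)),
    }


summary = {ts: summarize(ts) for ts in (0.2, 0.3, 0.5)}
for ts, s in summary.items():
    print(
        f"validation share {ts:.1f}: "
        f"train {s['train_mean']:.2f} +/- {s['train_std']:.2f}, "
        f"validation {s['val_mean']:.2f} +/- {s['val_std']:.2f}"
    )
```

With a small validation share the validation RMSE is computed from few points, so it tends to fluctuate more from split to split; with a large validation share fewer points are left for fitting the model.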
Small differences don't matter. If you shuffled the data set and split it into different training and test sets, you'd get different results every time.
You only learn whether a "small difference" is an "important difference" as you gain experience.
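One way to get a feel for this is to shuffle and split many times and count how often the test error comes out below the training error. The sketch below uses scikit-learn's ShuffleSplit and cross_validate on synthetic data (not carsmall1.csv); the number of splits and the data-generating coefficients are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in data (NOT carsmall1.csv).
rng = np.random.default_rng(2)
n = 100
x = rng.uniform(10, 45, size=(n, 1))
y = 220 - 5.0 * x[:, 0] + rng.normal(0, 25, size=n)

# 200 independent shuffles, each holding out half of the rows.
splitter = ShuffleSplit(n_splits=200, test_size=0.5, random_state=0)
scores = cross_validate(
    LinearRegression(), x, y,
    cv=splitter,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

# The scorer returns negated RMSE, so flip the sign back.
train_rmse = -scores["train_score"]
test_rmse = -scores["test_score"]
gap = test_rmse - train_rmse

print(f"mean gap (test - train): {gap.mean():.2f}")
print(f"std of gap across splits: {gap.std():.2f}")
print(f"share of splits with test RMSE below train RMSE: {(gap < 0).mean():.0%}")
```

If the gap seen on a single split is small compared with the spread of gaps across splits, it is probably just an effect of which rows landed in which set.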