Hi there,
I am currently working on a project estimating energy consumption for buildings in London, so it is a regression task. I have a question about how to split the data. Usually we would use one of these two:
1. train_test_split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
2. TimeSeriesSplit:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.randn(12, 2)
y = np.random.randint(0, 2, 12)
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
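From the docs, my understanding is that each fold keeps chronological order, consumed in a loop something like this (a minimal sketch using the toy X above):

for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # in every fold the training indices precede the test indices in time
    print(f"Fold {i}: train={train_idx}, test={test_idx}")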
My initial dataset has about 1 million rows, yet I am training on only 10k rows sampled randomly, using something like this:
sample_df = df.sample(n=10000, random_state=42)
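If the time order matters, I assume I would then have to re-sort the sample chronologically before any time-based split ("year" here is just a placeholder for my actual timestamp column):

# restore chronological order after random sampling (hypothetical "year" column)
sample_df = sample_df.sort_values("year").reset_index(drop=True)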
My first question: if option 2 (TimeSeriesSplit) is preferable because the data is yearly/sequential, will my randomly selected 10k rows affect model performance?
My second question is about data normalization. My current understanding (sketch after the list):
- min-max scaling for neural networks
- Z-score (i.e. StandardScaler) for SVR, XGBoost, and MLR
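For concreteness, this is roughly how I am applying the two scalers (a minimal sketch; X_train and X_test stand in for my real feature matrices from the split above, and the scaler is fit on training data only):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# min-max for the neural network: rescale each feature to [0, 1]
mm = MinMaxScaler()
X_train_nn = mm.fit_transform(X_train)  # fit on training data only
X_test_nn = mm.transform(X_test)        # reuse the training statistics

# z-score for SVR / XGBoost / MLR: zero mean, unit variance per feature
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)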
Have I defined this correctly?