Estimations for Energy Consumption in Domestic and Public Buildings in London Using Machine Learning

Hi there,

I am currently doing a project on estimating energy consumption for buildings in London, so it is a regression task. I have a question about how to split the data. Usually we would use either

  1. from sklearn.model_selection import train_test_split
     X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.33, random_state=42)

  2. import numpy as np
     from sklearn.model_selection import TimeSeriesSplit
     X = np.random.randn(12, 2)
     y = np.random.randn(12)  # continuous target, since this is regression
     tscv = TimeSeriesSplit(n_splits=3, test_size=2)

My initial data is 1 million rows, yet I am training on only 10k rows sampled randomly, using something like:
sample_df = df.sample(n=10000, random_state=42)

My question is: if option 2 is preferable because this is yearly/sequential data, will my randomly selected 10k rows affect model performance?
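One thing to be aware of: df.sample() scrambles row order, and TimeSeriesSplit splits by position, so a sampled frame must be re-sorted by time first. A minimal sketch, using made-up data and assumed column names ("timestamp", "energy_kwh" are hypothetical, not from your dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame standing in for the real data; column names are assumptions.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "timestamp": pd.date_range("2015-04-01", periods=100, freq="D"),
    "energy_kwh": rng.normal(200.0, 25.0, 100),
})

# A random sample scrambles the chronological order ...
sample_df = df.sample(n=20, random_state=42)
# ... so re-sort by time before any time-aware split.
sample_df = sample_df.sort_values("timestamp").reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(sample_df):
    # Every training timestamp now precedes every test timestamp.
    assert (sample_df.loc[train_idx, "timestamp"].max()
            < sample_df.loc[test_idx, "timestamp"].min())
```

Without the sort_values step, the "train on the past, test on the future" guarantee of TimeSeriesSplit is silently lost.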

My second question is about data normalization:

  1. min-max for NN
  2. z-score (StandardScaler) for SVR, XGBoost and MLR

Have I defined this correctly?
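Whichever scaler you pick, the mechanics matter more than the model pairing: fit the scaler on the training split only, then transform both splits, so no test-set statistics leak into training. A minimal sketch with toy numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature values standing in for real building data.
X_train = np.array([[100.0], [200.0], [300.0], [400.0]])
X_test = np.array([[250.0], [500.0]])

# Fit on the training split only, then transform both splits.
mm = MinMaxScaler().fit(X_train)
std = StandardScaler().fit(X_train)

X_train_mm = mm.transform(X_train)    # training values land in [0, 1]
X_test_mm = mm.transform(X_test)      # unseen test values can fall outside [0, 1]
X_train_std = std.transform(X_train)  # zero mean, unit variance on the train split
```

Note that min-max scaling gives no guarantee about unseen data: here the test value 500 maps above 1, which a downstream model has to tolerate.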

You aren’t using a validation set? That’s puzzling.

I don’t agree with your normalization categories. NNs are not limited to min-max, and tree-based models like XGBoost are largely insensitive to feature scaling in the first place.

I see. I did that because my computer crashes if I run more than 30k rows, so I just run on 10k rows out of the 1 million. I am confused whether to use a normal train_test_split or TimeSeriesSplit, as my objective is regression on time-series data, but I cannot use an LSTM with a window size because the data is collected only within April–June each year, not daily or monthly.
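Given data collected only in April–June of each year, one simple chronological alternative to a window-based split is to hold out whole years: train on earlier years, test on the latest. A sketch with fabricated data (the "year" column name is an assumption about your frame):

```python
import numpy as np
import pandas as pd

# Toy frame: 25 April-June readings per year; column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": np.repeat([2018, 2019, 2020, 2021], 25),
    "energy_kwh": rng.normal(200.0, 25.0, 100),
})

# Chronological hold-out: train on earlier years, test on the most recent year.
test_year = df["year"].max()
train_df = df[df["year"] < test_year]
test_df = df[df["year"] == test_year]
```

This keeps the "past predicts future" structure that motivates TimeSeriesSplit, without requiring evenly spaced daily or monthly observations.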

And from what I understand, for normalization there are a few aspects to consider when choosing which type to use, such as:
-min-max
-z-score (standard scaler)
-log scale
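On the log-scale option: energy consumption is often right-skewed, and a common pattern is to log-transform the target before training and invert the transform on the predictions. A minimal sketch with made-up kWh values, using numpy's log1p/expm1 pair (which handles zeros safely):

```python
import numpy as np

# Toy, right-skewed target values (assumed kWh readings, not real data).
y = np.array([120.0, 340.0, 980.0, 15000.0])

y_log = np.log1p(y)       # compressed scale the model would be trained on
y_back = np.expm1(y_log)  # invert model predictions back to kWh
```

log1p/expm1 are exact inverses, so metrics can still be reported in the original units after back-transforming.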