DNN model occasionally gets terrible error

So I am playing around with a dataset of my own to test my knowledge from week 3 of the Machine Learning Specialization.

I am currently plotting a graph of my model's performance for different amounts of training data. The idea is inspired by a graph in the optional lab under the "bias and variance" section.

I have a function which gets data from my database according to how many samples I want. My problem is that every once in a while, as you can see in the following image, the model gets a terrible r^2 score (high error) on both the training set and the validation set.
Here is a written example: why does the model fail at 5000 training examples when it works as expected at 4000 and 6000? The model has around 60 input features.
(the written example is not from the same run as the following image)
4000 training samples:
Training r2: 0.28
Testing r2: 0.14
5000 training samples:
Training r2: -0.04
Testing r2: -0.05
6000 training samples:
Training r2: 0.23
Testing r2: 0.18

[image: learning-curve plot of training and validation r^2 vs. number of training samples, with a sharp drop at one size]

EDIT:
Running it again, I get only one such error across my 30 different training-set sizes:
[image: second learning-curve run, with a single drop at 5000 training samples]

I believe that looks like a pretty healthy model, except for the error at 5000 training samples. Feel free to give me other thoughts about my model! Do you agree that getting more data would not be helpful in this case, and that I should probably try a more complex model?
Appreciate any feedback!

There may be an issue in the construction or organization of your data set, or in the method you're using to feed it into training or validation.

When you use a learning curve method (where you vary the size of the training set), the order in which the samples are sorted can make a big difference.

The examples should be randomly shuffled, and you may need to do this several times and average the results.
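
For example, here is a minimal sketch of that idea (using scikit-learn, with a Ridge model standing in for your network, and X, y as placeholder names for your full arrays):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def averaged_learning_point(X, y, n_train, n_repeats=5):
    # Average the validation r2 over several random shuffles for one training-set size
    scores = []
    for r in range(n_repeats):
        # Shuffle differently on every repeat, then split off a validation set
        x_tr, x_val, y_tr, y_val = train_test_split(X, y, train_size=n_train, random_state=r)
        model = Ridge().fit(x_tr, y_tr)  # stand-in for the neural network
        scores.append(r2_score(y_val, model.predict(x_val)))
    return np.mean(scores)

# Synthetic data just to show it runs; use your own arrays instead
X_demo = np.random.rand(3000, 63)
y_demo = X_demo @ np.random.rand(63)
print(averaged_learning_point(X_demo, y_demo, n_train=1000))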

Thanks for the reply!

I do in fact randomize the training examples. This is from my getData-function:
[image: code screenshot of the getData function, including the shuffling step]

This is how I get the data and split it into training and validation (called test):

It strikes me as statistically improbable that it so often drops to an r^2 of around 0, yet rarely shows smaller but still significant dips (say down to 0.10 or 0.20 in this case). Also, with 26k training examples it seems very unlikely that the model would get such a low r^2 by pure chance, and it happens even when predicting on the training set.
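
For reference, an r^2 of 0 is exactly what you would get by always predicting the mean of the labels (a quick check with sklearn's r2_score):

import numpy as np
from sklearn.metrics import r2_score

y = np.array([1.0, 2.0, 3.0, 4.0])
# A constant prediction equal to the mean scores exactly 0
print(r2_score(y, np.full_like(y, y.mean())))  # 0.0

So the drops look less like ordinary noise and more like the model ending up at a near-constant prediction.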

Hello @torbjorn_Dahl,

If I compare the 2nd and the 3rd images in your first post of this thread, all the "valleys" disappeared in the 3rd image, but a new "valley" appears at 5000. If you have not changed any code between these two results, then the results are not quite reproducible, or they might have some dependence we have overlooked, such that rerunning the code (without restarting the kernel) changed that dependence and led to the new results.

If you did change some code before getting the 3rd image, then you must have discovered something that can cause "valleys", and I think that is a very important lead. Maybe the problem has not been fully removed?

I don't think we should expect those "valleys", but to really pinpoint the problem, I'm afraid I will need to see the code myself. Would you mind sharing the code (or a minimal reproducible version of it) and three datasets (the 4000, 5000, and 6000 ones; they have to be the ones you used to make the plot, because I want to reproduce that valley myself)?

Cheers,
Raymond

I finally managed to create a reproducible example by saving the shuffled data from all the queries of a run (1k to 29k training examples) that produced the issue, and by setting TensorFlow's random seed.

However, for whatever reason, the error does not appear when I create and test the model on only the problematic dataset and its adjacent datasets (the 23k, 24k, and 25k datasets, with 24k having the error), but only when I run the model on all the datasets from 1k up to 29k training examples.

Here is the code to reproduce the error, and I am sending you "my_static_data" as a .pickle file, as I am not allowed to attach it here. EDIT: I am not allowed to send it either. I gotta run, but what file format would you prefer?
FYI: my_static_data is an array of length 29, and each element is an array of length 4: x_train, x_test, y_train, y_test.
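To load it (assuming the file ends up named my_static_data.pickle; the name is up to you):

import pickle

with open('my_static_data.pickle', 'rb') as f:
    my_static_data = pickle.load(f)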

# Set the random seed
tf.random.set_seed(123)

train_r2s = []
test_r2s = []

for dataset in my_static_data:
  # Get the data, which is the same each time (verified)
  x_train, x_test, y_train, y_test = dataset
  # Scale the data
  scaler = StandardScaler()
  x_train_scaled = scaler.fit_transform(x_train)
  x_test_scaled = scaler.transform(x_test)
  # Create the model
  model = Sequential([
      Dense(30, activation='relu', kernel_regularizer=L2(0.05) ),
      Dense(15, activation='relu', kernel_regularizer=L2(0.05)),
      Dense(3, activation='relu', kernel_regularizer=L2(0.05)),
      Dense(1, activation='linear')
  ])
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mean_squared_error')
  history = model.fit(x_train_scaled, y_train, validation_data=(x_test_scaled, y_test), epochs=20, verbose=0)
  # Predict on train and test set
  y_train_pred = model.predict(x_train_scaled)
  y_test_pred = model.predict(x_test_scaled)
  
  model_train_r2 = r2_score(y_train, y_train_pred)
  model_test_r2 = r2_score(y_test, y_test_pred)
  print(f"Training r2: {model_train_r2.round(2)}")
  print(f"Testing  r2: {model_test_r2.round(2)}")
  train_r2s.append(model_train_r2)
  test_r2s.append(model_test_r2)

plt.figure()
plt.plot(np.arange(len(my_static_data)), train_r2s)
plt.plot(np.arange(len(my_static_data)), test_r2s)
plt.legend(['Training r2', 'Validation r2'])

If you run it, it should look like this:
[image: resulting r^2 plot, with the drop at 24k training samples]

Pickle is fine. If you zip the pickle file, can you send it to me via a direct message (click my profile and hit message)?

Raymond

Just 2 quick questions:

  1. If you change the seed value to 234, does the error shift?

  2. What are the shapes of x_train and x_test?

I have sent a Google Drive link to the pickle file via DM.

  1. Yes, the error shifts with another seed value. (Note: in my experience there can be anywhere from zero to several such errors per run.)
  2. Here are the shapes of the different sizes of x_train and x_test:

Index | x_train shape | x_test shape
0  | (800, 63)   | (200, 63)
1  | (1600, 63)  | (400, 63)
2  | (2400, 63)  | (600, 63)
3  | (3200, 63)  | (800, 63)
4  | (4000, 63)  | (1000, 63)
5  | (4800, 63)  | (1200, 63)
6  | (5600, 63)  | (1400, 63)
7  | (6400, 63)  | (1600, 63)
8  | (7200, 63)  | (1800, 63)
9  | (8000, 63)  | (2000, 63)
10 | (8800, 63)  | (2200, 63)
11 | (9600, 63)  | (2400, 63)
12 | (10400, 63) | (2600, 63)
13 | (11200, 63) | (2800, 63)
14 | (12000, 63) | (3000, 63)
15 | (12800, 63) | (3200, 63)
16 | (13600, 63) | (3400, 63)
17 | (14400, 63) | (3600, 63)
18 | (15200, 63) | (3800, 63)
19 | (16000, 63) | (4000, 63)
20 | (16800, 63) | (4200, 63)
21 | (17600, 63) | (4400, 63)
22 | (18400, 63) | (4600, 63)
23 | (19200, 63) | (4800, 63)
24 | (20000, 63) | (5000, 63)
25 | (20800, 63) | (5200, 63)
26 | (21600, 63) | (5400, 63)
27 | (22400, 63) | (5600, 63)
28 | (23200, 63) | (5800, 63)

Hello @torbjorn_Dahl,

I cannot reproduce the same plot as you have shared. To help us continue this, I am making the following suggestions.

  1. Generally speaking, if we want to study the effect of the training set, we want to make sure that, among all models, only the training set changes, so that we can credit any performance difference to the training sets. I have therefore modified your code to:

    • set the seed inside the loop so that all models are initialized the same way (and you need to verify this)
    • test all trained models with the same test set

    The modified code is included at the end.

  2. I limited the run to the first 14 datasets for faster results, and the results are REPRODUCIBLE. Although no sudden drops appear here, please repeat the test with more datasets and let me know if the drops happen again.

  3. If we exchange code in the future, please also include the import lines :wink: I had to assume you were using sklearn's r2_score in order to start testing right away.

  4. sklearn's r2_score forces the result to zero in a special scenario (explained in the docs). I have therefore added MSE as another reference score. Let's see whether both scores change drastically at the same time or not.

  5. Finally, I need you to define a test set that is large AND never overlaps with any training set. I suggest you first divide your full dataset into a train set and a test set, and then take subsets of different sizes from the train set for your experiments (see the sketch after the modified code).

After you implement all of my suggestions, you will have code that is REPRODUCIBLE (!! important). If you spot a drop again, I can run your latest code on my machine to look into it.

Cheers,
Raymond

import pickle
import tensorflow as tf
from matplotlib import pyplot as plt
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2
from sklearn.metrics import r2_score, mean_squared_error ### CHANGED
from sklearn.preprocessing import StandardScaler

import random
import numpy as np

def set_seed(seed=100):
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    
with open('data.pickle', 'rb') as f:
    my_static_data = pickle.load(f)

# Set the random seed
# tf.random.set_seed(123) ### CHANGED

# train_r2s = [] ### CHANGED
# test_r2s = [] ### CHANGED
scores = [] ### CHANGED

_, x_test, _, y_test = my_static_data[0] ### CHANGED, the standard test set

for dataset in my_static_data[:14]: ### CHANGED
  set_seed(100) ### CHANGED
  # Get the data, which is the same each time (verified)
  x_train, _, y_train, _ = dataset ### CHANGED
  print('training set size', x_train.shape) ### CHANGED
  # Scale the data
  scaler = StandardScaler()
  x_train_scaled = scaler.fit_transform(x_train)
  x_test_scaled = scaler.transform(x_test)
  # Create the model
  model = Sequential([
      Dense(30, activation='relu', kernel_regularizer=L2(0.05) ),
      Dense(15, activation='relu', kernel_regularizer=L2(0.05)),
      Dense(3, activation='relu', kernel_regularizer=L2(0.05)),
      Dense(1, activation='linear')
  ])
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mean_squared_error')
  history = model.fit(x_train_scaled, y_train, validation_data=(x_test_scaled, y_test), epochs=20, verbose=0)
  # Predict on train and test set
  y_train_pred = model.predict(x_train_scaled)
  y_test_pred = model.predict(x_test_scaled)

  ### all CHANGED below
  scores.append([
    r2_score(y_train, y_train_pred),
    r2_score(y_test, y_test_pred),
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)
  ])
  
  # model_train_r2 = r2_score(y_train, y_train_pred)
  # model_test_r2 = r2_score(y_test, y_test_pred)
  # print(f"Training r2: {model_train_r2.round(2)}")
  # print(f"Testing  r2: {model_test_r2.round(2)}")
  # train_r2s.append(model_train_r2)
  # test_r2s.append(model_test_r2)

# plt.figure()
# plt.plot(np.arange(len(my_static_data)), train_r2s)
# plt.plot(np.arange(len(my_static_data)), test_r2s)
# plt.legend(['Training r2', 'Validation r2'])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))
ax1.plot(np.array(scores)[:,:2])
ax2.plot(np.array(scores)[:,2:])
ax1.legend(['Training r2', 'Validation r2'])
ax2.legend(['Training mse', 'Validation mse'])
plt.show()
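
For suggestion 5, here is a minimal sketch of what I mean (the synthetic X_full and y_full are placeholders; swap in your real arrays):

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data just to make the sketch runnable; replace with your real arrays
X_full = np.random.rand(29000, 63)
y_full = np.random.rand(29000)

# 1) Hold out one fixed test set that never overlaps with any training subset
x_pool, x_test, y_pool, y_test = train_test_split(X_full, y_full, test_size=5000, random_state=42)

# 2) Draw training subsets of increasing size from the remaining pool only
#    (train_test_split has already shuffled the pool)
train_sizes = range(1000, len(x_pool) + 1, 1000)
train_subsets = [(x_pool[:n], y_pool[:n]) for n in train_sizes]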

Oh, I didn't ask about your Python, TensorFlow and NumPy versions. What are they?

Interesting, it seems to be related to the placement of set_seed.
In the code you provided, if you move set_seed out of the for loop and change the seed value to 124, you get the error at 8k.
It doesn't matter which test set is used; the error appears either way.

Sorry about the import statements :sweat_smile: Will remember that for next time.

Versions:
Python 3.8
Numpy 1.21.6
Tensorflow 2.9.2

This seems to need some investigation. I found one such case, and when I plot the error-vs-label graphs (left for the training set, right for the test set), I find that the error spread is particularly small for the "drop" case but not for the "normal" cases (I looked at a couple of normal cases). What do you think? Have you tried anything to investigate this?

Raymond

  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))
  ax1.scatter(y_train, y_train - y_train_pred.flatten(), s=0.5)
  ax2.scatter(y_test, y_test - y_test_pred.flatten(), s=0.5)
  ax1.set_ylim(-10,15)
  ax2.set_ylim(-10,15)
  plt.show()

Interesting!
I think I can accept that it has to do with the seed placement for now, keep practicing, and return to this if it continues to be a problem.

Thank you very much for helping me out.

OK. It is worth more investigation, but it's your call :slight_smile:

And just one last point: even though the seed placement lets you see the drop, placing it inside the loop is there to initialize the neural networks to the same set of weights for fair experiments. I understand your concern is more about the drops, which we decided to pause investigating, but the placement of the seed matters as well.
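
If it helps, here is one way to verify the identical-initialization part (reusing the imports and the set_seed helper from the code I shared earlier; build_model is just a wrapper for this check):

def build_model():
    return Sequential([
        Dense(30, activation='relu', kernel_regularizer=L2(0.05)),
        Dense(15, activation='relu', kernel_regularizer=L2(0.05)),
        Dense(3, activation='relu', kernel_regularizer=L2(0.05)),
        Dense(1, activation='linear')
    ])

set_seed(100)
m1 = build_model()
m1.build(input_shape=(None, 63))

set_seed(100)
m2 = build_model()
m2.build(input_shape=(None, 63))

# With the seed reset before each build, every initial weight tensor should match exactly
print(all(np.array_equal(w1, w2) for w1, w2 in zip(m1.get_weights(), m2.get_weights())))  # expect True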

Good luck!

Raymond