Running kerastuner locally with same code = significantly lower accuracy

I have just copied and pasted the code from the ‘Ungraded Lab: Intro to Keras Tuner’ and run it locally on my Mac (M2) a few times. Every time I get a validation accuracy at the last (10th) epoch of around 0.82, whereas the Google Colab version of the same lab achieves 0.88.

The same goes for the baseline model at the beginning of the lab: I consistently get a significantly lower score locally.

Looking at how the loss changes with each epoch, I can see that it starts growing immediately after the first epoch!

I have the same TensorFlow version (2.14.0) as the cloud lab environment.

Can anyone explain what might be the reason for this?


Model weights are randomly initialized, so it's common to see a slight difference in performance across different hardware. It's best to run your experiments on a machine close to the deployment setup.

Do see enable_op_determinism, which will help with reproducibility on the same hardware:

This means that if an op is run multiple times with the same inputs on the same hardware, it will have the exact same outputs each time.
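For anyone trying this in TF 2.14, here is a minimal sketch of what that looks like (the 42 seed is arbitrary):

import tensorflow as tf

# Seed the Python, NumPy and TensorFlow random generators in one call,
# then ask TensorFlow to pick deterministic op implementations.
tf.keras.utils.set_random_seed(42)
tf.config.experimental.enable_op_determinism()

Note that op determinism can slow training down, and it only guarantees reproducibility on the same hardware; it does not make a Colab GPU and an M2 produce identical numbers.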


Mmm… @Danil_Z, well, it might be:

  • Yes, the way the weights are initialized; these are indeed random. Try what @balaji.ambresh suggested, it is definitely one place to start. There is also a seed you can use.
  • Due to the hardware architecture. Colab runs on TPUs or Nvidia GPUs, and I am not sure whether your M2 installation is using the “neural engine/Core ML/Metal” or not. In any case, Nvidia GPUs and the M2 may have precision differences (FP32, FP16, FP64); tensor cores use mixed precision, meaning they can do operations in FP16 and accumulate results in FP32 for better accuracy. I am not sure about the M2 chip (ARM with Core ML/Metal).
    It depends on how Keras/TensorFlow interfaces with the hardware: the GPU, or on the M2 either the CPU or the Neural Engine (I am not sure how well it is implemented for this hardware). You might need more epochs to reach the same accuracy.
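If you want to see which device and which precision your local setup is actually using, a quick check along these lines can help (nothing here is specific to the lab):

import tensorflow as tf
from tensorflow import keras

# Devices this TensorFlow build can see; with tensorflow-metal installed,
# the Apple GPU shows up here, otherwise only the CPU is listed.
print(tf.config.list_physical_devices())

# Default compute/variable precision for Keras layers (float32 unless a
# mixed-precision policy has been set explicitly).
print(keras.mixed_precision.global_policy())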

Then there is Intel… anyhow, it is much the same story, only with a longer history, and I would expect you to get similar results there. But just so you know, computations carry accumulation and precision issues; it depends on how the libraries are implemented.

Keep in mind that Keras is a high-level API that runs on a TensorFlow backend (C++ and Python). The C++ side provides access to lower-level layers and APIs, and in the end it may use pre-compiled CUDA libraries, cuDNN, and some just-in-time (JIT) compilation.
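If you want to check what a particular TensorFlow binary was built against (CUDA and cuDNN versions on GPU builds; no CUDA entries on a macOS build), one way is:

import tensorflow as tf

# Version of the installed TensorFlow and details of how the binary was built;
# on CUDA builds this includes the CUDA and cuDNN versions it links against.
print(tf.version.VERSION)
print(tf.sysconfig.get_build_info())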

I would guess you should reach similar results, unless you cannot reach the same precision, and it may be problem dependent. I would try training longer, and also check the memory and the batch size… and paste your results here. It is very interesting.

Just FYI, I would assume the M2 might raise more errors simply because Apple silicon only appeared in 2020 and may not be fully supported yet, but I do not know any specifics. Apple says in its presentations that these chips can be used to train large models, so I assume they have a considerable task force making this work on the M2.

Please paste your results, and code if possible (it might not be allowed, I am not sure; since it is an ungraded lab it might be, so read the guidelines carefully before you do). I am sure others will paste their results and answer your question with more knowledge.

Just to be clear…
I tried to point out that there are differences behind the scenes; if you ran similar simulations on different platforms you could see them in detail. There is a lot of number crunching behind DL training. As I mentioned, I would expect similar results, and definitely not a 10-point difference! I do not know that package, but it basically tunes hyperparameters.

My guess is that when @Danil_Z hit the problem, the data might have been different: different splits, different training inputs (sizes), and different initialization, so, as you pointed out, fixing a seed is important. He is also using a library that literally searches hyperparameters randomly or in an ordered fashion, so I would look at that first.
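For what it is worth, the keras_tuner search itself can be pinned down with its seed argument. A minimal sketch (this build_model is a stand-in, not the lab's hypermodel):

import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # Stand-in hypermodel: only the hidden-layer width is searched.
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(hp.Int('units', min_value=32, max_value=512, step=32),
                           activation='relu'),
        keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# seed fixes the sequence of hyperparameter combinations the search tries,
# so two runs of the search explore the same candidates.
tuner = kt.RandomSearch(build_model, objective='val_accuracy',
                        max_trials=5, seed=42, overwrite=True)

That removes one source of run-to-run variation, though not the variation inside each individual training run.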

I am not aware of differences in results between platforms, beyond the time the training takes, but I think it is possible. Keep in mind that we are talking about training, not deployment.

Thank you all for the input and the different ideas.

Weight initialization, seeding, and different silicon architectures are all valid explanations for a subtle difference, but what I was getting was a consistent 10% accuracy drop while running the exact same code on the exact same data.

I have noticed this TensorFlow accuracy drop across all the previous benchmarks I was running, and I'm not sure whether the reason is my M2 Mac or the TF build for Mac.

I have tried the ‘hello world’ open-source datasets whose scores are well known (MNIST, Fashion-MNIST, …) and noticed that my results are significantly lower; this drove me crazy for half a day.

The solution was unexpected: I just changed the activation function of the hidden layer from relu to tanh and everything went back to normal…

Here is the code that kept me up all night.
It really is just basic boilerplate for playing around with Fashion-MNIST.

from tensorflow import keras

# Download the dataset and split into train and test sets
(img_train, label_train), (img_test, label_test) = keras.datasets.fashion_mnist.load_data()

# Normalize pixel values between 0 and 1
img_train = img_train.astype('float32') / 255.0
img_test = img_test.astype('float32') / 255.0

# Number of training epochs.
NUM_EPOCHS = 10

# ------------------ Baseline Model ------------------ #
# Build the baseline model using the Sequential API
b_model = keras.Sequential()
b_model.add(keras.layers.Flatten(input_shape=(28, 28)))
b_model.add(keras.layers.Dense(units=512, activation='tanh', name='dense_1')) # You will tune this layer later
b_model.add(keras.layers.Dropout(0.2))
b_model.add(keras.layers.Dense(10, activation='softmax'))

# Print model summary
b_model.summary()

# Setup the training parameters
b_model.compile(optimizer=keras.optimizers.legacy.Adam(learning_rate=0.001),
                loss=keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'])

# Train the model
b_model.fit(img_train, label_train, epochs=NUM_EPOCHS, validation_split=0.2)

Changing relu to tanh is usually not done as part of hyperparameter tuning of hidden dense layers; relu and its variants are preferred as hidden-layer activations for dense layers due to speed.

Could you please try this?

  1. In b_model.fit
    a. Set shuffle=False
    b. Since there are 60000 training datapoints, set batch_size = 30
    c. Remove validation_split=0.2
  2. On Colab, train the model for 1 epoch and save the model using save_model.
  3. Download the saved model to your M2 machine, load it via load_model, and train for 9 more epochs. Set initial_epoch to 1 in model.fit so that training runs 9 epochs instead of 10 (see the sketch at the end of this post).
  4. Let the model run for 9 more epochs on Colab as well.

Share the histories.

Assuming that the M2 and Colab are running the same versions of tensorflow, keras and numpy:

  1. If the histories of the last 9 epochs are identical, initialization is the only difference, and so tanh might be a good choice if you want to use the M2 for production purposes.
  2. If that's not the case, there is likely a problem with the TensorFlow build for the M2.
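A sketch of steps 1–3 above, reusing b_model, img_train, label_train and NUM_EPOCHS from the code posted earlier in the thread (the file name is just an example):

from tensorflow import keras

# --- On Colab: one epoch with a fixed data order, then save the whole model ---
b_model.fit(img_train, label_train, epochs=1, shuffle=False, batch_size=30)
keras.models.save_model(b_model, 'baseline_epoch1.keras')  # example file name

# --- On the M2, after copying the file over: resume from the saved state ---
m2_model = keras.models.load_model('baseline_epoch1.keras')
history = m2_model.fit(img_train, label_train,
                       epochs=NUM_EPOCHS,   # 10, as defined earlier
                       initial_epoch=1,     # continues with epochs 2..10 (9 more)
                       shuffle=False, batch_size=30)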

Hi Balaji,

I have done some research on the problem and discovered that the culprit is the tensorflow-metal package for macOS, which enables GPU support for TensorFlow.

Running the same code on the same local MacBook in an environment without tensorflow-metal indeed fixes the problem: all validation metrics are as expected.
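For anyone hitting the same issue: from Python you can check whether the Metal plugin is exposing the GPU, and hide it for a run without uninstalling the package (this has to happen before any op touches the GPU). Creating a separate environment without tensorflow-metal, as described above, is the cleaner fix.

import tensorflow as tf

# With tensorflow-metal installed, the Apple GPU is listed here as a GPU device.
print(tf.config.list_physical_devices('GPU'))

# Hide the GPU so this process runs on CPU only; must be called before
# TensorFlow initializes the device.
tf.config.set_visible_devices([], 'GPU')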

Thank you for all your input, the issue can be closed.


Thanks for confirming, Danil.