GPU 0% usage on Kaggle even when I activate it

I have been training a neural network on Kaggle. To speed up training I enabled the GPU provided by the platform, but when I train the network, GPU usage stays at 0% and training is very slow (roughly 3 s/step and 5 minutes per epoch; this didn't happen to me before).

Additionally, in the logs, the following appears:

  • Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered.
  • Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered.
  • Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered.
  • All log messages before absl::InitializeLog() is called are written to STDERR.
  • successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero.
  • Skipping the delay kernel, measurement accuracy will be reduced.

Also, I placed the compile and training steps inside a GPU device scope:

import tensorflow
from tensorflow.keras.optimizers import Adam

if tensorflow.config.list_physical_devices('GPU'):
    print("GPU available")
    with tensorflow.device('/device:GPU:0'):
        # Compile step
        model.compile(
            optimizer=Adam(learning_rate=0.001),  # I will adjust this depending on how training goes
            loss="binary_crossentropy",  # it is binary classification
            metrics=["accuracy"]
        )

        # Training step
        history = model.fit(
            train_data,
            validation_data=val_data,
            epochs=10,
            callbacks=[ES]
        )

and I previously activated the GPU hardware on the platform, but the GPU still has 0% usage.
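To double-check that TensorFlow really sees the accelerator, a minimal test like this can be run (a sketch that only verifies device placement; it is not my training code):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)

# Place a small op explicitly and check where it actually ran.
device = '/device:GPU:0' if gpus else '/device:CPU:0'
with tf.device(device):
    x = tf.random.normal((1024, 1024))
    y = tf.matmul(x, x)
print(y.device)  # contains 'GPU:0' only when the accelerator is really attached
```

If `gpus` is an empty list here even with the accelerator toggled on, the notebook session itself never got the GPU.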

The data I use consists of only 3388 training images, and before training the model I resize the images to 128x128. The pretrained model is VGG16. I tried creating another, smaller network, but it's still super slow to train, and GPU usage stays at 0% even when I activate it.

Why is GPU usage on Kaggle still 0% (reflected in the 3 s/step training time) if I activated the GPU and ran the training inside the device scope?

You are training your model in a Kaggle notebook, right?

In Kaggle, if you go to the accelerator section, you should find the GPU option. You need to select that accelerator before you train the model; otherwise the notebook uses only the CPU that Kaggle assigns, hence the slower training.

Kaggle's GPU/TPU limit is 30 hours per week for each type of processor.

  • I activated the GPU before training the model; that's why it prints "GPU available" when I train the model.

  • Yes, Kaggle’s limit is 30 hours per week for each type of processor. In my case, the following appears:
    "Your accelerator quota:

    • GPU 22h 37m available of 30h
    • TPU 19h 42m available of 20h"

That means I can still use the GPU, right? But again, I rebooted the kernel, turned on the GPU, and trained the model, and training still takes too long; I already tried a smaller network, but it's still slow.


What seems strange to me are the log messages; they did not appear before. They started appearing at the same time training became this slow, even though I am using the GPU. I also tried the TPU, but it didn't work for me either.

I am training the network again, but it is still very slow: more than 1 hour of training even though there is little data, the network has few trainable parameters, and I activated the GPU.

Can you paste the warning message that appears after the first epoch of training?

The GPU or CPU is not the only possible reason a model trains slowly; there can be other causes: your dataset, your model, or, since you are working with about 3,000 images, whether the system is actually capable of that training.
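A rough way to check this is to time how long the generator itself needs to produce one batch; if that is a large fraction of the 3 s/step, the CPU-side input pipeline, not the GPU, is the bottleneck. A minimal sketch with a stand-in generator (not the original notebook code):

```python
import time
import numpy as np

def make_dummy_generator(batch_size, size):
    """Stand-in that mimics a Keras generator loading large images."""
    while True:
        images = np.random.rand(batch_size, size[0], size[1], 3).astype('float32')
        labels = np.random.randint(0, 2, batch_size)
        yield images, labels

gen = make_dummy_generator(batch_size=4, size=(600, 400))  # scaled down for the demo
start = time.perf_counter()
x, y = next(gen)
elapsed = time.perf_counter() - start
print(f"one batch took {elapsed:.3f}s, batch shape {x.shape}")
```

The same `next(...)`/timer pattern works on the real `train_data` iterator from the notebook.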

Sure, here are the messages:

  • Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered.
  • Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered.
  • Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  • successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero.
  • Skipping the delay kernel, measurement accuracy will be reduced.

By the way, I use data generator objects to manage and load the data (batch_size=32), and that never gave me problems before. However, these errors have appeared recently, not only with this dataset but with others too, and I don't know why. :frowning:

@DIANA7
I was asking for the message that pops up after epoch 1; it doesn't seem to match what you are stating.

By the way, where are you getting these messages? You didn't share which code produced these errors. A screenshot would help!

Perhaps (and I am not too knowledgeable about Kaggle) the data you are using is not suited to a GPU. Normally you would use a GPU for images or matrix-arranged data!

Sure, here are the screenshots of the code:


(Data is stored in batches)


Model:


Trainable parameters:


Callback EarlyStopping:


Training:


I found the following on the internet:

    1. The ImageDataGenerator does not return both the original data and the transformed data; the class only returns the randomly transformed data.
    2. We call this "in-place" and "on-the-fly" data augmentation because the augmentation is done at training time (i.e., we are not generating these examples ahead of time, prior to training).

The size of the images I'm using is "6000x4000". I don't know if that's a possible reason the training is so slow, since "ImageDataGenerator is a batch generator of tensor image data with real-time data augmentation", but I'm not sure.

That was always the way I loaded and stored the data (1024x1024 images, and so on), and it never caused this before.
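One option I have seen suggested (a sketch with synthetic stand-in data, not my notebook code) is to build the input pipeline with tf.data, so the expensive resizing runs in parallel on the CPU and the next batch is prefetched while the GPU is busy training:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Synthetic stand-ins for the real image files and labels.
images = tf.random.uniform((64, 600, 400, 3))
labels = tf.random.uniform((64,), maxval=2, dtype=tf.int32)

def preprocess(img, label):
    # Resize down early so the rest of the pipeline handles small tensors.
    img = tf.image.resize(img, (128, 128))
    return img, label

ds = (tf.data.Dataset.from_tensor_slices((images, labels))
      .map(preprocess, num_parallel_calls=AUTOTUNE)   # parallel CPU preprocessing
      .batch(32)
      .prefetch(AUTOTUNE))                            # prepare next batch during the GPU step

for x, y in ds.take(1):
    print(x.shape)
```

The resulting `ds` can be passed to `model.fit(...)` in place of the generator.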


However long it took to process your 1K x 1K images, the 6K x 4K images will take 24 times longer.

So the preprocessing of the data should happen before it is passed to the model? Because I believe Keras's ImageDataGenerator applies the transformations to the images as the model trains.
However, I removed the transformations such as rotation and flipping, keeping only the resize and rescaling in ImageDataGenerator (no additional processing), but the training time stays the same. :frowning:
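Another idea would be to resize the 6000x4000 originals once, offline, so training never has to decode and shrink the full-resolution files on every epoch. A minimal sketch (the directory paths and helper name here are placeholders, not from my notebook):

```python
from pathlib import Path
from PIL import Image

def resize_dir(src_dir, dst_dir, size=(128, 128)):
    """Resize every .jpg in src_dir once and save the result to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob('*.jpg')):
        with Image.open(path) as im:
            im.resize(size).save(dst / path.name)
        count += 1
    return count
```

After this one-time pass, ImageDataGenerator can point at the resized directory and each step only reads small files.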

@DIANA7

Can you confirm whether you followed the same steps?

How to connect kaggle notebook to gpu/tpu

I actually suspect your GPU is not getting connected in Kaggle.

I found a notebook about this; see if it helps you.

However, I have also come across reports that Kaggle runs slower than Colab.

Thanks! I’ll try it :slight_smile:


Let me know what happens once you check.

Hi, I followed the steps indicated in the notebook. However, when I save and run the notebook on Kaggle, it takes the same amount of time, and I have been waiting for over two hours :frowning:

@DIANA7

You will have to report the issue or raise a query with Kaggle for problems related to the Kaggle notebook; it could also be a system configuration issue. Also remember, as I said, Kaggle can run slowly; training on a set of 20 * 3000 images per epoch is a lot for the usual laptop or system you might be working on.

Also, regarding the other log info you shared:

This is a system warning; I found a link about it.

It explains that this log line is not an error but a warning that the system might have a single CPU socket (for the person who encountered similar output):

“Your PC is not multisocket, there is only single CPU socket with 8-core Xeon E5-2670 installed, so this id should be ‘0’ (single NUMA node is numbered as 0 in Linux), but the error message says that it was -1 value in this file!”

So check your system configuration.

Thank you very much for the help, really! :slight_smile:
