GPU 0% usage on Kaggle even when I activate it

I have been training a neural network on Kaggle. To speed up training I enabled the GPU provided by the platform, but when I train the network, GPU usage stays at 0% and training is very slow (roughly 3 s/step and 5 minutes per epoch; this didn't happen to me before).

Additionally, in the logs, the following appears:

  • Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered.
  • Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered.
  • Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered.
  • All log messages before absl::InitializeLog() is called are written to STDERR.
  • successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero.
  • Skipping the delay kernel, measurement accuracy will be reduced.

Also, I placed the compile and training steps inside a GPU device scope:

import tensorflow
from tensorflow.keras.optimizers import Adam

if tensorflow.config.list_physical_devices('GPU'):
    print("GPU available")
    with tensorflow.device('/device:GPU:0'):
        # Compile step
        model.compile(
            optimizer=Adam(learning_rate=0.001),  # I will adjust this depending on how training goes
            loss="binary_crossentropy",  # it is binary classification
            metrics=["accuracy"]
        )

        # Training step
        history = model.fit(
            train_data,
            validation_data=val_data,
            epochs=10,
            callbacks=[ES]
        )

and I previously activated the GPU hardware on the platform, but the GPU still has 0% usage.
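To double-check that TensorFlow really sees the accelerator, a minimal test like this can be run (a sketch that only verifies device placement; it is not my training code):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)

# Place a small op explicitly and check where it actually ran.
device = '/device:GPU:0' if gpus else '/device:CPU:0'
with tf.device(device):
    x = tf.random.normal((1024, 1024))
    y = tf.matmul(x, x)
print(y.device)  # contains 'GPU:0' only when the accelerator is really attached
```

If `gpus` is an empty list here even with the accelerator toggled on, the notebook session itself never got the GPU.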

The data I use consists of only 3388 training images, and before training the model I resize the images to 128x128. The pretrained model is VGG16. I tried creating another, smaller network, but it's still super slow to train, and GPU usage stays at 0% even when I activate it.

Why is GPU usage on Kaggle still 0% (reflected in the 3 s/step training time) if I activated the GPU and ran the training inside the device scope?

You are training your model in a Kaggle notebook, right?

In Kaggle, if you go to the accelerator section, you should find the GPU option. You need to select that accelerator before you train the model; otherwise the notebook uses only the CPU that Kaggle assigns, hence the slower training.

Kaggle's GPU/TPU limit is 30 hours per week for each type of processor.

  • I activated the GPU before training the model; that's why it prints "GPU available" when I train the model.

  • Yes, Kaggle’s limit is 30 hours per week for each type of processor. In my case, the following appears:
    "Your accelerator quota:

    • GPU 22h 37m available of 30h
    • TPU 19h 42m available of 20h"

That means I can still use the GPU, right? But again, I rebooted the kernel, turned on the GPU, and trained the model, and training still takes too long; I already tried a smaller network, but it's still slow.


What seems strange to me are the log messages; they did not appear before. They started appearing at the same time training became this slow, even though I am using the GPU. I also tried the TPU, but it didn't work for me either.

I am training the network again, but it is still very slow: more than 1 hour of training even though there is little data, the network has few trainable parameters, and I activated the GPU.

Can you paste the warning message that appears after the first epoch of training?

The GPU or CPU is not the only possible reason a model trains slowly; there can be other causes: your dataset, your model, or, since you are working with about 3,000 images, whether the system is actually capable of that training.
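A rough way to check this is to time how long the generator itself needs to produce one batch; if that is a large fraction of the 3 s/step, the CPU-side input pipeline, not the GPU, is the bottleneck. A minimal sketch with a stand-in generator (not the original notebook code):

```python
import time
import numpy as np

def make_dummy_generator(batch_size, size):
    """Stand-in that mimics a Keras generator loading large images."""
    while True:
        images = np.random.rand(batch_size, size[0], size[1], 3).astype('float32')
        labels = np.random.randint(0, 2, batch_size)
        yield images, labels

gen = make_dummy_generator(batch_size=4, size=(600, 400))  # scaled down for the demo
start = time.perf_counter()
x, y = next(gen)
elapsed = time.perf_counter() - start
print(f"one batch took {elapsed:.3f}s, batch shape {x.shape}")
```

The same `next(...)`/timer pattern works on the real `train_data` iterator from the notebook.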

Sure, here are the messages:

  • Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered.
  • Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered.
  • Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  • successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero.
  • Skipping the delay kernel, measurement accuracy will be reduced.

By the way, I use data generator objects to manage and load the data (batch_size=32), and that never gave me problems before. However, these errors have appeared recently, not only with this dataset but with others too, and I don't know why. :frowning:

@DIANA7
I was asking for the message that pops up after epoch 1; it doesn't seem to match what you are stating.

By the way, where are you getting these messages? You didn't share which code produced these errors. A screenshot would help!

Perhaps (and I am not too knowledgeable about Kaggle) the data you are using is not suited to a GPU. Normally you would use a GPU for images or matrix-arranged data!

Sure, here are the screenshots of the code:


(Data is stored in batches)


Model:


Trainable parameters:


Callback EarlyStopping:


Training:


I found the following on the internet:

    1. The ImageDataGenerator does not return both the original data and the transformed data; the class only returns the randomly transformed data.
    2. We call this "in-place" and "on-the-fly" data augmentation because the augmentation is done at training time (i.e., we are not generating these examples ahead of time, prior to training).

The size of the images I'm using is "6000x4000". I don't know if that's a possible reason the training is so slow, since "ImageDataGenerator is a batch generator of tensor image data with real-time data augmentation", but I'm not sure.

That was always the way I loaded and stored the data (1024x1024 images, and so on), and it never caused this before.
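One option I have seen suggested (a sketch with synthetic stand-in data, not my notebook code) is to build the input pipeline with tf.data, so the expensive resizing runs in parallel on the CPU and the next batch is prefetched while the GPU is busy training:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Synthetic stand-ins for the real image files and labels.
images = tf.random.uniform((64, 600, 400, 3))
labels = tf.random.uniform((64,), maxval=2, dtype=tf.int32)

def preprocess(img, label):
    # Resize down early so the rest of the pipeline handles small tensors.
    img = tf.image.resize(img, (128, 128))
    return img, label

ds = (tf.data.Dataset.from_tensor_slices((images, labels))
      .map(preprocess, num_parallel_calls=AUTOTUNE)   # parallel CPU preprocessing
      .batch(32)
      .prefetch(AUTOTUNE))                            # prepare next batch during the GPU step

for x, y in ds.take(1):
    print(x.shape)
```

The resulting `ds` can be passed to `model.fit(...)` in place of the generator.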


However long it took to process your 1K x 1K images, the 6K x 4K images will take 24 times longer.

So the preprocessing of the data should happen before it is passed to the model? Because I believe Keras's ImageDataGenerator applies the transformations to the images as the model trains.
However, I removed the transformations such as rotation and flipping, keeping only the resize and rescaling in ImageDataGenerator (no additional processing), but the training time stays the same. :frowning:
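Another idea would be to resize the 6000x4000 originals once, offline, so training never has to decode and shrink the full-resolution files on every epoch. A minimal sketch (the directory paths and helper name here are placeholders, not from my notebook):

```python
from pathlib import Path
from PIL import Image

def resize_dir(src_dir, dst_dir, size=(128, 128)):
    """Resize every .jpg in src_dir once and save the result to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob('*.jpg')):
        with Image.open(path) as im:
            im.resize(size).save(dst / path.name)
        count += 1
    return count
```

After this one-time pass, ImageDataGenerator can point at the resized directory and each step only reads small files.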

@DIANA7

Can you confirm whether you followed the same steps?

How to connect kaggle notebook to gpu/tpu

I actually suspect your GPU is not getting connected in Kaggle.

I found a notebook about this; see if it helps you.

However, I have also come across reports that Kaggle runs slower than Colab.

Thanks! I’ll try it :slight_smile:


Let me know what happens once you check.

Hi, I followed the steps indicated in the notebook. However, when I save and run the notebook on Kaggle, it takes the same amount of time, and I have been waiting for over two hours :frowning:

@DIANA7

You will have to report the issue or raise a query with Kaggle for problems related to the Kaggle notebook; it could also be a system configuration issue. Also remember, as I said, Kaggle can run slowly; training on a set of 20 * 3000 images per epoch is a lot for the usual laptop or system you might be working on.

Also, regarding the other log info you shared:

This is a system warning; I found a link about it.

It explains that this log line is not an error but a warning that the system might have a single CPU socket (for the person who encountered similar output):

“Your PC is not multisocket, there is only single CPU socket with 8-core Xeon E5-2670 installed, so this id should be ‘0’ (single NUMA node is numbered as 0 in Linux), but the error message says that it was -1 value in this file!”

So check your system configuration.

Thank you very much for the help, really! :slight_smile:
