C3_W3_Lab_1_Distributed_Training

Hello there,
I changed the runtime type to GPU but I'm still getting the following error when echoing the log contents:

E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


I am getting that same error. Were you able to solve it?


No, I just ignored it.

I got the same error running it on Colab after the first worker was started, even though it was a CPU instance of Colab and GPUs were disabled:
2021-12-17 15:09:48.531960: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

I then ran the notebook on my local WSL2 machine immediately afterwards, and it showed the server had been started at localhost as expected:
2021-12-17 21:12:36.692738: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:12345

I don't understand why Colab behaves differently.

More importantly, is there a way to confirm that the 2 workers were actually working on the training? I don't know how to open grpc://localhost:12345 in the browser to see the contents. I just came across grpcUI, which looks like it would help, but the installation steps use 'go' and I need help with those. Has anyone used it?
Regards
Sanjoy


I've got the same issue on Colab… has anyone found a solution?

I think getting a GPU in the Colab environment is not guaranteed. You can run the notebook locally if your machine has a GPU.
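If you want to check up front whether a GPU is even visible to the machine, here is a minimal stdlib-only sketch (my own, not part of the lab; from inside TensorFlow, tf.config.list_physical_devices('GPU') gives the same answer):

```python
import shutil
import subprocess

# Look for the NVIDIA driver utility on PATH; when it's absent, TensorFlow
# logs the CUDA_ERROR_NO_DEVICE message seen above and falls back to the CPU.
smi = shutil.which("nvidia-smi")
if smi is None:
    print("nvidia-smi not found: no NVIDIA driver visible, training will run on CPU.")
else:
    # List the GPUs the driver can see, e.g. "GPU 0: Tesla T4 (UUID: ...)"
    print(subprocess.run([smi, "-L"], capture_output=True, text=True).stdout)
```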

Same here. I found that the problem also occurs in the official tutorial. As others have indicated, downloading the notebook and running it on your local machine is the current workaround until someone fixes Colab.

This is really sad!

  • From my understanding, Colab's runtime tries to run on a GPU by default; if no GPU is detected, it falls back to the CPU. You can safely ignore this error.

  • To confirm that the 2 workers were actually working on the training, set this environment variable at the start of the notebook:
    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'
    and don't run this line:
    tf.get_logger().setLevel('ERROR')
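To complement the logging tip above, a small stdlib-only sketch (the TF_CONFIG value shown is the two-worker layout the lab uses; variable names are my own) that parses the TF_CONFIG each worker process runs with and reports the cluster shape. If both worker log files then show the "has connected to coordination service" lines for task:0 and task:1, both workers took part in training:

```python
import json
import os

# Example TF_CONFIG as set for the chief (task index 0); in a real run each
# worker process already has its own copy of this variable in its environment.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
    "task": {"type": "worker", "index": 0},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]
task = tf_config["task"]
print(f"{len(workers)} workers configured: {workers}")
print(f"this process is {task['type']} #{task['index']}")
```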

I downloaded the notebook and then ran it with "jupyter nbconvert --execute C3_W3_Lab_1_Distributed_Training.ipynb".

Here is the Terminal output (Windows 10 WSL):

2022-11-25 22:02:47.132222: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:02:50.159322: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:02:50.159578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:02:50.159699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Writing mnist.py
/bin/bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
mnist.py
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step
2022-11-25 22:02:55.941533: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:02:55.941642: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:02:55.945605: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/3
70/70 [==============================] - 3s 28ms/step - loss: 2.2931 - accuracy: 0.1087
Epoch 2/3
70/70 [==============================] - 2s 27ms/step - loss: 2.2646 - accuracy: 0.1511
Epoch 3/3
70/70 [==============================] - 2s 27ms/step - loss: 2.2328 - accuracy: 0.2016
Writing main.py
/bin/bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
main.py mnist.py
All background processes were killed.
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:03.769194: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:04.915278: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915563: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:06.478942: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:06.479005: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:06.479747: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:06.531611: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:12345
2022-11-25 22:03:06.558280: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 8350604406956469984
2022-11-25 22:03:06.559727: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:14.046318: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:15.261391: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:15.261578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:15.261597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:16.903897: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:16.904001: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:16.905353: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:16.950720: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:23456
2022-11-25 22:03:16.970748: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
2022-11-25 22:03:18.453074: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:784] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_FLOAT
      type: DT_INT64
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 60000
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\024TensorSliceDataset:0"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 28
        }
        dim {
          size: 28
        }
      }
      shape {
      }
    }
  }
}
attr {
  key: "replicate_on_split"
  value {
    b: false
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_FLOAT
        }
      }
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT64
        }
      }
    }
  }
}

2022-11-25 22:03:18.881633: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
70/70 [==============================] - 11s 106ms/step - loss: 2.2763 - accuracy: 0.2213
Epoch 2/3
70/70 [==============================] - 8s 110ms/step - loss: 2.2331 - accuracy: 0.3901
Epoch 3/3
70/70 [==============================] - 7s 105ms/step - loss: 2.1870 - accuracy: 0.5227
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
[... startup warnings, the gRPC server start at grpc://localhost:12345, and the task:0 coordination-service connection, identical to the log above ...]
2022-11-25 22:03:16.970100: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 12621691600184511390
[... same AUTO sharding warning with the TensorSliceDataset graph dump, and the same three training epochs ending at loss 2.1870, accuracy 0.5227 ...]