C3_W3_Lab_1_Distributed_Training

Hello there,
I changed the runtime type to GPU but I'm still getting the following error when echoing the log contents:

E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


I am getting that same error. Were you able to solve it?


No, I just ignored it.

I got the same error running it on Colab after the first worker was started, even though it was a CPU instance of Colab and GPUs were disabled:
2021-12-17 15:09:48.531960: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

I then ran the notebook on my local WSL2 machine immediately afterwards, and it showed the server had been started at localhost as expected:
2021-12-17 21:12:36.692738: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:12345

I don't understand why Colab behaves differently.

More importantly, is there a way to confirm that the 2 workers were actually working on the training? I don't know how to open grpc://localhost:12345 in the browser to see the contents. I just came across grpcUI, which looks like it would help, but the installation steps use 'go' and I need help with those. Has anyone used it?
Regards
Sanjoy


I've got the same issue on Colab… has anyone found a solution?

I think getting a GPU in the Colab environment is not guaranteed. You can run the notebook locally if your machine has a GPU.
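If you want to check up front whether a GPU is even visible to the machine, here is a minimal stdlib-only sketch (my own, not part of the lab; from inside TensorFlow, tf.config.list_physical_devices('GPU') gives the same answer):

```python
import shutil
import subprocess

# Look for the NVIDIA driver utility on PATH; when it's absent, TensorFlow
# logs the CUDA_ERROR_NO_DEVICE message seen above and falls back to the CPU.
smi = shutil.which("nvidia-smi")
if smi is None:
    print("nvidia-smi not found: no NVIDIA driver visible, training will run on CPU.")
else:
    # List the GPUs the driver can see, e.g. "GPU 0: Tesla T4 (UUID: ...)"
    print(subprocess.run([smi, "-L"], capture_output=True, text=True).stdout)
```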

Same here. I found that the problem also occurs in the official tutorial. As others have indicated, downloading the notebook and running it on your local machine is the current workaround until someone fixes Colab.

This is really sad!

  • From my understanding, Colab's runtime tries to run on a GPU by default; if no GPU is detected, it falls back to the CPU. You can safely ignore this error.

  • To confirm that the 2 workers were actually working on the training, set this environment variable at the start of the notebook:
    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'
    and don't run this line:
    tf.get_logger().setLevel('ERROR')
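To complement the logging tip above, a small stdlib-only sketch (the TF_CONFIG value shown is the two-worker layout the lab uses; variable names are my own) that parses the TF_CONFIG each worker process runs with and reports the cluster shape. If both worker log files then show the "has connected to coordination service" lines for task:0 and task:1, both workers took part in training:

```python
import json
import os

# Example TF_CONFIG as set for the chief (task index 0); in a real run each
# worker process already has its own copy of this variable in its environment.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
    "task": {"type": "worker", "index": 0},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]
task = tf_config["task"]
print(f"{len(workers)} workers configured: {workers}")
print(f"this process is {task['type']} #{task['index']}")
```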

I downloaded the notebook and then ran it with "jupyter nbconvert --execute C3_W3_Lab_1_Distributed_Training.ipynb".

Here is the Terminal output (Windows 10 WSL):

2022-11-25 22:02:47.132222: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:02:50.159322: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:02:50.159578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:02:50.159699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Writing mnist.py
/bin/bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
mnist.py
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step
2022-11-25 22:02:55.941533: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:02:55.941642: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:02:55.945605: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/3
70/70 [==============================] - 3s 28ms/step - loss: 2.2931 - accuracy: 0.1087
Epoch 2/3
70/70 [==============================] - 2s 27ms/step - loss: 2.2646 - accuracy: 0.1511
Epoch 3/3
70/70 [==============================] - 2s 27ms/step - loss: 2.2328 - accuracy: 0.2016
Writing main.py
/bin/bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
main.py mnist.py
All background processes were killed.
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:03.769194: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:04.915278: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915563: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:06.478942: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:06.479005: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:06.479747: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:06.531611: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:12345
2022-11-25 22:03:06.558280: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 8350604406956469984
2022-11-25 22:03:06.559727: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:14.046318: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:15.261391: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:15.261578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:15.261597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:16.903897: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:16.904001: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:16.905353: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:16.950720: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:23456
2022-11-25 22:03:16.970748: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
2022-11-25 22:03:18.453074: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:784] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_FLOAT
      type: DT_INT64
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 60000
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\024TensorSliceDataset:0"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 28
        }
        dim {
          size: 28
        }
      }
      shape {
      }
    }
  }
}
attr {
  key: "replicate_on_split"
  value {
    b: false
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_FLOAT
        }
      }
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT64
        }
      }
    }
  }
}

2022-11-25 22:03:18.881633: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
70/70 [==============================] - 11s 106ms/step - loss: 2.2763 - accuracy: 0.2213
Epoch 2/3
70/70 [==============================] - 8s 110ms/step - loss: 2.2331 - accuracy: 0.3901
Epoch 3/3
70/70 [==============================] - 7s 105ms/step - loss: 2.1870 - accuracy: 0.5227
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
[... startup warnings, the gRPC server start at grpc://localhost:12345, and the task:0 coordination-service connection, identical to the log above ...]
2022-11-25 22:03:16.970100: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 12621691600184511390
[... same AUTO sharding warning with the TensorSliceDataset graph dump, and the same three training epochs ending at loss 2.1870, accuracy 0.5227 ...]