I got the same error when running it on Colab after the first worker was started, even though it was a CPU instance of Colab with GPUs disabled:
2021-12-17 15:09:48.531960: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
I then ran the notebook on my local WSL2 machine immediately afterwards, and it showed the server had been started at localhost as expected:
2021-12-17 21:12:36.692738: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:12345
I don't understand why Colab behaves differently here.
More importantly, is there a way to confirm that both workers were actually working on the training? I do not know how to open grpc://localhost:12345 in the browser to see its contents. I just came across grpcUI; it looks like it would help, but the installation steps require Go, which I need help with. Has anyone used it?
Regards
Sanjoy
Same here. I found that the problem also occurs with the official tutorial. As others have indicated, downloading the notebook and running it on a local machine is the current workaround until someone fixes Colab.
From my understanding, Colab's default executor first tries to run on a GPU, and if no GPU is detected it falls back to the CPU, so you can just ignore this error.
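If you want to double-check that the runtime really is CPU-only, a quick sanity check with the standard TensorFlow device API (not specific to this lab) is:
import tensorflow as tf
# On a CPU-only Colab or WSL2 instance this prints an empty list, so the cuInit error is harmless
print(tf.config.list_physical_devices('GPU'))
# These are the CPU devices TensorFlow will actually use
print(tf.config.list_physical_devices('CPU'))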
To confirm whether the 2 workers were actually working on the training, set this environment variable at the start of the notebook:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'
and don't run this line:
tf.get_logger().setLevel('ERROR')
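If you also want to confirm it programmatically, a minimal sketch like the following (run inside a worker script, assuming TF_CONFIG has already been set the way the lab does) prints how many workers the strategy sees; the "has connected to coordination service" lines for task:0 and task:1 in the worker logs show the same thing once the log level is 0:
import json
import os
import tensorflow as tf
# The strategy reads the cluster definition from TF_CONFIG and joins the other workers
strategy = tf.distribute.MultiWorkerMirroredStrategy()
tf_config = json.loads(os.environ['TF_CONFIG'])
print('workers declared in TF_CONFIG:', len(tf_config['cluster']['worker']))
print('replicas in sync:', strategy.num_replicas_in_sync)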
I downloaded the notebook and then ran it with 'jupyter nbconvert --execute C3_W3_Lab_1_Distributed_Training.ipynb'.
Here is the Terminal output (Windows 10 WSL):
2022-11-25 22:02:47.132222: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:02:50.159322: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:02:50.159578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:02:50.159699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Writing mnist.py
/bin/bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
mnist.py
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step
2022-11-25 22:02:55.941533: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:02:55.941642: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:02:55.945605: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/3
70/70 [==============================] - 3s 28ms/step - loss: 2.2931 - accuracy: 0.1087
Epoch 2/3
70/70 [==============================] - 2s 27ms/step - loss: 2.2646 - accuracy: 0.1511
Epoch 3/3
70/70 [==============================] - 2s 27ms/step - loss: 2.2328 - accuracy: 0.2016
Writing main.py
/bin/bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
main.py mnist.py
All background processes were killed.
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:03.769194: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:04.915278: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915563: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:06.478942: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:06.479005: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:06.479747: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:06.531611: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:12345
2022-11-25 22:03:06.558280: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 8350604406956469984
2022-11-25 22:03:06.559727: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:14.046318: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:15.261391: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:15.261578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:15.261597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:16.903897: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:16.904001: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:16.905353: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:16.950720: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:23456
2022-11-25 22:03:16.970748: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
2022-11-25 22:03:18.453074: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:784] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_FLOAT
type: DT_INT64
}
}
}
attr {
key: "_cardinality"
value {
i: 60000
}
}
attr {
key: "is_files"
value {
b: false
}
}
attr {
key: "metadata"
value {
s: "\n\024TensorSliceDataset:0"
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 28
}
dim {
size: 28
}
}
shape {
}
}
}
}
attr {
key: "replicate_on_split"
value {
b: false
}
}
experimental_type {
type_id: TFT_PRODUCT
args {
type_id: TFT_DATASET
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_FLOAT
}
}
args {
type_id: TFT_TENSOR
args {
type_id: TFT_INT64
}
}
}
}
}
2022-11-25 22:03:18.881633: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
70/70 [==============================] - 11s 106ms/step - loss: 2.2763 - accuracy: 0.2213
Epoch 2/3
70/70 [==============================] - 8s 110ms/step - loss: 2.2331 - accuracy: 0.3901
Epoch 3/3
70/70 [==============================] - 7s 105ms/step - loss: 2.1870 - accuracy: 0.5227
bash: /home/ubuntu/miniconda3/envs/tf/lib/libtinfo.so.6: no version information available (required by bash)
2022-11-25 22:03:03.769194: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:04.915278: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915563: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/ubuntu/miniconda3/envs/tf/lib/
2022-11-25 22:03:04.915588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-11-25 22:03:06.478942: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-25 22:03:06.479005: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ComputerName): /proc/driver/nvidia/version does not exist
2022-11-25 22:03:06.479747: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 22:03:06.531611: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://localhost:12345
2022-11-25 22:03:06.558280: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 8350604406956469984
2022-11-25 22:03:06.559727: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
2022-11-25 22:03:16.970100: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:502] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 12621691600184511390
2022-11-25 22:03:18.449813: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:784] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
key: "Toutput_types"
value {
list {
type: DT_FLOAT
type: DT_INT64
}
}
}
attr {
key: "_cardinality"
value {
i: 60000
}
}
attr {
key: "is_files"
value {
b: false
}
}
attr {
key: "metadata"
value {
s: "\n\024TensorSliceDataset:0"
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 28
}
dim {
size: 28
}
}
shape {
}
}
}
}
attr {
key: "replicate_on_split"
value {
b: false
}
}
experimental_type {
type_id: TFT_PRODUCT
args {
type_id: TFT_DATASET
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_FLOAT
}
}
args {
type_id: TFT_TENSOR
args {
type_id: TFT_INT64
}
}
}
}
}
2022-11-25 22:03:18.882790: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch 1/3
70/70 [==============================] - 11s 106ms/step - loss: 2.2763 - accuracy: 0.2213
Epoch 2/3
70/70 [==============================] - 8s 110ms/step - loss: 2.2331 - accuracy: 0.3901
Epoch 3/3
70/70 [==============================] - 7s 105ms/step - loss: 2.1870 - accuracy: 0.5227