I downloaded the lab notebook C3W3_Colab_Lab1_Distributed_Training.ipynb to my two machines.
I connected the two PCs to a local network, so they can ping each other by IP (e.g., 192.168.0.236).
Then, on the PC at 192.168.0.91, I commented out all the code cells after the header “Launch the second worker” and ran the notebook. Everything ran successfully.
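For context, here is a minimal sketch of how the `TF_CONFIG` environment variable might be set on each machine for a two-worker setup like this. The IPs are the ones from this post; the port 12345 and the worker indices are assumptions, so adapt them to what the lab actually uses:

```python
import json
import os

# Hypothetical two-worker cluster: the "cluster" section must be identical
# on both machines; only "index" differs (0 on the chief, 1 on the other PC).
tf_config = {
    "cluster": {
        "worker": ["192.168.0.236:12345", "192.168.0.91:12345"]
    },
    "task": {"type": "worker", "index": 0},  # set index to 1 on the second PC
}

# MultiWorkerMirroredStrategy reads TF_CONFIG when the strategy is created,
# so this must be set before constructing the strategy.
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```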
At the very top, I see that the connection is in the SYN_SENT state rather than the ESTABLISHED state.
That is to be expected if the workers are on different devices and one of them hasn’t been started yet.
Can you run that cell while both workers are running and check whether the connections ever reach the ESTABLISHED state?
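One quick way to check reachability outside the notebook is a plain TCP connect test. A small sketch; the address and port here are hypothetical, so substitute whatever your `TF_CONFIG` lists:

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example: probe the other worker's gRPC port (hypothetical address/port).
print(port_open("192.168.0.91", 12345))
```

If this prints `False` while the other worker is running, the problem is networking (firewall, WSL port forwarding, wrong IP), not TensorFlow.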
You mentioned using WSL.
You might have to set up port forwarding in WSL.
Here is a Stack Overflow post from someone setting up a web server, but the idea behind the fix is the same.
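For reference, the usual fix is a `netsh` port proxy on the Windows side. This is a sketch, assuming the lab's worker ports are 12345 and 23456 and that `<WSL_IP>` is the address reported by `wsl hostname -I`; run it in an elevated PowerShell, and note you may also need matching Windows Firewall inbound rules:

```shell
# Forward the worker ports from Windows to the WSL instance.
# Replace <WSL_IP> with the IP printed by `wsl hostname -I`.
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=12345 connectaddress=<WSL_IP> connectport=12345
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=23456 connectaddress=<WSL_IP> connectport=23456

# Verify the proxy rules:
netsh interface portproxy show v4tov4
```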
```python
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = make_or_restore_model()  # Restore from the checkpoint saved by the chief.

results = model.evaluate(val_dataset)
# Then, log the results on a shared location, write TensorBoard logs, etc.
```
Isn’t that something we need? We don’t have it in the lab at all.
```python
# Create and compile the model following the distributed strategy
with strategy.scope():
    multi_worker_model = mnist.build_and_compile_cnn_model()

# Train the model
multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)
```
Have you fixed the networking between your devices?
The two nodes don’t seem to be able to talk to each other.
You were absolutely right! I had to forward ports 12345 and 23456 on both machines from Windows to WSL. Just one problem now: distributed training takes about 100 times longer than training locally on one machine.
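That slowdown is plausible: every training step now performs an all-reduce of the gradients over the LAN, and for a small model the network transfer easily dominates the compute. A rough back-of-envelope sketch; the parameter count, effective link speed, and the factor-of-two ring all-reduce cost are all assumptions, not measurements from this lab:

```python
# Back-of-envelope: network cost of one gradient all-reduce between 2 workers.
# Assumptions (not measured): ~1.2M float32 parameters, 100 Mbit/s effective
# link between the PCs, ring all-reduce moving ~2x the gradient bytes.
params = 1_200_000
bytes_per_param = 4                 # float32 gradients
link_bytes_per_s = 100e6 / 8        # 100 Mbit/s expressed in bytes/s

grad_bytes = params * bytes_per_param
comm_s_per_step = 2 * grad_bytes / link_bytes_per_s

print(f"gradient size: {grad_bytes / 1e6:.1f} MB")
print(f"network transfer per step: ~{comm_s_per_step:.2f} s")
```

If a local training step takes only a few milliseconds, close to a second of transfer per step is enough to explain a ~100x slowdown; a faster link, a larger model per step of communication, or bigger per-worker batches all shrink that ratio.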