Distributed training doesn't work on 2 machines

I downloaded this lab, C3W3_Colab_Lab1_Distributed_Training.ipynb, to my 2 machines.
I connected the 2 PCs to a local network, which means I can ping each of them from the other by IP, e.g. 192.168.0.236.

And I changed this section in the same way on both PCs.
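Roughly like this (just a sketch of that section; the IPs and ports below are from my setup, and which port goes with which machine is only illustrative):

# Sketch of the TF_CONFIG setup, following the lab's json/os pattern.
# 'index' stays 0 on the chief machine and is set to 1 on the other one.
import json
import os

tf_config = {
    'cluster': {
        'worker': ['192.168.0.236:12345', '192.168.0.91:23456']
    },
    'task': {'type': 'worker', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)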

After that, on the PC with IP 192.168.0.91, I just commented out all the code after the header “Launch the second worker” and ran it. Everything ran successfully.


After that, on the second PC, I commented out everything from “Launch the first worker”.

After that, when running all the cells, I see the main.py script executing without ever stopping

And nothing else appears. What do you think the problem could be?

Hi @someone555777 !

You need to set the worker index to 1 on the other machine, whichever one is not the chief device.

See the keras docs here: Keras Guide Distributed Training

Hope this helps!

Sam

I did that. Anyway, as you can see in the last screenshot, the lab already has this line for the second machine:
tf_config['task']['index'] = 1

Ah my mistake.

Looking at the output of the bash cells, it looks like there is a CUDA error.
Can you check if the script / training runs for each machine by itself?

We set os.environ["CUDA_VISIBLE_DEVICES"] = "-1". I think that should be fine.
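A quick way to confirm TensorFlow really sees no GPU (assuming the variable is set before TensorFlow is imported) would be something like:

# Hide the GPU, then check which physical GPUs TensorFlow can see.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # expect an empty list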

Yes, they each run separately if nothing is changed in the lab.

I thought the problem might be that I run TensorFlow inside WSL on both machines. But it looks like ping works fine from WSL too.

At the very top, I see that the device is in the SYN_SENT state and not the ESTABLISHED state.
This is to be expected if they are on different devices and one hasn’t been started.

Can you run that cell when both are running and see if they ever reach the ESTABLISHED state?
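I don’t remember the exact cell, but something along these lines (run inside WSL on each machine, or as a bash cell in the notebook) would show the socket states for the lab’s default ports:

# List TCP sockets touching the worker ports and show their state (LISTEN, SYN_SENT, ESTABLISHED, ...)
netstat -tnap 2>/dev/null | grep -E '12345|23456'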

Yes, I have never seen both of them in the ESTABLISHED state. Usually one is in LISTEN and the other is in SYN_SENT, on both PCs.

Can you confirm that both devices are listening for traffic on port 12345 and both devices are sending to port 12345?

Sorry, I misspoke: for each device, confirm that the other one is sending data to the port this device is listening on.
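If it is easier, a plain Python socket test from each machine toward the other should tell you whether the port is reachable at all (the IP and port below are placeholders; use the peer’s address and the port from its TF_CONFIG entry):

# Minimal reachability check: can this machine open a TCP connection to the peer worker?
import socket

def can_connect(host, port, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as err:
        print(f"Could not connect to {host}:{port}: {err}")
        return False

print(can_connect("192.168.0.91", 12345))  # and the reverse direction from the other PC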

At this point I’m not entirely sure what the issue is.

Best I can do is give some resources:
StackOverflow
Tensorflow Distributed TF Config Docs
Tensorflow Ecosystem Github Examples

Sorry, my bad. This is what I see on the main device

and this is the secondary device

Is 192.168.0.91 LAPTOP-BFC4E44F?

If yes, can you check if your network is blocking this type of traffic?

If 192.168.0.91 is the DESKTOP device listening on port 23456, can you set the port to 12345 and see what happens?

yes

I don’t see any firewall events in Event Viewer. Or how would you like me to check?

it is not

I see,

You mentioned using WSL.
You might have to do port forwarding in WSL.
Here is a Stack Overflow post about someone setting up a web server, but the idea behind the fix is the same.
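The general idea (adapted for your ports, not the exact commands from that post) is to add a portproxy rule on the Windows side of each machine from an elevated prompt; the WSL address is a placeholder you would look up with wsl hostname -I:

:: Forward traffic that reaches Windows on port 12345 into the WSL VM
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=12345 connectaddress=<WSL-IP> connectport=12345

:: Allow the port through Windows Defender Firewall
netsh advfirewall firewall add rule name="TF worker 12345" dir=in action=allow protocol=TCP localport=12345

You would need the same pair of rules for port 23456 on the machine that uses it, and netsh interface portproxy show all lists the rules you have added.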

Can you help me with what command I should type, for example on 192.168.0.91 LAPTOP-BFC4E44F?

And just a little remark: the commands that I showed you 4 messages above were run from the WSL environment.

I’ve found this code in the tutorial:

On the evaluator:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = make_or_restore_model()  # Restore from the checkpoint saved by the chief.

results = model.evaluate(val_dataset)
# Then, log the results on a shared location, write TensorBoard logs, etc

Isn’t that something we need? We don’t have it in the lab at all.

The equivalent code in the lab would be:

# Create and compile model following the distributed strategy
with strategy.scope():
  multi_worker_model = mnist.build_and_compile_cnn_model()

# Train the model
multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)

Have you fixed the networking between your devices?
The two nodes don’t seem to be able to talk with each other.


You were absolutely right! I needed to forward ports 12345 and 23456 from Windows to WSL on both machines. Just one problem now: distributed training takes about 100 times longer than training locally on one machine :sweat_smile: