Distributed training doesn't work on 2 machines

I downloaded this lab, C3W3_Colab_Lab1_Distributed_Training.ipynb, to my 2 machines.
I connected the 2 PCs to a local network, which means I can ping each of them from the other by IP, e.g. 192.168.0.236.

And I changed this section in the same way on both PCs.
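Roughly like this (just a sketch of that section; the IPs and ports below are from my setup, and which port goes with which machine is only illustrative):

# Sketch of the TF_CONFIG setup, following the lab's json/os pattern.
# 'index' stays 0 on the chief machine and is set to 1 on the other one.
import json
import os

tf_config = {
    'cluster': {
        'worker': ['192.168.0.236:12345', '192.168.0.91:23456']
    },
    'task': {'type': 'worker', 'index': 0}
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)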

After that, on the PC with IP 192.168.0.91, I just commented out all the code after the header “Launch the second worker” and ran it. Everything ran successfully.


After that, on the second PC, I commented out everything from “Launch the first worker”.

After that, when running all the cells, I see the main.py script executing without ever stopping

And nothing else appears. What do you think the problem could be?

Hi @someone555777 !

You need to set the worker index to 1 on the other machine, whichever one is not the chief device.

See the keras docs here: Keras Guide Distributed Training

Hope this helps!

Sam

I did that. Anyway, as you can see in the last screenshot, the lab already has this line for the second machine:
tf_config['task']['index'] = 1

Ah my mistake.

Looking at the output of the bash cells, it looks like there is a CUDA error.
Can you check if the script / training runs for each machine by itself?

We set os.environ["CUDA_VISIBLE_DEVICES"] = "-1". I think that should be fine.
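A quick way to confirm TensorFlow really sees no GPU (assuming the variable is set before TensorFlow is imported) would be something like:

# Hide the GPU, then check which physical GPUs TensorFlow can see.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # expect an empty list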

Yes, they each run separately if nothing is changed in the lab.

I thought the problem might be that I run TensorFlow inside WSL on both machines. But it looks like ping works fine from WSL too.

At the very top, I see that the device is in the SYN_SENT state and not the ESTABLISHED state.
This is to be expected if they are on different devices and one hasn’t been started.

Can you run that cell when both are running and see if they ever reach the ESTABLISHED state?
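I don’t remember the exact cell, but something along these lines (run inside WSL on each machine, or as a bash cell in the notebook) would show the socket states for the lab’s default ports:

# List TCP sockets touching the worker ports and show their state (LISTEN, SYN_SENT, ESTABLISHED, ...)
netstat -tnap 2>/dev/null | grep -E '12345|23456'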

Yes, I have never seen both of them in the ESTABLISHED state. Usually one is in LISTEN and the other is in SYN_SENT, on both PCs.

Can you confirm that both devices are listening for traffic on port 12345 and both devices are sending to port 12345?

Sorry, I misspoke: for each device, confirm that the other one is sending data to the port this device is listening on.
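If it is easier, a plain Python socket test from each machine toward the other should tell you whether the port is reachable at all (the IP and port below are placeholders; use the peer’s address and the port from its TF_CONFIG entry):

# Minimal reachability check: can this machine open a TCP connection to the peer worker?
import socket

def can_connect(host, port, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as err:
        print(f"Could not connect to {host}:{port}: {err}")
        return False

print(can_connect("192.168.0.91", 12345))  # and the reverse direction from the other PC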

At this point I’m not entirely sure what the issue is.

Best I can do is give some resources:
StackOverflow
Tensorflow Distributed TF Config Docs
Tensorflow Ecosystem Github Examples

Sorry, my bad. This is what I see on the main device

and this is the secondary device

Is 192.168.0.91 LAPTOP-BFC4E44F?

If yes, can you check if your network is blocking this type of traffic?

If 192.168.0.91 is the DESKTOP device listening on port 23456, can you set the port to 12345 and see what happens?

yes

I don’t see any firewall events in Event Viewer. Or how would you like me to check?

it is not

I see,

You mentioned using WSL.
You might have to do port forwarding in WSL.
Here is a Stack Overflow post about someone setting up a web server, but the idea behind the fix is the same.
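The general idea (adapted for your ports, not the exact commands from that post) is to add a portproxy rule on the Windows side of each machine from an elevated prompt; the WSL address is a placeholder you would look up with wsl hostname -I:

:: Forward traffic that reaches Windows on port 12345 into the WSL VM
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=12345 connectaddress=<WSL-IP> connectport=12345

:: Allow the port through Windows Defender Firewall
netsh advfirewall firewall add rule name="TF worker 12345" dir=in action=allow protocol=TCP localport=12345

You would need the same pair of rules for port 23456 on the machine that uses it, and netsh interface portproxy show all lists the rules you have added.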

Can you help me with what command I should type, for example on 192.168.0.91 LAPTOP-BFC4E44F?

And just a little remark: the commands that I showed you 4 messages above were run from the WSL environment.

I’ve found this code in the tutorial:

On the evaluator:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = make_or_restore_model()  # Restore from the checkpoint saved by the chief.

results = model.evaluate(val_dataset)
# Then, log the results on a shared location, write TensorBoard logs, etc

Isn’t that something we need? We don’t have it in the lab at all.

The equivalent code in the lab would be:

# Create and compile model following the distributed strategy
with strategy.scope():
  multi_worker_model = mnist.build_and_compile_cnn_model()

# Train the model
multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)

Have you fixed the networking between your devices?
The two nodes don’t seem to be able to talk with each other.


You were absolutely right! I needed to forward ports 12345 and 23456 from Windows to WSL on both machines. Just one problem now: distributed training takes about 100 times longer than training locally on one machine :sweat_smile: