Distributed training across a few machines

Thank you very much for the pretty demonstrative lab C3W3_Colab_Lab1_Distributed_Training.ipynb! Can you now give me a good link on how to connect my local machines into one network so I can run these operations on the GPU of each of them? It would be really nice if I could still use a VPN on each of them, for example.

Hello @someone555777
Welcome! I recommend you explore these additional resources for setting up a multi-node GPU cluster:

  1. Distributed and Parallel Training Tutorials — PyTorch Tutorials 2.0.1+cu117 documentation
  2. Multi-Node Deep Learning Training with TensorFlow - NVIDIA Docs
  3. Autoscaling NVIDIA GPUs on Red Hat OpenShift

Regards
Isaak

I prefer TensorFlow. And where do they cover connecting PCs over the web? As far as I can see, all the tutorials describe the process assuming that the machines are already connected to each other. Maybe you can show me where it is described and I just missed it?

Hello @someone555777
If you’re looking to connect your machines, it’s good to consider broader networking and infrastructure topics. I don’t have a specific link for that, since the needs vary depending on your infrastructure, but here are a few considerations:

  1. Networking
  2. File System
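
For the networking point, one quick sanity check is whether each machine can open a TCP connection to the port the other workers will listen on. Below is a minimal sketch using only the Python standard library; the `can_reach` helper, the address and the port are placeholders for illustration and do not come from the resources above.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


# Example call with a hypothetical address of another worker:
#   can_reach("192.168.1.11", 12345)
```

If this returns False between two of your machines, the likely suspects are a firewall, NAT, or a VPN route, and the training job will probably not be able to connect either.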

Hope this gives you an idea.
Regards
Isaak

OK, thank you. So, do I understand correctly that my PCs must be connected on the same local network? Can I use a VPN after that? Or can they be connected not on a local network but more widely, over the web? Maybe some apps exist for this?

@someone555777 Welcome

From my understanding, it’s not mandatory. Distributed training can work across different networks, including the internet, with appropriate networking configuration. You can then create a VPN connection to allow your machines to communicate.
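
Since you prefer TensorFlow: as far as I know, multi-worker training there is configured through the `TF_CONFIG` environment variable, which lists the address of every worker and this machine's own index. Here is a rough sketch, assuming two machines that can already reach each other on the listed addresses (for example over a VPN). The IP addresses, the port and the `build_tf_config` helper are made up for illustration.

```python
import json
import os


def build_tf_config(workers, index):
    """Build the TF_CONFIG JSON string for one worker in the cluster."""
    if not 0 <= index < len(workers):
        raise ValueError("index must refer to an entry in workers")
    return json.dumps({
        "cluster": {"worker": list(workers)},
        "task": {"type": "worker", "index": index},
    })


# Hypothetical addresses: every machine uses the same list, in the same order.
WORKERS = ["10.8.0.1:12345", "10.8.0.2:12345"]

# Use index 0 on the first machine and 1 on the second.
os.environ["TF_CONFIG"] = build_tf_config(WORKERS, index=0)

# TF_CONFIG has to be set before the strategy is created:
#   import tensorflow as tf
#   strategy = tf.distribute.MultiWorkerMirroredStrategy()
#   with strategy.scope():
#       model = ...  # build and compile the Keras model here
```

The same script is then started on every machine, each with its own index.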

Hope this helps.

Happy Learning
Isaak