FSDP Model Sharding: Where does Synchronization take place?

If the model is too big to fit on a single GPU, where does the synchronization process take place? Is it carried out on yet another GPU node, or does it take place on the CPU?

When running in parallel on GPUs (across either cores or other available GPUs), it's generally much harder to run the same operation in parallel if the processing units are different.