Does model sharding fully utilize all GPUs?

When an HF transformers model gets sharded across multiple GPUs because it is too large to fit into the VRAM of a single GPU, I notice that only one GPU is at 100% utilization at a time: each GPU takes its turn at 100% while the others sit at 0%.
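To make the pattern concrete, here is a toy sketch of the execution order I seem to be observing. It assumes the library assigns each GPU a contiguous slice of layers and runs a forward pass through the slices strictly in order (the function and its behavior are my own illustration, not the library's actual code):

```python
# Toy model of layer-wise sharding: each "GPU" owns a contiguous slice of
# layers, and a single forward pass visits the slices sequentially, so only
# one device is doing work at any given moment.

def active_gpu_timeline(num_layers: int, num_gpus: int) -> list[int]:
    """Return which GPU is active at each layer step of one forward pass."""
    layers_per_gpu = num_layers // num_gpus
    return [min(layer // layers_per_gpu, num_gpus - 1)
            for layer in range(num_layers)]

# 8 layers split across 4 GPUs: GPU 0 runs layers 0-1, GPU 1 runs 2-3, etc.
print(active_gpu_timeline(num_layers=8, num_gpus=4))
# → [0, 0, 1, 1, 2, 2, 3, 3]
```

At every step exactly one GPU is busy and the other three are idle, which matches the utilization I see in `nvidia-smi`.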

What sharding method does the transformers library use? Would ZeRO or FSDP help fully utilize all GPUs?