Gathered latency is no better than loop latency in Lesson 6

I got the following result from the Lesson 6 notebook. Why would latency increase, rather than improve, when torch.index_select is used to gather all the LoRA matrices needed for the entire batch in one go? There seems to be no benefit from parallel processing in this case.
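
For reference, here is a minimal sketch of the two approaches being compared. It is not the notebook's actual code: the tensor sizes and the names `lora_loop`, `lora_gathered`, and `avg_latency` are my own assumptions for illustration. It only shows that both variants compute the same output and how each could be timed on CPU.

```python
import time
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration (not the notebook's values).
batch_size, hidden, rank, num_loras = 8, 64, 4, 16

x = torch.randn(batch_size, hidden)
loras_a = torch.randn(num_loras, hidden, rank)
loras_b = torch.randn(num_loras, rank, hidden)
# One LoRA index per request in the batch.
lora_indices = torch.randint(num_loras, (batch_size,))


def lora_loop(x, loras_a, loras_b, lora_indices):
    """Apply each request's LoRA pair one row at a time."""
    y = torch.zeros_like(x)
    for i, idx in enumerate(lora_indices.tolist()):
        y[i] = x[i] @ loras_a[idx] @ loras_b[idx]
    return y


def lora_gathered(x, loras_a, loras_b, lora_indices):
    """Gather every needed LoRA pair first, then do one batched matmul."""
    # index_select returns new tensors, so the selected matrices are copied.
    a = torch.index_select(loras_a, 0, lora_indices)  # (batch, hidden, rank)
    b = torch.index_select(loras_b, 0, lora_indices)  # (batch, rank, hidden)
    # (batch, 1, hidden) @ (batch, hidden, rank) @ (batch, rank, hidden)
    return (x.unsqueeze(1) @ a @ b).squeeze(1)


def avg_latency(fn, *args, repeats=50):
    """Average wall-clock seconds per call, after one warm-up call."""
    fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


y_loop = lora_loop(x, loras_a, loras_b, lora_indices)
y_gathered = lora_gathered(x, loras_a, loras_b, lora_indices)

print("loop latency (s):    ", avg_latency(lora_loop, x, loras_a, loras_b, lora_indices))
print("gathered latency (s):", avg_latency(lora_gathered, x, loras_a, loras_b, lora_indices))
```

The timings depend on the hardware and on the sizes chosen, so I would not read much into the numbers from this sketch alone. If the tensors were on a GPU, the timing helper would also need a torch.cuda.synchronize() call before reading the clock.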