The following outcome was obtained from the notebook for lesson 6. What could be the potential reason for the increased latency, instead of seeing an improvement, when using torch.index_select
to collect all necessary LoRA matrices for the entire batch in one go? It appears that there are no benefits from parallel processing in this case.