When we put multiple inputs to an LLM in a batch, does the model process all the inputs at the same time? How does the model handle concurrency?
1. Batching vs. Concurrency
- Batching: When you send multiple prompts together (a batch), the model processes them in parallel inside a single forward pass through the neural network. Each prompt is treated as an independent sequence, but they all share the same model weights (see the minimal sketch after this list). Batching is mostly a performance optimization: GPUs/TPUs are great at parallel matrix operations.
- Concurrency: Refers to multiple independent requests arriving at the same time (from different users, apps, etc.). Concurrency is usually handled by the serving infrastructure, not by the model itself. The system might group requests into batches under the hood, or run them across multiple GPUs.
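A minimal sketch of what "one forward pass over a batch" means, assuming PyTorch and using a single transformer encoder layer as a stand-in for a real LLM: three sequences are stacked along the batch dimension and go through the same weights in one call.

```python
# Sketch only: a tiny stand-in layer, not a real LLM.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

# Pretend embeddings for 3 independent sequences, each 10 tokens long.
batch = torch.randn(3, 10, 64)   # shape: (batch_size, seq_len, hidden_dim)

out = layer(batch)               # one forward pass covers all 3 sequences
print(out.shape)                 # torch.Size([3, 10, 64])
```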
2. How the Model Handles a Batch
Imagine you have 3 prompts:
["Translate: Hello", "Summarize: long text...", "Write a poem about cats"]
The model doesn't merge them. Instead, it builds a tensor (matrix) in which each prompt gets its own row and is padded to the length of the longest prompt, and the transformer layers then process all the rows together in one go.
- The prompts don't "interfere" with each other.
- Each prompt has its own attention mask, so tokens only attend to their own sequence (see the sketch below).
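Here is a sketch of the padding and attention-mask mechanics, assuming the Hugging Face transformers library and using GPT-2 purely as a small stand-in model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Translate: Hello", "Summarize: long text...", "Write a poem about cats"]

# Each prompt becomes one row; shorter prompts are padded to the longest length.
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
print(inputs["input_ids"].shape)    # (3, max_len) -- one row per prompt
print(inputs["attention_mask"])     # 1 = real token, 0 = padding to be ignored

outputs = model(**inputs)           # a single forward pass over the whole batch
print(outputs.logits.shape)         # (3, max_len, vocab_size)
```

Note that rows in a batch never attend across each other; the attention mask only marks padding positions so they don't influence the real tokens.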
3. Concurrency Behind the Scenes
- If many users hit the API at once, the serving system queues requests and may dynamically batch them together to maximize GPU utilization (a toy version of this loop is sketched after this list).
- This makes inference faster and cheaper, but to the user it still feels like their request was handled individually.
- Concurrency control (like avoiding conflicts, timeouts, retries) happens at the infrastructure level, not inside the transformer math.
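A toy sketch of dynamic batching, purely as an illustration (not any particular serving framework): requests that arrive within a short window are grouped into one batch, run through the model together, and each caller still receives only its own result.

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.01

def fake_model_forward(prompts):
    # Placeholder for a real batched forward pass (e.g. the padded-tensor example above).
    return [f"output for: {p}" for p in prompts]

async def handle_request(queue: asyncio.Queue, prompt: str) -> str:
    """One user request: enqueue the prompt and wait for its own result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def batching_loop(queue: asyncio.Queue):
    """Group requests that arrive within a short window and run them as one batch."""
    while True:
        batch = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = fake_model_forward([prompt for prompt, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)   # each caller gets back only its own output

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batching_loop(queue))
    replies = await asyncio.gather(*(handle_request(queue, f"prompt {i}") for i in range(5)))
    print(replies)
    worker.cancel()

asyncio.run(main())
```

Production servers implement far more sophisticated versions of this idea, but the pattern is the same: group requests, run one batched forward pass, and fan the results back out to the individual callers.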