When we put multiple inputs to an LLM in a batch, does the model process all the inputs at the same time? How does the model handle concurrency?
1. Batching vs. Concurrency
- Batching: When you send multiple prompts together (a batch), the model processes them in parallel inside a single forward pass through the neural network. Each prompt is treated as an independent sequence, but they all share the same model weights (see the minimal sketch after this list). Batching is mostly a performance optimization: GPUs/TPUs are great at parallel matrix operations.
- Concurrency: Refers to multiple independent requests arriving at the same time (from different users, apps, etc.). Concurrency is usually handled by the serving infrastructure, not by the model itself. The system might group requests into batches under the hood, or run them across multiple GPUs.
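A minimal sketch of what "one forward pass over a batch" means, assuming PyTorch and using a single transformer encoder layer as a stand-in for a real LLM: three sequences are stacked along the batch dimension and go through the same weights in one call.

```python
# Sketch only: a tiny stand-in layer, not a real LLM.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

# Pretend embeddings for 3 independent sequences, each 10 tokens long.
batch = torch.randn(3, 10, 64)   # shape: (batch_size, seq_len, hidden_dim)

out = layer(batch)               # one forward pass covers all 3 sequences
print(out.shape)                 # torch.Size([3, 10, 64])
```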
2. How the Model Handles a Batch
Imagine you have 3 prompts:
["Translate: Hello", "Summarize: long text...", "Write a poem about cats"]
The model doesn't merge them. Instead, it builds a tensor (matrix) in which each prompt gets its own row and is padded to the length of the longest prompt, and the transformer layers then process all the rows together in one go.
- The prompts don't "interfere" with each other.
- Each prompt has its own attention mask, so tokens only attend to their own sequence (see the sketch below).
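Here is a sketch of the padding and attention-mask mechanics, assuming the Hugging Face transformers library and using GPT-2 purely as a small stand-in model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Translate: Hello", "Summarize: long text...", "Write a poem about cats"]

# Each prompt becomes one row; shorter prompts are padded to the longest length.
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
print(inputs["input_ids"].shape)    # (3, max_len) -- one row per prompt
print(inputs["attention_mask"])     # 1 = real token, 0 = padding to be ignored

outputs = model(**inputs)           # a single forward pass over the whole batch
print(outputs.logits.shape)         # (3, max_len, vocab_size)
```

Note that rows in a batch never attend across each other; the attention mask only marks padding positions so they don't influence the real tokens.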
3. Concurrency Behind the Scenes
- If many users hit the API at once, the serving system queues requests and may dynamically batch them together to maximize GPU utilization (a toy version of this loop is sketched after this list).
- This makes inference faster and cheaper, but to the user it still feels like their request was handled individually.
- Concurrency control (like avoiding conflicts, timeouts, retries) happens at the infrastructure level, not inside the transformer math.
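A toy sketch of dynamic batching, purely as an illustration (not any particular serving framework): requests that arrive within a short window are grouped into one batch, run through the model together, and each caller still receives only its own result.

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.01

def fake_model_forward(prompts):
    # Placeholder for a real batched forward pass (e.g. the padded-tensor example above).
    return [f"output for: {p}" for p in prompts]

async def handle_request(queue: asyncio.Queue, prompt: str) -> str:
    """One user request: enqueue the prompt and wait for its own result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def batching_loop(queue: asyncio.Queue):
    """Group requests that arrive within a short window and run them as one batch."""
    while True:
        batch = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = fake_model_forward([prompt for prompt, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)   # each caller gets back only its own output

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batching_loop(queue))
    replies = await asyncio.gather(*(handle_request(queue, f"prompt {i}") for i in range(5)))
    print(replies)
    worker.cancel()

asyncio.run(main())
```

Production servers implement far more sophisticated versions of this idea, but the pattern is the same: group requests, run one batched forward pass, and fan the results back out to the individual callers.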