Online Batch Inference

In week 4 of course 4, we learn about online inference, and later in the same week there’s another talk on batch inference. According to the videos, online inference is useful when there’s an API on top of the ML model and a user is waiting for its prediction. In this case, inference is done with a batch of size 1. Batch inference, on the other hand, is done offline, and we can use accelerators to their fullest since we can batch the samples together and compute their outputs much faster.

There’s another scenario that is not discussed in the videos: online batch inference. To implement it, we can put a queue between the API and the model. Using this queue, we can group API requests together into a batch so we can benefit from an accelerator. For this design, I’m assuming:

  1. It is acceptable to delay the API’s response by a fraction of a second (maybe up to a second) since adding a queue in between has an overhead
  2. The API has considerable traffic and it could benefit from an accelerator
  3. The size of the batch is not always the same; the batch is formed from the requests that arrived in a period of time. If there are more requests than the max batch size that our accelerator can handle, the batch will be truncated to the max. This is a sign that we need more accelerators to provide an online service.
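
The design above can be sketched roughly as follows. This is only a minimal illustration using Python's `asyncio`, not something from the course: the names `MicroBatcher`, `predict_batch`, `max_batch_size`, `max_wait_s` and `fake_model` are all made up for the example, and `fake_model` stands in for a real batched call to an accelerator.

```python
import asyncio


class MicroBatcher:
    """Hypothetical helper: groups concurrent requests into one model call."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.05):
        self.predict_batch = predict_batch    # function: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size  # assumption 3: cap set by the accelerator
        self.max_wait_s = max_wait_s          # assumption 1: acceptable extra delay
        self.queue = asyncio.Queue()

    async def predict(self, x):
        """Called by each API request; waits until its batch has been served."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        """Worker loop: form a batch, run the model once, hand back results."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [x for x, _ in batch]
            try:
                # Synchronous call for simplicity; it blocks the event loop.
                outputs = self.predict_batch(inputs)
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for (_, fut), y in zip(batch, outputs):
                fut.set_result(y)


def fake_model(inputs):
    # Stand-in for a batched accelerator call (illustrative only).
    return [2 * x for x in inputs]


async def demo():
    batch_sizes = []

    def predict_batch(inputs):
        batch_sizes.append(len(inputs))
        return fake_model(inputs)

    batcher = MicroBatcher(predict_batch, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Ten "users" hit the API at the same time.
    results = await asyncio.gather(*(batcher.predict(i) for i in range(10)))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Requests beyond `max_batch_size` are not dropped here; they stay in the queue and go into the next batch, so a queue that never drains is the signal that more accelerators are needed.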

Now, my questions are:

Please let me know if there’s any problem with this design. If there isn’t, why was it not mentioned in the course?


Synchronous online and batch serving styles are in wide use.
You are welcome to buffer requests for asynchronous computation in your project. This is not widely used AFAIK. Stateless microservices are preferred over batching requests wherever possible.

My question is, why? Why are they not widely used? Other than the delay of the queue, which in the case of high traffic could really be negligible, what is the downside of such an architecture? (With high traffic, the queue fills up fast and we don’t need to wait any longer.)
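
To put a rough number on that delay: assuming requests arrive at a steady rate (a simplification), the first request in a batch waits until the batch fills or a timeout fires, whichever comes first. The function name and parameters below are made up for this back-of-the-envelope sketch.

```python
def worst_case_queue_wait_s(batch_size, requests_per_s, max_wait_s):
    """Longest time the first request in a batch waits in the queue.

    Assumes evenly spaced arrivals; real traffic is burstier than this.
    """
    # After the first request, the batch needs (batch_size - 1) more arrivals.
    fill_time_s = (batch_size - 1) / requests_per_s
    return min(fill_time_s, max_wait_s)


if __name__ == "__main__":
    # High traffic: the batch fills long before the timeout.
    print(worst_case_queue_wait_s(batch_size=32, requests_per_s=1000, max_wait_s=0.5))
    # Low traffic: the timeout dominates.
    print(worst_case_queue_wait_s(batch_size=32, requests_per_s=10, max_wait_s=0.5))
```

Under these assumptions, a batch of 32 at 1000 requests per second fills in about 31 ms, while at 10 requests per second the full 0.5 s timeout is paid.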

Also, if they are not widely used, do you mean that inference servers are not using accelerators, or that the accelerators are working with batches of 1 and basically being wasted?

BTW, I didn’t get why using a queue in between makes it stateful. The way I see it, my design is still stateless!

Inference servers are configured with GPU(s) when the problem requires it.

A machine in a cloud can go offline for reasons like a power failure at that cloud location, a hard disk crash, or a network failure. When this happens, the worker instance is usually spun up on another physical machine. Unless you are willing to ignore pending requests, the new worker instance needs to know the state of the pending request queue. The request queue is therefore a stateful part of the service.
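
As a rough illustration of what "knowing the state of the pending request queue" means, here is a sketch where pending requests are written to a store before they are served, so a replacement worker can pick them up. `DurableRequestQueue` and its methods are hypothetical names, and the local SQLite file is only a stand-in: to survive the loss of a machine, the store would have to live somewhere other than the worker itself.

```python
import json
import sqlite3


class DurableRequestQueue:
    """Hypothetical persisted queue; a request stays stored until acknowledged."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def enqueue(self, payload):
        """Persist a request before it is served; returns its id."""
        cur = self.conn.execute(
            "INSERT INTO pending (payload) VALUES (?)", (json.dumps(payload),)
        )
        self.conn.commit()
        return cur.lastrowid

    def peek_batch(self, max_batch_size):
        """Read the oldest pending requests without removing them."""
        rows = self.conn.execute(
            "SELECT id, payload FROM pending ORDER BY id LIMIT ?", (max_batch_size,)
        ).fetchall()
        return [(rid, json.loads(payload)) for rid, payload in rows]

    def ack(self, ids):
        """Remove requests only after their predictions were delivered."""
        self.conn.executemany("DELETE FROM pending WHERE id = ?", [(i,) for i in ids])
        self.conn.commit()

    def close(self):
        self.conn.close()


def demo(path):
    q = DurableRequestQueue(path)
    for i in range(3):
        q.enqueue({"request": i})
    q.close()  # the worker disappears before serving anything

    replacement = DurableRequestQueue(path)  # new worker instance, same store
    batch = replacement.peek_batch(max_batch_size=2)
    replacement.ack([rid for rid, _ in batch])
    remaining = replacement.peek_batch(max_batch_size=10)
    replacement.close()
    return batch, remaining
```

The extra write per request and the bookkeeping around acknowledgements are where the cost and complexity of keeping this state show up.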

There’s nothing wrong with using your approach. You have to decide on the financial impact of persisting requests and the added complexity of notifying callers if their request has failed.