In week 4 of course 4, we learn about online inference, and later in the same week there’s another talk on batch inference. According to the videos, online inference is useful when there’s an API on top of the ML model and a user is waiting for its prediction; in that case, inference is done with a batch of size 1. Batch inference, on the other hand, is done offline, so we can use accelerators to their fullest by batching samples together and computing their outputs much faster.
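Just to restate the two modes from the videos in code form (the model, shapes, and numbers below are my own illustration, not from the course):

```python
import numpy as np

def model(batch):
    # Stand-in for any model's forward pass over a batch of inputs.
    return batch.sum(axis=1)

# Online inference: one request, one prediction, effectively a batch of size 1.
single_request = np.random.rand(1, 16)
print(model(single_request))            # -> one prediction

# Batch (offline) inference: many samples at once, so an accelerator
# can process them in a single pass instead of 1024 separate calls.
offline_batch = np.random.rand(1024, 16)
print(model(offline_batch).shape)       # -> (1024,)
```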
There’s another scenario that is not discussed in the videos: online batch inference. To implement it, we could put a queue between the API and the model. Using this queue, we can group incoming API requests together into a batch and benefit from an accelerator (see the sketch after the list below). For this design, I’m assuming:
- It is acceptable to delay the API’s response by a fraction of a second (maybe up to a second), since adding a queue in between has an overhead
- The API receives considerable traffic, so it could benefit from an accelerator
- The size of the batch is not always the same; the batch is formed from the requests that arrive within a period of time. If more requests arrive than the maximum batch size our accelerator can handle, the batch is truncated to that maximum. This is a sign that we need more accelerators to keep providing an online service.
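Here is a rough sketch of what I have in mind, assuming a threaded API server; all names (`MAX_BATCH_SIZE`, `MAX_WAIT_SECONDS`, `fake_model`, `predict`) are my own placeholders, not anything from the course:

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH_SIZE = 32       # assumed accelerator limit
MAX_WAIT_SECONDS = 0.1    # acceptable added latency (first assumption above)

request_queue = queue.Queue()

def fake_model(batch):
    # Stand-in for the real model: one forward pass over the whole batch.
    return [x * 2 for x in batch]

def batching_worker():
    while True:
        items = [request_queue.get()]              # block until at least one request
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(items) < MAX_BATCH_SIZE:         # collect more until full or timed out
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                items.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [x for x, _ in items]
        outputs = fake_model(inputs)               # one accelerator call for the batch
        for (_, slot), out in zip(items, outputs):
            slot["output"] = out
            slot["event"].set()                    # wake up the waiting API handler

def predict(x):
    # Called by an API handler for a single request; blocks until its result is ready.
    slot = {"event": threading.Event(), "output": None}
    request_queue.put((x, slot))
    slot["event"].wait()
    return slot["output"]

threading.Thread(target=batching_worker, daemon=True).start()

# Simulate concurrent API requests: they arrive within the wait window,
# so they are served from a single batched model call.
with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(predict, range(5))))
```

From the caller’s point of view this still looks like online inference (one request, one response), while the accelerator sees batches whose size depends on the traffic during the wait window.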
Now, my questions are:
Please let me know if there’s any problem with this design. If there isn’t, why wasn’t it mentioned in the course?