In C2_M1_Lab_4_Efficiency.ipynb, we measure inference time (the time spent in model(single_image_batch)) on only the first image of test_dataset, using the code below.
For simple classification tasks this is fine, but for tasks like instance segmentation it yields an incomplete measure. I wanted to share my experience with this, in the hope that it helps someone.
For a real-world, production-facing instance segmentation application, I had to do this step differently. Instead of inference time alone, I measured prediction time: the total time of preprocessing, inference, and postprocessing. And because inference time can be proportional to the number of objects in an image, I measured prediction time over the entire test dataset rather than on a single image.
This is from C2_M1_Lab_4_Efficiency.ipynb:
import time

import torch

def measure_inference_time(model, input_data, num_iterations=100):
    """Measures the average inference time of a model for a given input.

    Args:
        model: The PyTorch model (nn.Module) to be benchmarked.
        input_data: A sample input tensor for the model.
        num_iterations: The number of times to run inference for averaging.

    Returns:
        The average inference time in milliseconds (ms).
    """
    model.eval()
    device = next(model.parameters()).device
    input_data = input_data.to(device)
    # Warm-up runs so one-time setup costs don't skew the measurement.
    with torch.no_grad():
        for _ in range(10):
            _ = model(input_data)
    start_time = time.time()
    with torch.no_grad():
        for _ in range(num_iterations):
            _ = model(input_data)
    end_time = time.time()
    avg_time = (end_time - start_time) / num_iterations
    return avg_time * 1000
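As a side note on the timing loop itself: time.perf_counter() is a monotonic, higher-resolution clock than time.time(), so it is the better choice for short intervals. A minimal stand-alone sketch of the same warm-up-then-average pattern with that swap (the fn callable here is a hypothetical stand-in for model(input_data)):

```python
import time

def timed_average_ms(fn, num_iterations=100, warmup=10):
    """Average wall-clock time of fn() in milliseconds.

    Uses time.perf_counter(), which is monotonic and higher-resolution
    than time.time(). Note: when timing a GPU model you would also need
    to call torch.cuda.synchronize() before each clock read, because
    CUDA kernel launches return asynchronously.
    """
    # Warm-up runs so one-time setup costs don't skew the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(num_iterations):
        fn()
    return (time.perf_counter() - start) / num_iterations * 1000
```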
# Create an iterator to get a single batch for inference time measurement.
data_iter = iter(test_loader)
batch = next(data_iter)
# Extract the input tensors from the batch.
inputs = batch[0]
# Create a single-sample batch and move it to the correct device.
sample_input = inputs[:1].to(device)
# Measure the average inference time using the sample input.
inf_time = measure_inference_time(model, sample_input)
Problem:
I compared two instance segmentation models: YOLOv8Seg and a transformer-based instance segmentation model that I was developing. Both were tested on the CPU using ONNX. For testing/deployment, both models go through the same prediction steps: Preprocessing → Inference → Postprocessing, where inference is equivalent to model(image).
Going by inference time alone, YOLOv8 was faster. However, YOLOv8 doesn't produce the full instance segmentation masks as inference output. It produces encoded (proto) masks that later have to be decoded and then resized back to the original input image size. All of that work happens in the postprocessing block, which makes the postprocessing time quite significant. Further, I noticed that both the inference and postprocessing times grow with the number of objects present in the test image.
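To make it concrete why that postprocessing is expensive: in YOLOv8-style heads, each detection carries a small coefficient vector, and the final mask is a linear combination of shared prototype masks, passed through a sigmoid and upsampled. A simplified NumPy sketch of that decode step (shapes and the nearest-neighbor resize are my assumptions; the real YOLOv8 postprocessing also crops each mask to its bounding box and uses bilinear resizing):

```python
import numpy as np

def decode_proto_masks(mask_coeffs, protos, orig_h, orig_w, threshold=0.5):
    """Decode per-detection mask coefficients against shared prototype masks.

    mask_coeffs: (num_detections, num_protos) coefficients from the head.
    protos: (num_protos, mh, mw) prototype masks from the model output.
    Returns boolean masks resized (nearest-neighbor here) to (orig_h, orig_w).
    """
    num_protos, mh, mw = protos.shape
    # Linear combination of prototypes, then sigmoid -> (N, mh, mw).
    logits = mask_coeffs @ protos.reshape(num_protos, -1)
    masks = 1.0 / (1.0 + np.exp(-logits))
    masks = masks.reshape(-1, mh, mw)
    # Nearest-neighbor upsample to the original image size. This per-detection
    # work is why postprocessing cost grows with the object count.
    rows = np.arange(orig_h) * mh // orig_h
    cols = np.arange(orig_w) * mw // orig_w
    resized = masks[:, rows][:, :, cols]
    return resized > threshold
```

The decode is a matrix multiply plus a per-detection resize, so its cost scales roughly linearly with the number of detections — which matches the timing behavior described above.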
Solution:
If I only measure inference time, I am not accounting for postprocessing time, and without postprocessing the model's inference output is unusable. The fix was to measure prediction time = preprocessing + inference + postprocessing time.
If I only measure prediction time on the first image of the test dataset, that does not represent what I will face in production. Both inference and postprocessing times grow with the number of detectable objects in the input image, so the measured prediction time will be wildly different depending on whether that first image contains 1 or 100 detectable objects. The fix is to run prediction over the entire test_dataset a few times and report the average prediction time per image.
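A minimal sketch of that dataset-wide measurement. The preprocess, model, and postprocess callables are hypothetical stand-ins for the real pipeline stages, and dataset is any iterable of raw images:

```python
import time

def measure_prediction_time(preprocess, model, postprocess, dataset, num_passes=3):
    """Average end-to-end prediction time in ms per image over a dataset.

    Times the full Preprocessing -> Inference -> Postprocessing pipeline
    for every image, repeats num_passes times, and averages, so images
    with many and few objects both contribute to the result.
    """
    total_time = 0.0
    total_images = 0
    for _ in range(num_passes):
        for image in dataset:
            start = time.perf_counter()
            x = preprocess(image)
            raw_output = model(x)
            _ = postprocess(raw_output)
            total_time += time.perf_counter() - start
            total_images += 1
    return (total_time / total_images) * 1000
```

Averaging over the whole test set (and over a few passes) smooths out both per-image object-count variance and timer noise.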
Time Measurement Plots:
I tested on 3 images with varying object counts to make the dummy plots below. The conclusion: if the target application usually has 60+ detectable objects in the scene, YOLOv8 is not the best choice.

[Plot images not included: time measurements for the two models on the three test images.]