GPU Overhead in

The test actually shows that the smallest model has the worst throughput.

Throughput comparison:

   qwen-0.5b (0.5B):   10.6 tok/s | 100 tokens in 9.45s
✓ Unloaded Qwen/Qwen2.5-0.5B-Instruct
Loading Qwen/Qwen2.5-1.5B-Instruct...
✓ Loaded Qwen/Qwen2.5-1.5B-Instruct
   qwen-1.5b (1.5B):   66.5 tok/s | 100 tokens in 1.50s
✓ Unloaded Qwen/Qwen2.5-1.5B-Instruct
Loading Qwen/Qwen2.5-3B-Instruct...
✓ Loaded Qwen/Qwen2.5-3B-Instruct
     qwen-3b (  3B):   53.4 tok/s | 100 tokens in 1.87s
✓ Unloaded Qwen/Qwen2.5-3B-Instruct
Loading Qwen/Qwen2.5-7B-Instruct...
✓ Loaded Qwen/Qwen2.5-7B-Instruct
     qwen-7b (  7B):   67.7 tok/s | 100 tokens in 1.48s
✓ Unloaded Qwen/Qwen2.5-7B-Instruct
Loading Qwen/Qwen2.5-0.5B-Instruct...
✓ Loaded Qwen/Qwen2.5-0.5B-Instruct

That’s the opposite of what should happen: the smallest model should decode fastest, not slowest. The 0.5B model is the first one executed in the loop, so it absorbed all the cold-start costs: CUDA context init, lazy kernel compilation, cuBLAS/cuDNN autotuning, allocator warmup, HF from_pretrained finalization. Notice the timing: 9.45s for 100 tokens vs 1.5–1.9s for the others. That’s roughly 8s of one-time overhead, not steady-state decode time.
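One way to keep that one-time cost out of the numbers is to run an untimed warmup generation before starting the clock. The following is only a minimal sketch of that idea, not the benchmark's actual code: `measure_throughput` and `make_fake_generate` are hypothetical names, and the fake generator simply sleeps longer on its first call to imitate a cold start. On a real GPU run you would pass a function that calls the model's generate, plus something like `torch.cuda.synchronize` as `sync`, so queued kernels finish before the clock is read.

```python
import time
from typing import Callable, Optional


def measure_throughput(
    generate: Callable[[], int],
    warmup_runs: int = 1,
    timed_runs: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return tokens/second over the timed runs, excluding warmup runs.

    `generate` runs one generation and returns the number of tokens produced.
    `sync` is an optional barrier (e.g. torch.cuda.synchronize on GPU).
    """
    # Untimed warmup: pays for context init, kernel compilation, etc.
    for _ in range(warmup_runs):
        generate()
    if sync is not None:
        sync()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(timed_runs):
        total_tokens += generate()
    if sync is not None:
        sync()  # make sure all queued work is done before stopping the clock
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def make_fake_generate(cold_cost: float = 0.2, steady_cost: float = 0.01,
                       tokens: int = 100) -> Callable[[], int]:
    """Hypothetical stand-in for a model: the first call is slow, later calls are fast."""
    state = {"first": True}

    def generate() -> int:
        time.sleep(cold_cost if state["first"] else steady_cost)
        state["first"] = False
        return tokens

    return generate


if __name__ == "__main__":
    without = measure_throughput(make_fake_generate(), warmup_runs=0, timed_runs=2)
    with_warmup = measure_throughput(make_fake_generate(), warmup_runs=1, timed_runs=2)
    print(f"no warmup:   {without:8.1f} tok/s")
    print(f"with warmup: {with_warmup:8.1f} tok/s")
```

With the warmup in place, the first model in the loop is measured under the same conditions as the rest. Averaging over a few timed runs also smooths out run-to-run noise.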

Hi Lijie_Tu,

Nice catch. To resolve this, a dummy model should be loaded and run once first so that the initialization has no effect on the steady-state decode time. I’ll pass this on.

Thanks!