The test actually shows the smallest model with the worst throughput:
Throughput comparison:
qwen-0.5b (0.5B): 10.6 tok/s | 100 tokens in 9.45s
✓ Unloaded Qwen/Qwen2.5-0.5B-Instruct
Loading Qwen/Qwen2.5-1.5B-Instruct...
✓ Loaded Qwen/Qwen2.5-1.5B-Instruct
qwen-1.5b (1.5B): 66.5 tok/s | 100 tokens in 1.50s
✓ Unloaded Qwen/Qwen2.5-1.5B-Instruct
Loading Qwen/Qwen2.5-3B-Instruct...
✓ Loaded Qwen/Qwen2.5-3B-Instruct
qwen-3b ( 3B): 53.4 tok/s | 100 tokens in 1.87s
✓ Unloaded Qwen/Qwen2.5-3B-Instruct
Loading Qwen/Qwen2.5-7B-Instruct...
✓ Loaded Qwen/Qwen2.5-7B-Instruct
qwen-7b ( 7B): 67.7 tok/s | 100 tokens in 1.48s
✓ Unloaded Qwen/Qwen2.5-7B-Instruct
Loading Qwen/Qwen2.5-0.5B-Instruct...
✓ Loaded Qwen/Qwen2.5-0.5B-Instruct
That’s the opposite of what should happen, and it points to a measurement artifact rather than a real result. The 0.5B model is the first one executed in the loop, so it absorbed all the cold-start costs: CUDA context init, lazy kernel compilation, cuBLAS/cuDNN autotuning, allocator warmup, HF from_pretrained finalization. Notice the timing: 9.45s for 100 tokens vs 1.5–1.9s for the others. That’s roughly 8s of one-time overhead, not steady-state decode time.
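One way to keep the cold-start cost out of the numbers is to run an untimed warmup generation before the timed ones. The sketch below is a minimal illustration, not the benchmark script that produced the log above: `measure_throughput`, `make_fake_generate`, and their parameters are hypothetical names, and the fake generator only simulates a slow first call with `time.sleep` so the example runs without a GPU.

```python
import time


def measure_throughput(generate_fn, n_tokens, warmup_runs=1, timed_runs=3, sync_fn=None):
    """Return tokens/second for generate_fn(n_tokens), excluding warmup runs.

    generate_fn is assumed to produce exactly n_tokens per call.
    sync_fn is an optional callable that blocks until queued device work
    has finished (hypothetical hook; see the usage note below).
    """
    # Untimed warmup: lets one-time initialization happen off the clock.
    for _ in range(warmup_runs):
        generate_fn(n_tokens)
    if sync_fn is not None:
        sync_fn()

    timings = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        generate_fn(n_tokens)
        if sync_fn is not None:
            sync_fn()  # make sure the work is really done before stopping the clock
        timings.append(time.perf_counter() - start)

    # Use the fastest run as the steady-state estimate.
    return n_tokens / min(timings)


def make_fake_generate(cold_start_s=0.2, per_call_s=0.01):
    """Build a stand-in generator whose first call pays a one-time delay."""
    state = {"warm": False}

    def fake_generate(n_tokens):
        if not state["warm"]:
            time.sleep(cold_start_s)  # simulated one-time cold-start cost
            state["warm"] = True
        time.sleep(per_call_s)  # simulated steady-state decode time
        return n_tokens

    return fake_generate


if __name__ == "__main__":
    cold = measure_throughput(make_fake_generate(), 100, warmup_runs=0, timed_runs=1)
    warm = measure_throughput(make_fake_generate(), 100, warmup_runs=1, timed_runs=3)
    print(f"no warmup:   {cold:.0f} tok/s")
    print(f"with warmup: {warm:.0f} tok/s")
```

With a real model, `generate_fn` could wrap something like `model.generate(**inputs, max_new_tokens=n, min_new_tokens=n)` and `sync_fn` could be `torch.cuda.synchronize`, assuming the run uses PyTorch on a CUDA device. Warming up each model after it is loaded (not just the first one in the loop) keeps the comparison fair across sizes.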