This is the output of the loop version:
dot = 278
----- Computation time = 0.09234899999999158ms
outer
----- Computation time = 0.2062169999998975ms
elementwise multiplication = [81. 4. 10. 0. 0. 63. 10. 0. 0. 0. 81. 4. 25. 0. 0.]
----- Computation time = 0.1319379999999981ms
gdot = [29.85900722 14.98066703 16.87724005]
----- Computation time = 0.1751969999999048ms
This is the output of the vectorized version:
The vectors in this example are too small to give you a clear picture of how much faster vectorization is. I’m not sure why they did it that way, but if you experiment with 10k or 50k elements you’ll see much clearer results.

The other thing to realize is that they are using “wall clock” time for the metrics here, which makes the results not very dependable: you’re at the mercy of whatever else the OS is running and context switching against your code. You can see this effect by running the same test over and over: the results bounce around quite a bit. A better way to measure performance is to use the CPU time your process actually consumed instead of wall clock time.
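If you want to try both suggestions, here is a minimal sketch, assuming NumPy is installed. The helper names `loop_dot` and `time_call` are made up for illustration and are not from the course notebook. It runs a dot product on 50k elements several times and reports the best wall clock time (`time.perf_counter()`) and the best CPU time (`time.process_time()`) for each version.

```python
import time
import numpy as np


def loop_dot(x, y):
    """Dot product with an explicit Python loop (hypothetical helper)."""
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def time_call(fn, *args, repeats=5):
    """Run fn several times; return (result, best wall ms, best CPU ms)."""
    best_wall = float("inf")
    best_cpu = float("inf")
    result = None
    for _ in range(repeats):
        wall_start = time.perf_counter()   # wall clock time
        cpu_start = time.process_time()    # CPU time used by this process
        result = fn(*args)
        cpu_ms = (time.process_time() - cpu_start) * 1000
        wall_ms = (time.perf_counter() - wall_start) * 1000
        best_wall = min(best_wall, wall_ms)
        best_cpu = min(best_cpu, cpu_ms)
    return result, best_wall, best_cpu


rng = np.random.default_rng(0)  # fixed seed so runs use the same data
n = 50_000                      # much larger than the original test vectors
x = rng.random(n)
y = rng.random(n)

loop_result, loop_wall, loop_cpu = time_call(loop_dot, x, y)
vec_result, vec_wall, vec_cpu = time_call(np.dot, x, y)

print(f"loop:       wall = {loop_wall:.3f} ms, cpu = {loop_cpu:.3f} ms")
print(f"vectorized: wall = {vec_wall:.3f} ms, cpu = {vec_cpu:.3f} ms")
```

Taking the best of several runs is one simple way to damp the noise from other processes. Keep in mind that `time.process_time()` can have coarse resolution on some platforms, so very fast calls may report 0 ms, and if your NumPy build uses a multithreaded BLAS, the CPU time can come out larger than the wall clock time.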