NumPy and vectorization: an even faster way to calculate the dot product

In the “Optional Lab: Python, NumPy and Vectorization” of week 2 there is a speed comparison of calculating the dot product of 10^7 values using a for loop versus NumPy’s vectorized `np.dot` function. I played around with that and came up with a method that runs even faster (~2 times faster on my machine) than the `dot` function, which is simply `c = np.sum(a * b)`.
Now I wonder why that is, and why NumPy doesn’t simply use the same method inside the `dot()` function if it is that much faster? Are there any experts here who know more?
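For anyone who wants to reproduce the comparison, here is a minimal benchmark sketch along the lines of what the lab does (the variable names and the 10^7 size come from the post above; the timing code itself is my own assumption about how it was measured):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 10**7
a = rng.random(n)
b = rng.random(n)

def loop_dot(a, b):
    # plain Python accumulation, one element at a time
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

t0 = time.perf_counter()
c_loop = loop_dot(a, b)
t1 = time.perf_counter()
c_dot = np.dot(a, b)
t2 = time.perf_counter()
c_sum = np.sum(a * b)
t3 = time.perf_counter()

# all three compute the same value up to floating-point rounding
print(f"loop: {t1 - t0:.3f}s  np.dot: {t2 - t1:.3f}s  np.sum(a*b): {t3 - t2:.3f}s")
```

The exact timings (and which of the two vectorized versions wins) will vary with hardware, NumPy build, and the BLAS library NumPy was compiled against, so your ~2x result may not reproduce everywhere.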

At a high level, the point is that `np.dot` and `np.sum` of `np.multiply` are not the same operation in the general case, right? They are only equivalent when the inputs are both 1-D vectors of the appropriate orientation.
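A quick demonstration of that point: for 1-D vectors the two agree, but as soon as the inputs are 2-D, `np.dot` performs a matrix multiplication while `np.sum(A * B)` collapses the elementwise product to a single scalar.

```python
import numpy as np

# 1-D case: np.dot(a, b) and np.sum(a * b) compute the same scalar
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(np.dot(a, b))    # 32.0
print(np.sum(a * b))   # 32.0

# 2-D case: completely different operations
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(np.dot(A, B))    # matrix product, a 2x2 array [[19, 22], [43, 50]]
print(np.sum(A * B))   # sum of elementwise products, the scalar 70.0
```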

I don’t know the details of what happens inside the NumPy code, but it’s all open source, right? You’re welcome to take a look if you want to go deeper on any of this. But the point is that the logic is calling functions from underlying math libraries, probably written in C or C++ and even a bit of assembly language, to take advantage of the vector instructions provided by the CPU. There should be a “multiply accumulate” instruction, which is the fundamental building block of matrix multiplication, in addition to plain vanilla add and multiply. The other thing to note is that memory and cache behavior are also critical to the performance of this type of operation, so it could just be that you got lucky in how the arrays were laid out in memory in one case versus the other.
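To make the “multiply accumulate” idea concrete, here is the same pattern written out in plain Python (just an illustration of the arithmetic, not what NumPy actually executes): each step does `acc = acc + a[i] * b[i]`, which is exactly what a hardware fused multiply-add (FMA) instruction performs, except the hardware does it on a whole SIMD lane of elements at once.

```python
import numpy as np

def dot_via_mac(a, b):
    """Dot product as a chain of multiply-accumulate steps.

    Each iteration performs acc = acc + x * y -- the same pattern a
    fused multiply-add instruction computes, one element at a time
    here instead of several per SIMD register.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(dot_via_mac(a, b))   # 32.0, matches np.dot(a, b)
```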

The best we can do as Python/NumPy programmers is to use the appropriate NumPy primitives; what actually happens at runtime then depends on what kind of h/w we have and the underlying linear algebra libraries. If you want to know more about the underlying vector instruction sets, here’s the Wikipedia page for the Intel version of that. If you’ve got a GPU, consult the documentation.
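If you’re curious which linear algebra library your own NumPy build is actually calling, NumPy can tell you directly:

```python
import numpy as np

# Prints build information, including which BLAS/LAPACK implementation
# (e.g. OpenBLAS, MKL, Accelerate) this NumPy build is linked against.
# That library is what actually runs when you call np.dot on large arrays.
np.show_config()
```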

But the real question is how fast even the slower vectorized version of the code is compared to the “for loop” version, right? The actual example they give in the notebook is not really big enough to show the difference very clearly, but your version using 10^7-element inputs should show it dramatically. :nerd_face: