Vectorization vs. Explicit For Loop

I don’t quite understand why vectorization is so much faster than an explicit for loop, given that for W of dimension n_x and X of dimension (n_x, m), computing WᵀX involves executing n_x * m multiplication operations in both cases.

There are two aspects: 1) parallelization across multiple threads/cores, and 2) processor instructions that handle multiple data elements with a single instruction. In most cases, both are hidden from you by the OS or by libraries.

  1. parallelization
    If you write a for loop that iterates, say, 10,000 times, it is a sequential process over the data, 10,000 times on a single thread in most cases. So the total time depends entirely on the speed of one thread on a single processor core.
    If you write the same computation as a vector operation, it is easy to parallelize: libraries and the OS can break the vector operation into multiple parts and dispatch them to multiple threads/cores.
  2. processor instructions
    A CPU, sometimes called a scalar processor, handles one or a few data elements per instruction. But there is a feature called SIMD (Single Instruction, Multiple Data) that can load a large number of data elements and process them with just one instruction. So, if you use a vector/array implementation, the data can be loaded into SIMD registers and processed by far fewer instructions. (See the sketch right after this list.)
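
A minimal sketch of what this looks like in practice (my own illustration; exact timings will vary with your CPU and NumPy build):

```python
import time
import numpy as np

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Explicit for loop: one multiply-add at a time, on a single Python thread
tic = time.time()
total = 0.0
for i in range(n):
    total += a[i] * b[i]
print(f"for loop : {time.time() - tic:.3f} sec")

# Vectorized: NumPy hands the whole operation to compiled, SIMD-capable code
tic = time.time()
total_vec = np.dot(a, b)
print(f"np.dot   : {time.time() - tic:.4f} sec")
```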

And I think you have heard of GPUs. They are commonly used for graphics/gaming, but are also used for machine learning. A GPU has a huge number of cores (like 2,000), each with SIMD. Compare a “for loop” doing one operation at a time on a single thread versus parallel, multiple-data processing across 2,000 cores… the difference can be huge.

And most of the code for parallelization and SIMD instructions is hidden. So, if you just use a vector implementation, you can take advantage of all of it.

(Just for clarification: one core can run multiple threads, so single-thread performance is really a share of a single core’s capability divided among its threads.)

Modern CPUs (even in laptops) have built-in instructions for handling vectors efficiently. This logic was originally developed for handling 3D graphics computations, which typically involve lots of matrix and vector calculations. Once 3D graphics became popular, the chip companies expanded the CPU instruction sets to include vector support. Here’s the Wikipedia page for Intel’s standard vector instruction set.

The high-level point is that the compiler or interpreter is not capable of translating your for loops into vector instructions. It can perform a lot of optimizations, but that one is too complicated. So you need to explicitly write your code to express the computations using vector primitives. NumPy is the Python library for doing that, but other languages have their own support for vectors; some, like MATLAB, even build it into the base language, whereas it’s an “add-on” in Python. What the NumPy code then actually does is call the lower-level vector functions in the linear algebra libraries for the CPU, which are typically written in a combination of C and assembly language to take advantage of the CPU’s vector instruction set.
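
As a quick check of this, you can ask NumPy which low-level libraries your installation was built against (np.show_config() is part of NumPy’s public API):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries (e.g., OpenBLAS or MKL) that this
# NumPy build delegates to -- those libraries contain the C/assembly
# kernels that use the CPU's vector instructions.
np.show_config()
```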

Here is a toy example: calculate the element-wise product (Hadamard product) of two (10000, 10000) matrices with a “for loop” and with “np.multiply”.
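
The code itself is not shown above; a minimal sketch of an equivalent benchmark might look like this (timing with time.time() is my assumption; the numbers below are from the original run):

```python
import time
import numpy as np

n = 10_000  # an (n, n) float64 matrix takes ~800 MB; reduce n if memory is tight
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Element-wise (Hadamard) product with an explicit double for loop
tic = time.time()
C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        C[i, j] = A[i, j] * B[i, j]
print(f"For loop : {time.time() - tic:.2f} sec")

# The same computation, vectorized
tic = time.time()
C_vec = np.multiply(A, B)  # equivalent to A * B
print(f"Vector operation : {time.time() - tic:.3f} sec")
```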

For loop : 62.16 sec
Vector operation : 0.355 sec

Loop unrolling may be performed by a compiler in some cases, but I prefer not to use a “for loop” in my programming, for the reasons above. :slight_smile:

Dear Nobu (hope I can address you like that :slight_smile:),

I was going through the assignment and came to this section, and it seemed like on my system the for loops are faster than the vectorization in our case. The above example is quite convincing, and I have also noticed the difference on the old laptop that I have used for running these algos, but I never saw the below:
For loop:
dot = 278
----- Computation time = 0.09255900000004758ms
outer = [[81. 18. 18. 81. 0. 81. 18. 45. 0. 0. 81. 18. 45. 0. 0.]
[18. 4. 4. 18. 0. 18. 4. 10. 0. 0. 18. 4. 10. 0. 0.]
[45. 10. 10. 45. 0. 45. 10. 25. 0. 0. 45. 10. 25. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[63. 14. 14. 63. 0. 63. 14. 35. 0. 0. 63. 14. 35. 0. 0.]
[45. 10. 10. 45. 0. 45. 10. 25. 0. 0. 45. 10. 25. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[81. 18. 18. 81. 0. 81. 18. 45. 0. 0. 81. 18. 45. 0. 0.]
[18. 4. 4. 18. 0. 18. 4. 10. 0. 0. 18. 4. 10. 0. 0.]
[45. 10. 10. 45. 0. 45. 10. 25. 0. 0. 45. 10. 25. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
----- Computation time = 0.22416799999991355ms
elementwise multiplication = [81. 4. 10. 0. 0. 63. 10. 0. 0. 0. 81. 4. 25. 0. 0.]
----- Computation time = 0.11251499999986869ms
gdot = [17.73515961 12.50568493 31.70694917]
----- Computation time = 0.4266800000001236ms

Vectorized Dot Product…

dot = 278
----- Computation time = 0.583562000000093ms
outer = [[81 18 18 81 0 81 18 45 0 0 81 18 45 0 0]
[18 4 4 18 0 18 4 10 0 0 18 4 10 0 0]
[45 10 10 45 0 45 10 25 0 0 45 10 25 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[63 14 14 63 0 63 14 35 0 0 63 14 35 0 0]
[45 10 10 45 0 45 10 25 0 0 45 10 25 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[81 18 18 81 0 81 18 45 0 0 81 18 45 0 0]
[18 4 4 18 0 18 4 10 0 0 18 4 10 0 0]
[45 10 10 45 0 45 10 25 0 0 45 10 25 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
----- Computation time = 0.170143000000067ms
elementwise multiplication = [81 4 10 0 0 63 10 0 0 0 81 4 25 0 0]
----- Computation time = 0.1312960000001695ms
gdot = [17.73515961 12.50568493 31.70694917]
----- Computation time = 0.562523000000148ms
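
For reference, a minimal sketch of the kind of timing harness that produces output like the above (the x1/x2 values are reconstructed from the printed results; timing via time.process_time() is an assumption, and the outer/element-wise/gdot cases follow the same pattern):

```python
import time
import numpy as np

x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

# Classic dot product with an explicit for loop
tic = time.process_time()
dot = 0
for i in range(len(x1)):
    dot += x1[i] * x2[i]
toc = time.process_time()
print(f"dot = {dot}\n ----- Computation time = {1000 * (toc - tic)}ms")

# Vectorized dot product
tic = time.process_time()
dot = np.dot(x1, x2)
toc = time.process_time()
print(f"dot = {dot}\n ----- Computation time = {1000 * (toc - tic)}ms")
```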

This is just a mention, and I am not contesting anything :slight_smile:

The matrix that you are using for the comparison is very small…

On my PC,

elapsed time (for loop) = 0.0002512931823730469
elapsed time (matrix operation) = 6.413459777832031e-05

The matrix operation is still faster, but a comparison at this size includes several sources of noise and cannot be a pure comparison between a “for loop” and a “vector operation”. Due to the small overhead of vectorization, a very small “for loop” may even be faster. Of course, a compiler can perform loop unrolling if the “for loop” is simple enough; loop unrolling breaks your “for loop” into instructions that make the best use of existing features like SIMD to get better performance.

I just did a quick test using the equations that the assignment used, changing the sizes of x1 and x2. Here is the result.

As you see, with a small matrix the difference is small. That is what you saw in the assignment.
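
For anyone who wants to reproduce that kind of test, a minimal sketch of a size sweep (my own illustration, not the exact code behind the result above):

```python
import time
import numpy as np

for n in (100, 10_000, 1_000_000):
    x1 = np.random.rand(n)
    x2 = np.random.rand(n)

    # Explicit for loop
    tic = time.perf_counter()
    dot = 0.0
    for i in range(n):
        dot += x1[i] * x2[i]
    t_loop = time.perf_counter() - tic

    # Vectorized
    tic = time.perf_counter()
    dot_vec = np.dot(x1, x2)
    t_vec = time.perf_counter() - tic

    print(f"n = {n:>9}: for loop {t_loop:.6f} s, np.dot {t_vec:.6f} s")
```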