Vectorization: Comparing R and Python Question

So I am fairly new to Python and fairly well acquainted with R, but fully self-taught. As a result I don’t know all of the inner workings behind the curtains of certain functions.

I was just introduced to vectorization in DeepLearning.AI. Is anyone else here well enough acquainted with R to be able to tell me whether np.array()'s optimization is similar to R’s purrr::map() or, more likely, furrr::future_map()?

If anyone else is curious, I will run a simple tictoc test on purrr/furrr and post the results back here.

Update

It appears part of the answer is simpler than I thought. I had forgotten that R has vectorization built into its structure. I am not sure how this scales out to parallel processing yet, but here are my Python and R chunks and their timed outputs.

Python

import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()

print("Vectorized version:" + str(1000*(toc-tic)) + "ms")

# Explicit loop over the same arrays
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()

print("For loop:" + str(1000*(toc-tic)) + "ms")

Output

Vectorized version:54.151296615600586ms
For loop:676.0082244873047ms

R

a <- runif(1000000, 0, 1)
b <- runif(1000000, 0, 1)

# Vectorized dot product
tictoc::tic("Vectorized version")
c <- sum(a * b)
tictoc::toc()

# Explicit loop over the same vectors
c <- 0
tictoc::tic("For loop")
for (i in seq_along(a)) {
  c <- a[i] * b[i] + c
}
tictoc::toc()

Output

Vectorized version: 0.013 sec elapsed
For loop: 0.065 sec elapsed


Hi, Bryan.

Interesting questions and interesting results! I did learn R quite a while ago when I was taking the Johns Hopkins Data Science series, but that was years ago and I’ve forgotten everything I might have known. I don’t remember furrr or purrr at all.

So you get a big win from vectorization in both cases, but notice that the results seem to show that R is just fundamentally faster. If you convert everything to milliseconds, then the comparison is:

Vectorized: 54 ms in numpy and 13 ms in R
Loop: 676 ms in python and 65 ms in R

So the loop is 10x slower in python than in R and the vectorized code is roughly 4x slower in python. Why would that be? I can think of some possibilities:

  1. Maybe the R code ends up doing 32-bit FP. The way your python code is written, it is doing 64-bit FP.

  2. The time call in python is giving wall clock time, not cpu time. Do you know what tictoc does in R? Maybe it’s cpu time. You can observe the variability of the python numbers by just running the same test multiple times (see the sketch just after this list). The results are typically not consistent, because the wall clock time is affected by whatever multitasking is going on at the same time.

  3. Maybe R has a better JIT compiler. It also sometimes takes the python JIT logic a while to get its act together. If you run the same benchmark 5 or 10 times in a row, typically you see it get faster for the first 2 or 3 times and then the numbers just bounce around after that because of the wall clock time issue.
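To see the run-to-run variability described in point 2, and whether the first few repetitions are slower as point 3 suggests, here is a minimal sketch using Python’s standard timeit module (the array size and repeat count are arbitrary choices, not part of the original benchmark):

import timeit
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Run the same vectorized operation 10 times and report each wall-clock
# timing; the spread shows how much the numbers bounce around.
times_ms = [1000 * t for t in timeit.repeat(lambda: np.dot(a, b),
                                            number=1, repeat=10)]
print(["%.2f ms" % t for t in times_ms])

timeit’s default timer is a wall clock (time.perf_counter), so this measures the same kind of thing as the time.time calls above, just with less boilerplate.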

The other point is that, within each language, the loop is about 12.5x slower than the vectorized version in python, but only 5x slower in R. That would seem to argue for the JIT compiler being the difference. If that theory is right, R just generates better runtime code for the loop.

Of course what is happening in the vectorized case is not really done by R or numpy: it’s the underlying BLAS/LAPACK libraries invoking hardware-level operations, and those should be the same in both cases. But maybe 10^6 elements isn’t enough to hide whatever difference in the setup overhead there is between the two cases. Or maybe that argues for the theory that tictoc in R is measuring cpu time instead of wall clock time. Inquiring minds! :nerd_face:
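One way to poke at the setup-overhead theory from the Python side is to time np.dot over a range of sizes and see whether the cost per element stops improving for the smaller arrays. This is just a sketch; the sizes are arbitrary and not part of the original experiment:

import time
import numpy as np

# If there is a fixed per-call overhead, the per-element cost should look
# noticeably worse for the smaller arrays.
for n in (10000, 100000, 1000000, 10000000):
    a = np.random.rand(n)
    b = np.random.rand(n)
    tic = time.time()
    np.dot(a, b)
    toc = time.time()
    print("n=%9d: %8.3f ms total, %6.2f ns per element"
          % (n, 1000*(toc-tic), 1e9*(toc-tic)/n))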

Super interesting ideas, @paulinpaloalto! I have no idea about the answers to your speculations on FP, no idea whether R’s tictoc package measures wall clock or cpu time, and to be honest, I don’t know what a JIT compiler is :sweat_smile: sounds like there is still loads of learning ahead.

Something I just thought I should have mentioned is that I may be comparing apples to oranges in the run times above. I ran both chunks in RStudio, which (like other IDEs) can run multiple languages. However, I don’t know whether there is extra time required to access the Python engine through RStudio versus the R engine. A more proper test would likely be to run both languages from the command line?

I have a few more hours I need to put in on my thesis work today, but once that is finished I will try a few other experiments.

JIT compiler means “Just In Time” compiler. Both python and R are “interpreted” languages, not compiled ahead of time like C or C++. Running everything directly in the interpreter would be hideously slow, so these runtimes compile the interpreted input into a faster form “on the fly” or “just in time”: at minimum byte code for a virtual machine (which is what standard CPython and R do), and in true JIT systems native machine code. It’s that compiled form that actually gets run. But different interpreters are more or less sophisticated in terms of how they manage this process. Some of them just interpret everything the first couple of times and only invest the effort of the compilation in the case of loops or code that is getting invoked a lot.
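As a small illustration of that byte-code step on the Python side, the standard dis module shows what CPython compiles a function into before executing it (the exact opcodes vary between Python versions):

import dis

def loop_dot(a, b):
    # Same pattern as the for-loop benchmark above
    c = 0
    for i in range(len(a)):
        c += a[i] * b[i]
    return c

# Print the byte code the interpreter actually executes for loop_dot
dis.dis(loop_dot)

Every one of those instructions goes back through the interpreter’s dispatch loop, which is a big part of why the explicit Python loop is so much slower than the single np.dot call.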

The questions about what R is actually doing should be answerable with a little digging around in the R documentation. E.g. the doc for “tictoc” should just say what kind of time it is measuring (wall clock or cpu). A quick google turns up this page about benchmarking in R. It looks like tictoc measures wall clock time, so that would be equivalent to your method in python. They do explain how to use system.time to get the actual cpu time. Now we need to figure out the equivalent in python.
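On the python side, the standard library already has this built in: time.process_time() returns cpu time (user plus system) for the current process, while time.time() is the wall clock used in the benchmark above. A minimal sketch measuring both around the same operation:

import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

wall_tic = time.time()           # wall clock, as in the earlier benchmark
cpu_tic = time.process_time()    # cpu time charged to this process

c = np.dot(a, b)

wall_ms = 1000 * (time.time() - wall_tic)
cpu_ms = 1000 * (time.process_time() - cpu_tic)
print("wall: %.3f ms, cpu: %.3f ms" % (wall_ms, cpu_ms))

One caveat: if numpy’s BLAS backend runs the dot product on multiple threads, the cpu number can come out larger than the wall number, because cpu time is summed across all threads of the process.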

Okay, and am I assuming correctly that a cpu time comparison would bypass the concern of running python in RStudio or R in Jupyter Notebooks, because it would just measure the time actually spent on the specific operation?

Actually that’s a good point: I hadn’t thought about the implications of running things in a Jupyter notebook. But no, that’s a different issue. No matter where you run a computer program, you’re running on top of an operating system. It could be Windows, Mac, Linux, Unix, Multics or whatever. But the point is that the operating system does what is called “multitasking”, meaning that it’s actually running lots of different programs and background system functions (e.g. accessing the disk or sending and receiving network packets) “underneath”. It looks to you as if your program has complete control of the computer, but it actually doesn’t.

So when you measure “wall clock” time, that is just the start and end time of an operation. During the time your program took to complete that operation, you have no way to know how many times the OS “time sliced” you out and ran some interrupt routine because a network packet came in or some other thing happened in the background. But most (good) OSes also give you a more detailed way to measure the amount of cpu time that your program actually used. That’s a more accurate metric for the type of question you’re trying to answer here. When you get a cpu time result, it is typically broken down into user time and system time, meaning time spent actually running your code and time spent in system calls that your program generated during whatever operation it was that you were timing.
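In python, one way to see that user/system breakdown is os.times() from the standard library (a sketch; exactly which fields are meaningful depends on the OS):

import os
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

before = os.times()
c = np.dot(a, b)
after = os.times()

# user = time spent running our code, system = time spent in OS calls
# made on our behalf during the operation
print("user: %.3f ms, system: %.3f ms"
      % (1000 * (after.user - before.user),
         1000 * (after.system - before.system)))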

I don’t have time to get into the implications of Jupyter notebooks right now, but later today maybe I can reply with more info.