Vectorization using np.sum()

I tested the speed of two functions: compute_cost, which comes from the optional labs, and compute_cost2, which I wrote myself. In compute_cost2 I use np.sum() to add up the squared errors before multiplying by 1 / (2 * m) to get the final cost. Surprisingly, compute_cost turned out to be faster than compute_cost2, although I expected compute_cost2 to be the faster one.
I first tried a small dataset and then the big one, and both gave the same result: compute_cost is faster than compute_cost2.

import numpy as np

X_train = np.random.randint(200, size=(10000000, 50))
y_train = np.random.randint(100, size=(10000000))
w_init = np.random.rand(50)
b_init = np.random.rand()

def compute_cost(X, y, w, b): 
    m = X.shape[0]
    cost = 0.0
    for i in range(m):                                
        f_wb_i = np.dot(X[i], w) + b           #(n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2       #scalar
    cost = cost / (2 * m)                      #scalar    
    return cost

def compute_cost2(x, y, w, b):
    m = x.shape[0]
    y_hat = np.zeros(m)
    for i in range(m):
        y_hat[i] = np.dot(x[i], w) + b

    cost = np.sum((y_hat - y) ** 2)   # sum of squared errors
    cost = cost / (2 * m)
    return cost

Benchmark test:

import time 
tic = time.time()  # capture start time
cost2 = compute_cost2(X_train, y_train, w_init, b_init)
toc = time.time()  # capture end time

print(f" compute_cost2 function duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
cost1 = compute_cost(X_train, y_train, w_init, b_init)
toc = time.time()  # capture end time

print(f"compute_cost function duration: {1000*(toc-tic):.4f} ms ")

The result:

compute_cost2 function duration: 80717.3033 ms 
compute_cost function duration: 78757.5171 ms 

You are already looping over all the examples, which is unnecessary when using NumPy. You can simply use np.dot(X, w) + b, which should speed things up considerably.

I like to look at the shapes for matrix multiplications:
X has shape (10M, 50)
w has shape (50,) (you could reshape this to (50, 1))
The matrix dot product of (10M, 50) and (50, 1) gives an output of shape (10M, 1)

Your loop forces NumPy to multiply (1, 50) by (50, 1) ten million times, which is inefficient.
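To make the suggestion concrete, here is a minimal sketch of a fully vectorized cost function (the function name is mine, not from the labs); it replaces the per-example loop with a single matrix-vector product:

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    m = X.shape[0]
    f_wb = np.dot(X, w) + b                  # (m, n) @ (n,) -> (m,)
    cost = np.sum((f_wb - y) ** 2) / (2 * m) # mean squared error / 2
    return cost
```

On small inputs it returns the same value as the looped versions, but the whole prediction step happens in one BLAS call instead of m separate dot products.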

The following shows the time difference for np.sum very well (about 10 s vs. 0.3 s for me):

import time

tic = time.time()  # capture start time
m = X_train.shape[0]
total = 0
for i in range(m):
    total += X_train[i]
print(total)
toc = time.time()  # capture end time

print(f"For-loop sum duration: {1000*(toc-tic):.4f} ms")

tic = time.time()  # capture start time
total = np.sum(X_train, axis=0)
print(total)
toc = time.time()  # capture end time

print(f"Numpy sum duration: {1000*(toc-tic):.4f} ms")

NumPy is designed for efficient matrix algebra. Avoid for-loops over examples if you want your code to run efficiently.
