I suggest rewriting the formulas that compute v_dW and v_db so that they show how the moving average acts as momentum for gradient descent.

The formula in the lesson:

v_dW := beta * v_dW + (1 - beta) * dW

W := W - alpha * v_dW

The rewritten formula:

v_dW := dW + beta * (v_dW - dW)

W := W - alpha * v_dW = W - [ alpha * dW + alpha * beta * (v_dW - dW) ]

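So we can see that the term beta * (v_dW - dW), where v_dW inside the parentheses is the value from the previous iteration, is the momentum for gradient descent: it is what gets added to the plain gradient step alpha * dW.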
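To double-check that the rewritten form is the same update as the one in the lesson, here is a minimal Python sketch. It assumes scalar gradients and beta = 0.9 for simplicity; the function names and the sample gradient values are made up for illustration and are not from the lesson.

```python
import math

def v_update_lesson(v_dW, dW, beta):
    # Form used in the lesson: weighted average of old v_dW and new dW.
    return beta * v_dW + (1 - beta) * dW

def v_update_rewritten(v_dW, dW, beta):
    # Rewritten form: current gradient plus a correction term.
    return dW + beta * (v_dW - dW)

beta = 0.9                                  # assumed value, for illustration
sample_gradients = [1.0, -0.5, 2.0, 0.25]   # made-up gradients
v_lesson = 0.0
v_rewritten = 0.0
for dW in sample_gradients:
    v_lesson = v_update_lesson(v_lesson, dW, beta)
    v_rewritten = v_update_rewritten(v_rewritten, dW, beta)
    # Both forms should agree up to floating-point rounding.
    print(dW, v_lesson, v_rewritten, math.isclose(v_lesson, v_rewritten))
```

The two forms are algebraically identical, so any difference printed here would only be rounding error.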

If v_dW < dW: the term is negative, so it is a negative acceleration that limits the oscillation

If v_dW > dW: the term is positive, so it is a positive acceleration to speed up convergence
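The two cases can be tried out numerically. This is only a rough sketch under my own assumptions: scalar gradients, v_dW starting at 0, beta = 0.9, and two made-up gradient sequences (one that flips sign every step, one that shrinks suddenly). The helper name momentum_trace is hypothetical.

```python
def momentum_trace(gradients, beta=0.9):
    """Return a list of (dW, correction, new v_dW), starting from v_dW = 0."""
    v_dW = 0.0
    trace = []
    for dW in gradients:
        correction = beta * (v_dW - dW)  # the term beta * (v_dW - dW)
        v_dW = dW + correction           # rewritten update for v_dW
        trace.append((dW, correction, v_dW))
    return trace

# Made-up sequences, for illustration only.
oscillating = momentum_trace([1.0, -1.0, 1.0, -1.0])  # sign flips every step
shrinking = momentum_trace([1.0, 1.0, 1.0, 0.1])      # gradient drops at the end

for name, trace in (("oscillating", oscillating), ("shrinking", shrinking)):
    print(name)
    for dW, correction, v_dW in trace:
        print(f"  dW={dW:+.3f}  correction={correction:+.4f}  v_dW={v_dW:+.4f}")
```

In the oscillating run the correction always has the opposite sign to dW, so the step actually taken is much smaller than alpha * dW. In the last step of the shrinking run the previous v_dW is larger than dW, so the correction is positive and the step is larger than alpha * dW.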