Didn't understand how gradient computation using back prop is order of N+P

In this video, it is mentioned that “If N nodes and P parameters, compute derivatives in roughly N+P steps rather than N * P steps”.

My assumptions are that N refers to the number of neurons (units) across all the layers, and P refers to the number of features in the input plus the bias parameter. So there will be N * P variables to train.

When doing gradient descent, to minimize the cost, we have to compute dJ/dw^[i]_jk for every weight. So, there are a total of N * P derivatives to compute at each step (or epoch) of gradient descent.

If that understanding is correct, then we have to do N * P computations anyway to find all the derivative values. Back prop then helps in reducing the work done for each of those N * P derivatives: instead of figuring out the derivatives at each layer during the forward calculation, we can reuse the computations already done at the next layer.

If we do forward prop, then we will do N * P * X computations, where X refers to the number of computations for doing forward prop (probably equal to the number of layers after the current layer). By doing back prop, we are replacing the X with a constant factor, but still doing N * P computations.

Can someone help me see where I’m going wrong with this reasoning?

The N here refers to nodes in the computation graph, so not strictly the number of neurons (there are at least multiple computations per neuron, such as dot product, bias addition, and activation).
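To make that concrete, here is a rough sketch (my own illustration with made-up sizes, not something from the course) of how a single dense layer already contributes several computation-graph nodes beyond the neuron count:

```python
# Rough illustration: each dense layer adds several computation-graph nodes
# (dot product, bias addition, activation), so N counts more than just the neurons.
import numpy as np

def dense_layer(x, W, b):
    z = W @ x          # graph node: dot product
    s = z + b          # graph node: bias addition
    a = np.tanh(s)     # graph node: activation
    return a

x = np.random.randn(4)       # made-up input with 4 features
W = np.random.randn(3, 4)    # made-up weights for a layer of 3 neurons
b = np.random.randn(3)
print(dense_layer(x, W, b))  # 3 neuron outputs, but several graph nodes along the way
```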

I believe the P in this case refers to the total number of params to train, so we need P gradients at the end of backprop.

Something seems wrong here. We do not do any derivatives during forward prop. We only calculate derivatives during back prop.

I think, in this high-level example, forward prop should take roughly N+P computations as well (similar to back prop).

In practice, I think it’s difficult to really estimate the number of computations needed from just the number of parameters. Different models use different operations on their params (e.g. a 2D conv has a lot more computations per param than, say, a simple linear dot product).

I’m not sure I understand the reasoning here. It sounds like you’re thinking that back prop is an optimization for forward prop? It’s not just an optimization.

Forward prop and back prop are completely different operations.

Forward prop is used to produce a prediction using the model.

Backward prop is used to produce gradients based on the loss. The gradients are then used to update the model.
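To illustrate the distinction, here is a minimal sketch (assuming PyTorch, with a made-up tiny model and data) of forward prop producing a prediction and back prop producing the gradients that are then used for the update:

```python
# Minimal sketch: forward prop -> prediction, back prop -> gradients -> update.
import torch

model = torch.nn.Linear(3, 1)    # tiny made-up model
x = torch.randn(4, 3)            # made-up inputs
y = torch.randn(4, 1)            # made-up targets

pred = model(x)                                 # forward prop: produce a prediction
loss = torch.nn.functional.mse_loss(pred, y)    # loss based on the prediction

loss.backward()                                 # back prop: compute dLoss/dParam for every parameter
with torch.no_grad():
    for p in model.parameters():
        p -= 0.01 * p.grad                      # use the gradients to update the model
```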

Thank you for going point by point in your explanation.

“Something seems wrong here. We do not do any derivatives during forward prop. We only calculate derivatives during back prop.”

I used the term “forward prop” to mean computing the derivatives using forward calculations.

“I think, in this high-level example, forward prop should take roughly N+P computations as well (similar to back prop).”

Maybe taking a step back from my explanation: my confusion is with the assertion that without back prop it would take N * P steps to compute the derivatives.

Maybe an example of how it would take N * P steps would help me understand how it would be N + P if we use back prop.

Cool, that’s easier to understand now.

It’s worth noting that “forward prop” is a term with its own (important) meaning, so it’s best not to use it to mean something else in another context.

The N * P is an approximate upper bound on the computation time (and so is N + P, which is also just an approximate upper bound).

To be clear, the assertions are as follows:

  • Without computational graphs, back propagation could take N * P time.
  • With computational graphs, back propagation could take N + P time.

Where P is the number of parameters, and N is the number of computations per parameter that’s needed to compute the model output (i.e. forward prop).

Without computational graphs, to compute the derivative for each parameter (P of them), we need to calculate the derivative at every step of the computation (N steps) separately. There are N steps for each of the P parameters, and so the time would be N * P.
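As a toy sketch of that N * P picture (my own illustration, not from the lecture): if nothing is reused, one way to get each parameter’s derivative is a separate full pass through the model, e.g. with finite differences, so P derivatives each cost roughly N computations:

```python
# Each parameter's derivative is computed by its own full forward pass (~N work),
# so the total cost is roughly N * P.
import numpy as np

def forward(params, x):
    # Stand-in model: a single call costs roughly N computations.
    return np.sum(np.tanh(params * x))

def naive_gradients(params, x, eps=1e-6):
    base = forward(params, x)                         # ~N computations
    grads = np.zeros_like(params)
    for i in range(len(params)):                      # P iterations ...
        bumped = params.copy()
        bumped[i] += eps
        grads[i] = (forward(bumped, x) - base) / eps  # ... each costing ~N more
    return grads                                      # total work ~ N * P

params = np.random.randn(5)                           # made-up parameters and input
x = np.random.randn(5)
print(naive_gradients(params, x))
```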

With computational graphs, we are effectively able to reuse the intermediate derivative computations, and so technically we no longer have to compute the derivative of a node twice. Therefore, the computation is capped at N (being the number of computations) + P (being the number of parameters).
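Here is the matching toy sketch of the N + P picture (again my own illustration): one backward sweep computes each intermediate node’s derivative exactly once, and every parameter just reuses the cached upstream value:

```python
# One backward sweep over the graph (~N steps) plus one read-off per parameter (~P steps).
import numpy as np

w1, w2 = 0.5, -1.3          # P = 2 made-up parameters
x = 2.0

# Forward pass: each intermediate node is computed once.
a = w1 * x                  # node 1
h = np.tanh(a)              # node 2
y = w2 * h                  # node 3 (model output)
loss = 0.5 * y ** 2         # node 4

# Backward pass: each node's derivative is computed exactly once and reused.
dloss_dy = y                        # d(0.5*y^2)/dy
dloss_dh = dloss_dy * w2            # reuses dloss_dy
dloss_da = dloss_dh * (1 - h ** 2)  # reuses dloss_dh; tanh'(a) = 1 - tanh(a)^2

# Parameter gradients: roughly one extra step per parameter.
dloss_dw2 = dloss_dy * h
dloss_dw1 = dloss_da * x
print(dloss_dw1, dloss_dw2)
```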

Thank you. That clarified my question: N is the number of computations per parameter that’s needed to compute the model output.