Computation graph: N + P vs N x P

Hello @MrGalaxy,

Let’s try to argue that.

Backprop saves time because it avoids recomputing values that have already been computed. In other words, it avoids visiting the same node more than once.

Without backprop, computing \frac{\partial{J}}{\partial{w}} for each parameter w requires walking the chain of nodes from J to w, and the length of that chain is \sim N (in the worst case, we go through all N nodes). Therefore, all P parameters together require a total of N\times P, because (in the worst case) we walk the whole chain once per parameter. (N is the worst case. In practice, the walk does not necessarily cover the whole chain for every parameter, but that does not matter: the point is not an absolute measurement but that the cost per parameter scales with N.)

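With backprop, each node only needs to be visited once, because the gradient computed at that node is stored and reused for every parameter that depends on it. Each parameter also only needs to be visited once, so the total is N + P.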
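To make the counting concrete, here is a minimal sketch in Python. It assumes a toy chain a_k = w_k * a_{k-1} with J = a_N, so there is one parameter per node and P = N. The function names (`forward`, `grads_naive`, `grads_backprop`) and the way steps are counted are my own choices for illustration, not something from the course.

```python
def forward(ws, x):
    """Return activations [a_0, ..., a_N] for the chain a_k = w_k * a_{k-1}."""
    acts = [x]
    for w in ws:
        acts.append(w * acts[-1])
    return acts


def grads_naive(ws, acts):
    """Compute dJ/dw for each parameter by re-walking the chain from J every time."""
    n = len(ws)
    grads, steps = [], 0
    for k in range(n):
        upstream = 1.0  # dJ/dJ
        # Walk back from J through every node that comes after node k.
        for j in range(n - 1, k, -1):
            upstream *= ws[j]
            steps += 1
        steps += 1  # node k itself, where the parameter lives
        grads.append(upstream * acts[k])
    return grads, steps


def grads_backprop(ws, acts):
    """Compute the same gradients in one backward sweep, reusing the upstream value."""
    n = len(ws)
    grads = [0.0] * n
    upstream = 1.0  # dJ/dJ
    node_visits = param_visits = 0
    for k in range(n - 1, -1, -1):
        grads[k] = upstream * acts[k]  # each parameter is handled once
        param_visits += 1
        upstream *= ws[k]              # each node is handled once, result reused
        node_visits += 1
    return grads, node_visits + param_visits


ws = [1.0] * 100  # N = P = 100 in this toy chain
acts = forward(ws, 1.0)
print(grads_naive(ws, acts)[1])     # 5050, grows like N * P
print(grads_backprop(ws, acts)[1])  # 200, which is N + P
```

Both functions return the same gradients. In this toy chain the naive count is N(N+1)/2 rather than exactly N x P, because parameters near J need a shorter walk, but it still grows like N x P, while the backprop count is exactly N + P.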