Dear all,
This optional video is really cool. I think I understand most of it, but I fail to grasp the part about how it can reduce the computation requirement to N + P time. Does anyone have any resources that explain this further? Thanks in advance.

Backprop saves time because it avoids recomputing already-computed values. In other words, it avoids visiting the same node more than once.

Without backprop, for each parameter w, computing \frac{\partial{J}}{\partial{w}} requires going through the chain of nodes from J to w, and the length of that chain is \sim N (in the worst case, we go through all N nodes). Therefore, for all parameters, it requires a total of N\times P, because (in the worst case) we go through the whole chain of nodes once per parameter. (N is the worst case. In practice it is not necessarily the whole chain for every parameter, but that does not matter, because this is not an absolute measurement; the point is that the cost scales with N.)

With backprop, each node only needs to be visited once, and each parameter also only needs to be visited once, so it is N + P.
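The two visit counts above can be made concrete with a toy counting sketch. This is my own illustration (not code from the video): P parameters each feed the first node of a chain of N nodes ending in J, and every node's local derivative is simply 2, so we can count how many local-derivative evaluations each strategy performs.

```python
# Toy illustration of N*P vs N+P (made-up chain, every local derivative = 2).

N, P = 6, 4  # chain length and number of parameters

def naive_gradients():
    """Recompute the whole chain-rule product once per parameter."""
    count = 0
    grads = []
    for _ in range(P):
        upstream = 1.0
        for _ in range(N):       # walk all N nodes for THIS parameter
            upstream *= 2.0      # local derivative of each node is 2
            count += 1
        grads.append(upstream)
    return grads, count          # count == N * P

def backprop_gradients():
    """One backward sweep caches dJ/d(node); each parameter reads it once."""
    count = 0
    upstream = 1.0
    for _ in range(N):           # visit each node exactly once
        upstream *= 2.0
        count += 1
    grads = []
    for _ in range(P):           # one cheap read per parameter
        grads.append(upstream)
        count += 1
    return grads, count          # count == N + P

g1, c1 = naive_gradients()
g2, c2 = backprop_gradients()
assert g1 == g2                  # both strategies give the same gradients
print(c1, c2)                    # 24 10, i.e. N*P vs N+P
```

Both strategies produce identical gradients; only the amount of repeated work differs, which is the whole point of caching in backprop.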

Many thanks for your help in the forum. I noticed from your LinkedIn that you have done both courses. I have just finished MLS and am interested in learning more about ML, and I would appreciate your thoughts on which programme to pursue first. For background, I have some experience with Python via CS50P by David Malan.

Personally, I have enjoyed MLS very much, though at times it is a bit slow, which I compensated for with a higher playback speed. I think MLS gave me great intuition for ML development. At the same time, I reckon there are a lot of practical aspects of model training that are simplified in the course, which I would like to learn more about in the next course.

Many thanks for taking the time to read this. I would appreciate it if you could share your thoughts, either by replying directly or by pointing me to any post you might have written in this area.

P.S. I noted this might be off topic, so please feel free to delete it if more appropriate. I did try to PM you but wasn't able to do so…

I just checked out the latest IBM syllabus and found that it has changed quite a lot. However, I am not able to see the latest list of items because my account is locked to displaying only the items for the old version of the IBM courses. Judging from the syllabus alone (which may not be accurate), I don't think it is very machine-learning centric.

DLS is certainly ML (neural network) centric. DLS has 5 courses.

The first half of the 1st course:

- is like a bridge that goes through similar topics as MLS courses 1 & 2, but in a more maths-oriented way
- introduces the notation that will be used in the rest of DLS

The second half covers:

- backprop and vectorization, which were optional in MLS
- deep neural networks

The 2nd course covers:

- the concepts of bias/variance and regularization techniques
- advanced gradient descent algorithms
- hyperparameter tuning

The 3rd course is mainly about the general process of improving a model through its development cycle, plus some other topics like transfer learning.

The 4th course is about CNNs and introduces some popular architectures and their rationales.

The 5th course is about RNNs and the transformer.

Now, suppose your goal is to actually train some model:

- First, you need to know what techniques you can apply to a model. I believe course 2 will give you what every ML practitioner should have in their toolbox.
- Second, you need to know how to evolve your model configuration, which is covered in course 3.
- Third, you choose the data of interest. If it is image data, you will want to finish course 4. If it is sequential data, then course 5. If it is tabular data, course 1 will already be sufficient. However, I am not suggesting that you can skip course 1 if you are only interested in image data; I would still recommend going through the courses in order.

Lastly, you might give yourself a small challenge even before you start DLS. I had a discussion with another learner in this forum. They had a simple dataset and trained two models with two methods: one was sklearn's SGDRegressor, and the other was their own gradient descent algorithm, which was very similar to MLS's. The learner came with some questions, but later the challenge became how to match the results from both methods, and this is what I want to tell you about. To match them, we need to know the model's hyper-parameters well, such as the learning rate. There are also other things we need to know that go beyond MLS but are certainly manageable. There could have been a lot more interesting discussion points if the learner had had time to keep going, but they had to stick to their plan, so we stopped. If you are interested in what we discussed and how the learner did it, you might read from the 4th post of the thread this link will bring you to.
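To give a flavour of the manual side of such a comparison, here is a minimal MLS-style batch gradient descent on a made-up dataset (this is my own sketch, not the learner's actual code). To line the sklearn side up with it, you would typically configure SGDRegressor with penalty=None and learning_rate='constant' so the hyper-parameters actually match; the exact matching recipe is an assumption on my part.

```python
# Minimal MLS-style batch gradient descent for y ~ w*x + b (toy data).

def gradient_descent(x, y, alpha=0.05, iters=5000):
    """Minimize J = (1/2m) * sum((w*x + b - y)**2) by gradient descent."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(iters):
        err = [w * xi + b - yi for xi, yi in zip(x, y)]
        dw = sum(e * xi for e, xi in zip(err, x)) / m  # dJ/dw
        db = sum(err) / m                              # dJ/db
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy data generated from y = 2x, so it should converge to w ~ 2, b ~ 0.
w, b = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w, b)
```

Running the sklearn counterpart on the same data and comparing the fitted coefficients is exactly the kind of exercise the discussion above was about.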

Thank you so much for your thoughtful response. Indeed you are right that the IBM course is less about ML and more about wider data science topics per se. I really appreciate your review of DLS. In fact, I am very new to data science / ML / Python, so everything seems so interesting to learn. It is very good advice to think first about which areas of work I am interested in pursuing further.

You just saved me from needing to write another discussion topic to ask "why does backprop save time, and when we say P (parameters), does it indicate only the w, or each node's parameter in the computation graph, like t, z, g, a?"

I have one more question: why does forward prop need to go through the whole chain of nodes once per parameter? Can't forward prop find each node's derivatives the same way backprop does, just in order from left to right?

Like here: if we do forward prop, we can find how much c changes when w changes, i.e. dc/dw, then da/dc, dd/da, dJ/dd, and then we can find dJ/dw?
Then no matter how the value of w changes, we can still get J?
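The idea in this question can indeed be coded up: carry each node's value together with its derivative with respect to w, left to right. This is known as forward-mode differentiation. The chain c → a → d → J below and its concrete functions are made up purely for illustration.

```python
# Forward-mode sketch: each step carries the pair (value, d(value)/dw).

def forward_mode(w, x):
    c, dc = w * x,  x            # c = w*x   -> dc/dw = x
    a, da = 2 * c,  2 * dc       # a = 2c    -> da/dw = 2 * dc/dw
    d, dd = a ** 2, 2 * a * da   # d = a**2  -> dd/dw = 2a * da/dw
    J, dJ = d + 1,  dd           # J = d + 1 -> dJ/dw = dd/dw
    return J, dJ

J, dJ = forward_mode(w=3.0, x=2.0)
print(J, dJ)  # 145.0 96.0, matching dJ/dw = 8*x**2*w analytically
```

Note that one such left-to-right pass yields dJ/dw for a single parameter only; with P parameters you would need P forward passes, which is exactly the N × P scaling discussed in this thread, and why the one backward sweep is preferred.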

These are my original words. I said "without backprop", but that does not mean what you suggest in your follow-up question; I didn't say that forward prop needs to do that.

However, your follow-up question makes a lot of sense:

Certainly you can do it that way, but I believe you would also agree that we are just shifting the same amount of work from the backprop phase to the forward prop phase. Even in your way, to calculate c = wx and to calculate dc/dw, we need to visit the node two times. Two calculations mean two visits; you can't do both at the same time. You get w and x and multiply them to get c. Then you get x to calculate dc/dw. Those are two visits. Whether you do dc/dw in the forward prop phase or in the backprop phase, there are always two visits.

So if forward prop and backprop both need two visits per node in order to get a node's derivative, then why do we need backprop at all, and what is its benefit?

Does he mean that without backprop, and without using forward prop for derivatives, we only use forward prop to keep recalculating the final output value each time an input x changes (with no derivative concept at all), and that is why it needs N times P steps?

I think the problem is that there were only two scenarios in your mind:

- forward prop, which calculates \hat{y}
- backprop, which calculates gradients; this can happen in the "backprop" phase or the "forward" phase

However, there is a third scenario, without any specific name, because we don't do it:

calculate the gradient of J with respect to one weight at a time, without reusing anything.
For example, we do the steps to calculate dJ/dw on a first piece of paper, then copy that dJ/dw result onto a whiteboard, then throw that first piece of paper into the rubbish bin. Then we get a new piece of paper and do all the steps again for dJ/db, without reusing anything we obtained while calculating dJ/dw.

Scenario 2 gives you N + P. Scenario 3 gives you N x P. When we try to understand the slow approach, we can't fill our mind with the fast approach, can we?

Now, given that we can calculate dc/dw in either the forward or the backward phase, why the backward phase? I think it's a matter of how you deliver a concept or a framework in a systematic way.

To be honest, it's not immediately useful to calculate dc/dw in the forward phase until you have dJ/dc, which is only available after completing the forward phase, because at the end of the day you want dJ/dw, not dc/dw.
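Spelled out with the chain rule (using c = wx as a single illustrative node), the dependency is:

\frac{\partial{J}}{\partial{w}} = \frac{\partial{J}}{\partial{c}} \cdot \frac{\partial{c}}{\partial{w}}

During the forward pass we can already compute \frac{\partial{c}}{\partial{w}} = x, but \frac{\partial{J}}{\partial{c}} depends on everything downstream of c, so it only becomes available once the forward pass has reached J and we sweep back. That is why the gradient computation naturally lives in the backward phase.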

I had never tried or thought about your approach, so I am not sure whether there would be other disadvantages. You can tell us more if you implement it.