Haha, it just doesn’t work there! It will not give you the correct answer. I believe the post that I shared earlier has a counter-example that showed it doesn’t work.

I have a feeling that my answer is not going to satisfy you… however, is it good to take something that is not even supposed to work and ask why doesn’t it work?

I suggest you to start with a neural network of 2 layers, and each layer has 2 neurons. Write down all the equations NOT in matrix form but in scalar form. Then derive the gradient formula for each scalar weight (not matrix weight). Finally, put all the scalar weight gradients back to a matrix in the right places, and you will see how you should order W^{[2]} and dz^{[2]} so that the matrix equation will be consistent with your scalar results.

It is a good thing to work with scalar equations and scalar weights because you can finally use the chain rule.

Cheers,

Raymond

Note that I have quoted in my previous linked post the following:

If you follow my suggestion, and when you put all the scalar weights back to a matrix, you will essentially be “collecting the derivative of each component of the dependent variable with respect to each component of the independent variable” just as what the wikipedia page is talking.