Hello there,
I am quite confused about when element-wise multiplication versus matrix multiplication (dot-product style) is used in the chain rule during backpropagation.
Let’s assume we have a shallow network with 2 inputs, one hidden layer with 3 units, and one output unit. We will also use Professor Ng’s notation: (A0, A1, A2) for the activations, (Z1, Z2) for the linear operations, (W1, W2) for the weight matrices, and J for the cost.
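To make the shapes explicit, here is how I am writing the forward pass (assuming the m training examples are stacked as columns, as in the course):

$$
\begin{aligned}
Z1 &= W1\,A0 + b1, &\quad W1&: (3, 2),\; A0: (2, m),\; Z1: (3, m)\\
A1 &= g(Z1), & A1&: (3, m)\\
Z2 &= W2\,A1 + b2, & W2&: (1, 3),\; Z2: (1, m)\\
A2 &= \sigma(Z2), & A2&: (1, m)
\end{aligned}
$$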
So, when using the chain rule during backpropagation to compute:

$$
\frac{\partial J}{\partial W1} = \frac{\partial J}{\partial A2} \cdot \frac{\partial A2}{\partial Z2} \cdot \frac{\partial Z2}{\partial A1} \cdot \frac{\partial A1}{\partial Z1} \cdot \frac{\partial Z1}{\partial W1}
$$

how do we know whether each multiplication in the chain is element-wise or a matrix (dot) product?
I ask because when I compute all these partial derivatives by hand and then check that the shapes match on both sides of the equation, it does not work out with matrix multiplications alone.
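To make the shape problem concrete, here is a minimal NumPy sketch of what I mean (the batch size m = 5 and the tanh/sigmoid activations are just my own choices for illustration):

```python
import numpy as np

np.random.seed(0)

m = 5                        # batch size, my own choice just for illustration
A0 = np.random.randn(2, m)   # 2 input features, examples stacked as columns
W1 = np.random.randn(3, 2)   # hidden layer: 3 units
W2 = np.random.randn(1, 3)   # output layer: 1 unit

Z1 = W1 @ A0                 # (3, m)
A1 = np.tanh(Z1)             # (3, m)
Z2 = W2 @ A1                 # (1, m)
A2 = 1 / (1 + np.exp(-Z2))   # (1, m)  sigmoid output

# Shapes of the individual chain-rule factors, as I compute them by hand:
#   dJ/dA2  -> (1, m)
#   dA2/dZ2 -> sigmoid'(Z2), taken element-wise, so (1, m)
#   dZ2/dA1 -> W2, (1, 3)
#   dA1/dZ1 -> g'(Z1) = 1 - A1**2 for tanh, element-wise, so (3, m)
#   dZ1/dW1 -> A0, (2, m)
#
# Chaining these with matrix products only, e.g.
#   (1, m) @ (1, m) @ (1, 3) @ (3, m) @ (2, m),
# the inner dimensions do not even line up, and the final result
# should have the shape of W1, i.e. (3, 2).
print(A0.shape, W1.shape, W2.shape, A2.shape)
```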