Element-wise multiplication or dot product in backpropagation

Hello there,

I am quite confused about when element-wise multiplication and when matrix multiplication (dot product) are used in the chain rule during backpropagation.

Let’s assume we have a shallow network with 2 inputs, one hidden layer with 3 units, and one output unit. We will also use Professor Ng’s notation: (A0, A1, A2) for the activations, (Z1, Z2) for the linear operations, (W1, W2) for the weight matrices, and J for the cost.

So, using the chain rule during backpropagation to compute the following equation:

$$\frac{\partial J}{\partial W1} = \frac{\partial J}{\partial A2} \cdot \frac{\partial A2}{\partial Z2} \cdot \frac{\partial Z2}{\partial A1} \cdot \frac{\partial A1}{\partial Z1} \cdot \frac{\partial Z1}{\partial W1}$$

How do we know whether each operation in the chain rule is element-wise or a dot product?

I ask because when I compute all these partial derivatives by hand and then check that the shapes match on both sides of the equation, they do not match if I use only matrix multiplications.
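
For concreteness, here are the shapes I am assuming (with $m$ training examples stored as columns, following the course convention):

$$A0 \in \mathbb{R}^{2 \times m},\quad W1 \in \mathbb{R}^{3 \times 2},\quad Z1, A1 \in \mathbb{R}^{3 \times m},\quad W2 \in \mathbb{R}^{1 \times 3},\quad Z2, A2 \in \mathbb{R}^{1 \times m},\quad \frac{\partial J}{\partial W1} \in \mathbb{R}^{3 \times 2}$$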

The Chain Rule deals with the composition of functions, so how the derivatives are handled depends on what those functions are. In some cases they involve dot products (the linear steps, e.g. $Z1 = W1 \cdot A0 + b1$) and in some cases they are element-wise operations, e.g. the activation functions. So, for example, $\frac{\partial A1}{\partial Z1}$ is just the derivative of the layer 1 activation function, which was applied element-wise.
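
To make that concrete, here is a minimal NumPy sketch of forward and backward propagation for the 2 → 3 → 1 network described above. It assumes sigmoid activations in both layers and the cross-entropy cost (those are my assumptions for illustration, not something stated in the question), and uses `@` for matrix products and `*` for element-wise products:

```python
import numpy as np

np.random.seed(0)
m = 5                                  # number of examples (columns)
A0 = np.random.randn(2, m)             # inputs, shape (2, m)
Y = np.random.randint(0, 2, (1, m))    # labels, shape (1, m)

W1 = np.random.randn(3, 2) * 0.01      # (3, 2)
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3) * 0.01      # (1, 3)
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: the linear steps are matrix products,
# the activations are applied element-wise.
Z1 = W1 @ A0 + b1                      # (3, m)
A1 = sigmoid(Z1)                       # (3, m)
Z2 = W2 @ A1 + b2                      # (1, m)
A2 = sigmoid(Z2)                       # (1, m)

# Backward pass.
dZ2 = A2 - Y                           # (1, m)  dJ/dA2 * dA2/dZ2 simplifies to this for sigmoid + cross-entropy
dW2 = (1 / m) * dZ2 @ A1.T             # (1, 3)  matrix product
dA1 = W2.T @ dZ2                       # (3, m)  matrix product: dZ2/dA1 involves W2
dZ1 = dA1 * A1 * (1 - A1)              # (3, m)  element-wise: the activation derivative is applied element-wise
dW1 = (1 / m) * dZ1 @ A0.T             # (3, 2)  matrix product

print(dW1.shape, W1.shape)             # both (3, 2), so the shapes match
```

Notice that the only element-wise products are the ones coming from the activation functions (the $\frac{\partial A}{\partial Z}$ factors); the factors involving the weight or activation matrices show up as matrix products, and `dW1` ends up with the same shape as `W1`.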

This is beyond the scope of this course: Prof Ng does not really cover the underlying calculus. Here’s a thread with lots of links to supplementary material about the mathematics of backpropagation.


Thank you for your valuable reply. I’ll go through all the resources you shared.