Derive backpropagation in CNN

Dear Mentors/classmates,

In backpropagation, our goal is to find dL/dW and dL/dB and then update the weights and biases using gradient descent. To calculate dL/dW, we first need dL/dZ so that we can apply the chain rule.

dL/dW = dL/dZ * dZ/dW ???

My first question: can we just apply the chain rule when W is a tensor (e.g. 3x3x3, f x f x 3)?

My second question:
when Z = A * x, where A is a matrix and x is a vector, dZ/dx = A, which is a matrix;
when Z = A * X, where Z, A, and X are all matrices, dZ/dX = A, which is also a matrix.

Yet when Z = W *cross X + B, W is NOT multiplied by X; it is cross-correlated with X. How can I express cross-correlation as a multiplication so that I can apply the chain rule? Or is there a special derivative rule for cross-correlation?


If you did not take the first course of this specialization, it may be worth at least quickly reviewing its lessons on backpropagation. There are lots of hints in there.

Your questions are basically about linear algebra, so I will start with a recap.

Scalar to Scalar:
x \in \mathbb{R}, \ y \in \mathbb{R} : A derivative is \frac{\partial y}{\partial x} \in \mathbb{R}

Vector to Scalar:
x \in \mathbb{R}^N,\ \ y \in \mathbb{R} : The derivative is the gradient. \frac{\partial y}{\partial x} \in \mathbb{R}^N, \ \ (\frac{\partial y}{\partial x})_n = \frac{\partial y}{\partial x_n}

Vector to Vector:
x \in \mathbb{R}^N,\ \ y \in \mathbb{R}^M : The derivative is the Jacobian. \frac{\partial y}{\partial x} \in \mathbb{R}^{N\times M}, \ \ (\frac{\partial y}{\partial x})_{n,m} = \frac{\partial y_m}{\partial x_n}
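
As a quick sanity check of the vector-to-vector case, here is a minimal NumPy sketch. It assumes the N x M layout written above, so for y = A x the Jacobian comes out as A transposed rather than A. The sizes, the random data, and the helper name `f` are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 2, 3
A = rng.standard_normal((M, N))   # y = A x maps R^N -> R^M
x = rng.standard_normal(N)

def f(x):
    """Hypothetical example function: a plain matrix-vector product."""
    return A @ x

# Build the Jacobian with central differences, using the layout
# J[n, m] = d y_m / d x_n  (shape N x M).
eps = 1e-6
J = np.zeros((N, M))
for n in range(N):
    step = np.zeros(N)
    step[n] = eps
    J[n, :] = (f(x + step) - f(x - step)) / (2 * eps)

# Chain rule for a scalar loss L = g . y, where g plays the role of dL/dy.
g = rng.standard_normal(M)
dL_dx = J @ g                     # shape (N,), same shape as x

print(np.allclose(J, A.T, atol=1e-6))
print(np.allclose(dL_dx, A.T @ g, atol=1e-6))
```

In this layout the chain rule reads dL/dx = (dy/dx) (dL/dy), and the result always has the shape of x. Other references use the transposed layout, so it is worth checking which one a given formula assumes.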

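In backprop, the loss L is a scalar, so there is no problem to start with: dL/dW always has the same shape as W, even when W is a tensor, and each entry is simply the partial derivative of L with respect to that entry of W.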
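
To make this concrete for the cross-correlation in the question, the chain rule can be written entry by entry. The following is a sketch under simplifying assumptions that are mine, not the course's notation: a single input channel, a single f x f filter, stride 1, no padding, and a scalar bias B.

```latex
% Forward pass: every output entry is an ordinary sum of products.
Z_{i,j} = \sum_{a=0}^{f-1} \sum_{b=0}^{f-1} W_{a,b} \, X_{i+a,\, j+b} + B

% Each output entry depends linearly on each weight.
\frac{\partial Z_{i,j}}{\partial W_{a,b}} = X_{i+a,\, j+b},
\qquad
\frac{\partial Z_{i,j}}{\partial B} = 1

% Chain rule: sum over every output entry that the weight touches.
\frac{\partial L}{\partial W_{a,b}}
  = \sum_{i,j} \frac{\partial L}{\partial Z_{i,j}} \, X_{i+a,\, j+b},
\qquad
\frac{\partial L}{\partial B}
  = \sum_{i,j} \frac{\partial L}{\partial Z_{i,j}}
```

Under these assumptions, dL/dW is itself a cross-correlation of X with dL/dZ, so no special theorem is needed: cross-correlation is just a sum of products, and the usual chain rule applies to each term.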

For derivatives of dot products, inner products, summations, etc., we can always break the expression down into individual elements and compute the partial derivatives, but there is also a good summary that I sometimes refer to: The Matrix Cookbook.
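
The element-by-element breakdown can also be checked numerically. Below is a minimal NumPy sketch, assuming a single-channel 2-D input, one square filter, stride 1, and no padding. The function names `cross_correlate` and `conv_backward` are hypothetical (they are not the course's assignment API), and G is a made-up stand-in for the upstream gradient dL/dZ.

```python
import numpy as np

def cross_correlate(X, W, b):
    """Valid cross-correlation of 2-D X with a square 2-D filter W, plus scalar bias b."""
    f = W.shape[0]
    out_h = X.shape[0] - f + 1
    out_w = X.shape[1] - f + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Z[i, j] = np.sum(W * X[i:i + f, j:j + f]) + b
    return Z

def conv_backward(X, W, dZ):
    """Accumulate dL/dW and dL/db entry by entry from dL/dZ."""
    f = W.shape[0]
    dW = np.zeros_like(W)
    for i in range(dZ.shape[0]):
        for j in range(dZ.shape[1]):
            # Z[i, j] depends on W only through the window X[i:i+f, j:j+f].
            dW += dZ[i, j] * X[i:i + f, j:j + f]
    db = dZ.sum()
    return dW, db

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
W = rng.standard_normal((3, 3))
b = 0.1
G = rng.standard_normal((3, 3))   # stand-in for dL/dZ, so L = sum(G * Z)

def loss(X, W, b):
    return np.sum(G * cross_correlate(X, W, b))

dW, db = conv_backward(X, W, G)

# Finite-difference check of every entry of dL/dW and of dL/db.
eps = 1e-6
dW_num = np.zeros_like(W)
for a in range(W.shape[0]):
    for c in range(W.shape[1]):
        Wp = W.copy(); Wp[a, c] += eps
        Wm = W.copy(); Wm[a, c] -= eps
        dW_num[a, c] = (loss(X, Wp, b) - loss(X, Wm, b)) / (2 * eps)
db_num = (loss(X, W, b + eps) - loss(X, W, b - eps)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-5), np.isclose(db, db_num, atol=1e-5))
```

The explicit loops are deliberately slow but mirror the per-element derivation. With several channels or filters the same accumulation would run over the extra axes.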

If you want to study the math behind backprop, this and this should be good starting points. They cover more than Andrew's intuitions.