Gradient descent with Max Pooling (DLS C4 - Week 1)


In the video “Pooling Layers,” Andrew says that there are no parameters to learn in a pooling layer. I understand that, but I don’t understand how you would update the weights of the layers that come before a max pooling layer: the derivative of the max function is not defined everywhere, so during backward propagation it seems we couldn’t apply the chain rule once we run into a max pooling layer.

Could someone help me understand this?


It’s an important and perceptive question that I don’t think Prof Ng addresses in the lectures, although I confess it’s been a while since I watched them. But this is covered in the optional “back prop” section of the first assignment in C4 W1.

You’re correct that even though there are no parameters to update in the pooling layers themselves, we still have to propagate the gradients through those layers; otherwise learning in all the earlier layers would be disabled. How that works depends on whether it’s max pooling or average pooling. In the max pooling case, only one input element is passed through at each filter step, so the gradient flows back through the input that produced the maximum output at that step, and the other inputs receive no gradient. In the average pooling case, you do the intuitive thing: the gradient is split evenly and applied through each of the inputs equally.

Have a look at the optional ungraded back prop section of the C4 W1 Convolutional Model Step by Step exercise, which walks you through how to implement this. Of course we never end up actually using that code, since we switch to TensorFlow in the very next assignment and it handles all the back propagation for us automatically, using the techniques shown in the Step by Step assignment.
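To make the two cases concrete, here is a minimal NumPy sketch of the gradient routing for a single pooling window. This is my own illustration under the assumptions described above, not the assignment’s actual code: `max_pool_backward_window` and `avg_pool_backward_window` are hypothetical helper names.

```python
import numpy as np

def max_pool_backward_window(a_slice, dA):
    """Route the scalar gradient dA back through one max pooling window.

    a_slice: the input window seen on the forward pass.
    dA: gradient of the loss w.r.t. the pooled (scalar) output.
    Only the position(s) that held the maximum receive gradient.
    """
    mask = (a_slice == np.max(a_slice))  # 1 where the max was, 0 elsewhere
    return mask * dA

def avg_pool_backward_window(shape, dA):
    """Distribute the scalar gradient dA evenly over a window of `shape`."""
    return np.full(shape, dA / np.prod(shape))

window = np.array([[1.0, 3.0],
                   [2.0, 0.0]])

print(max_pool_backward_window(window, 5.0))
# gradient lands only on the position of the 3.0:
# [[0. 5.]
#  [0. 0.]]

print(avg_pool_backward_window((2, 2), 4.0))
# each input receives 4.0 / 4 = 1.0:
# [[1. 1.]
#  [1. 1.]]
```

The full backward pass just repeats this window-by-window, accumulating the results into the gradient array for the pooling layer’s input. Note that in the (rare) case of a tie for the maximum, the mask above would route the gradient to every tied position; frameworks differ in how they break such ties.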

Well, this answers my question. Thanks!