I'm not clear on why the backprop for the max function works. It seems to simply broadcast each value in dA across the window, creating an array with the same value in every cell, and then add that to the dA_prev slice. Why is a max mask even needed in the first place?
The max value is computed over each window (i.e. each smaller portion) of the input, not across the entire input. So when backpropagating, the gradient should flow only to the entry that held the maximum within that window. Please see create_mask_from_window, which returns a mask with 1s only at the positions where the value equals the window's maximum.
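To make this concrete, here is a minimal sketch of what such a helper can look like (the exact implementation in the assignment may differ; the example window values are illustrative):

```python
import numpy as np

def create_mask_from_window(x):
    """Return a mask that is True only where x equals its maximum.

    Comparing against np.max(x) marks every position holding the
    max, so ties are all marked.
    """
    return x == np.max(x)

window = np.array([[1.0, 3.0],
                   [2.0, 3.0]])
mask = create_mask_from_window(window)
# mask is True at the two positions holding the value 3.0
```

Because the mask is boolean, multiplying it by a scalar treats True as 1 and False as 0, which is exactly what the backward pass exploits.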
Thanks very much. Yes, I realized later that the elementwise multiply by the mask ensures the gradient ends up only at the position the mask marks as the max. I'm still not entirely clear why that is sufficient, but I can see that knowing the position and value of the max in each window helps determine which inputs actually produced the outputs, which is useful for training. Will research more.
I may just be misinterpreting your wording there, but note that there are no dot products involved here; the multiplication is elementwise. Look carefully at how the operation between the mask Balaji points out and the gradient value works: we multiply the gradient for that output position (a scalar) by the mask, then add the result to the region of the input that was the source of that output on forward propagation. The mask is zero at every position that does not correspond to the maximum input, so only the input elements equal to the maximum receive any gradient.
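The step described above can be sketched as follows, for a single window. This is an illustration, not the assignment's code; the array values and the names a_prev_slice, dA_value, and dA_prev_slice are hypothetical:

```python
import numpy as np

def create_mask_from_window(x):
    # True only where x equals the window's maximum
    return x == np.max(x)

# A hypothetical 3x3 input slice and the scalar gradient for the
# single max-pool output it produced on forward propagation.
a_prev_slice = np.array([[1.0, 2.0, 0.0],
                         [4.0, 9.0, 3.0],
                         [5.0, 6.0, 7.0]])
dA_value = 2.0                          # gradient for this output
dA_prev_slice = np.zeros_like(a_prev_slice)

mask = create_mask_from_window(a_prev_slice)
# Elementwise product: the scalar gradient is routed only to the
# position of the max (here, the 9.0); all other positions get 0.
dA_prev_slice += mask * dA_value
```

So even though dA_value is a single scalar "broadcast" over the window, the zeros in the mask kill it everywhere except at the max position, which is why the mask is needed.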