Why should we divide dA2 (and dA1) by keep_prob instead of multiplying by it? In my opinion, dA2 is the partial derivative of J with respect to A2, so if A2 is divided by a number, dA2 should be multiplied by that same number. Why am I wrong?

That’s not how derivatives work, right? Differentiation is a linear operation, so constant factors carry straight through it. If we have:

h(x) = a * f(x) + b * g(x)

Where a and b are constants, then:

\displaystyle \frac {dh}{dx} = a * \frac {df}{dx} + b * \frac {dg}{dx}

Think about how that plays out when we apply the Chain Rule to calculate the gradients here.
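
If it helps to see it numerically, here is a minimal sketch that checks the claim with finite differences. Everything in it is an assumption made for illustration, not the course's code: the toy cost J = sum(w * A_dropped), the fixed mask, and the names `cost_after_dropout` and `grad_wrt_raw_activation` are all hypothetical. It only shows that when the forward pass computes A_dropped = A_raw * D / keep_prob, the gradient with respect to A_raw picks up the same factor D / keep_prob.

```python
import numpy as np

def cost_after_dropout(a_raw, mask, keep_prob, w):
    # Forward pass: inverted dropout, then a toy linear cost (an assumption for this demo).
    a_dropped = a_raw * mask / keep_prob
    return np.sum(w * a_dropped)

def grad_wrt_raw_activation(mask, keep_prob, w):
    # For the toy cost, dJ/dA_dropped is just w.
    d_a_dropped = w
    # Chain rule: A_dropped = (mask / keep_prob) * A_raw, so the constant
    # factor mask / keep_prob multiplies the upstream gradient unchanged.
    return d_a_dropped * mask / keep_prob

keep_prob = 0.8
a_raw = np.array([0.5, -1.2, 2.0, 0.3, 1.1])
w = np.array([1.0, 2.0, -0.5, 0.8, -1.5])
mask = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # fixed mask so the run is reproducible

analytic = grad_wrt_raw_activation(mask, keep_prob, w)

# Central finite differences on the forward pass, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(a_raw)
for i in range(a_raw.size):
    step = np.zeros_like(a_raw)
    step[i] = eps
    j_plus = cost_after_dropout(a_raw + step, mask, keep_prob, w)
    j_minus = cost_after_dropout(a_raw - step, mask, keep_prob, w)
    numeric[i] = (j_plus - j_minus) / (2 * eps)

# Multiplying by keep_prob instead is the alternative the question proposes.
wrong = w * mask * keep_prob

print("divide matches numeric:  ", np.allclose(analytic, numeric))
print("multiply matches numeric:", np.allclose(wrong, numeric))
```

The numeric gradient agrees with the divide-by-keep_prob version and not with the multiply version, which is just the linearity rule above applied with a = D / keep_prob.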